Data Annotation Infrastructure: Building a Scalable Pipeline with CVAT and B2 Object Storage


Every computer vision model is a reflection of the data it was trained on. The precision of the labels, the consistency across annotators, the coverage of edge cases. Get the data right and the model performs. Get it wrong and no amount of architecture or compute will compensate.

AI-assisted annotation tools have made it possible to label at a speed and scale that was unthinkable a few years ago. Teams that used to spend weeks on manual labeling now generate annotations automatically and refine them. That acceleration opens up real opportunity, but it also raises the bar for the infrastructure underneath the annotation pipeline. The architecture needs to keep pace with the volume of data now moving through it.

We have spent the last decade building AI workflows and one pattern shows up consistently: the teams that treat annotation as infrastructure from day one outperform the ones that bolt it on later. The difference is not just tooling. It is the architectural decisions underneath, particularly around storage, that determine whether an annotation operation scales or stalls.

Annotation within the ML pipeline

A production ML pipeline spans data ingestion, preprocessing, training, serving, and monitoring. Annotation sits within the data layer, but it is the stage where several consequential storage decisions converge: how raw data is stored, who can access it, how long it is retained, and how it flows between labeling, training, and evaluation. Getting storage right at this layer strengthens every stage downstream.

Data gravity. A single autonomous driving project can produce terabytes of camera and LiDAR data before a single label is applied. This data needs to live somewhere durable and accessible before, during, and after annotation. It rarely moves once it lands.

Collaboration. Labeling teams may be internal, external, or a mix. Quality reviewers need the same data access as annotators. The data layer has to support concurrent access across roles and geographies without creating bottlenecks or redundant copies.

Lifecycle persistence. Labeled datasets are reused across training runs, refined as models improve, and versioned as labeling standards evolve. Storage needs to retain raw data alongside annotations for months or years.

Regulatory constraints. In healthcare, automotive, and defense, access controls around training data are subject to compliance requirements. Encryption, scoped credentials, and auditability are non-negotiable.

The teams that design for these requirements upfront build data operations that scale cleanly across the entire pipeline.

CVAT as the annotation layer

CVAT (Computer Vision Annotation Tool) started as an internal tool at Intel in 2017, was open sourced in 2018, and spun out as CVAT.ai Corporation in 2022. Millions of people use it today, and for good reason.

Annotation breadth. CVAT covers the full range of label types: bounding boxes, polygons, polylines, keypoints, skeletons, cuboids, brush-based masks, and tags. It handles images, video, and 3D point clouds natively, so teams working across object detection, segmentation, and pose estimation stay on one platform instead of stitching separate tools together.

AI-assisted labeling. In addition to built-in integrations like SAM 3 and YOLO, CVAT supports custom models through its AI Agents framework, which lets teams plug their own inference endpoints directly into the labeling workflow. For video, SAM 2-powered tracking propagates annotations across frames. The shift from manual annotation to review-and-correct workflows is where the real throughput gains happen.

Deployment flexibility. CVAT ships in three editions: Community (free, self-hosted), Online (managed SaaS), and Enterprise (on-premises with SSO, RBAC, and audit logging). You can start hosted and move to self-hosted as data governance needs evolve.

Pipeline integration. CVAT exposes a REST API with a Python SDK and CLI, so annotation tasks can be created, populated, and exported programmatically. For teams building CI/CD-style retraining loops, this is what makes CVAT a pipeline component rather than a standalone manual step.
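As a sketch of what that programmatic integration looks like, the snippet below builds a task spec and creates a task through the Python SDK's `make_client` / `create_from_data` interface. The host, credentials, task name, and label set are illustrative placeholders, and the exact spec fields accepted may vary by CVAT version, so treat this as a starting point rather than a definitive recipe.

```python
# Sketch: creating a CVAT annotation task programmatically via the Python SDK.
# Assumes the `cvat-sdk` package and a reachable CVAT instance; host,
# credentials, and labels below are illustrative placeholders.

def build_task_spec(name, label_names):
    """Task spec payload for CVAT's task-creation endpoints: a name plus labels."""
    return {
        "name": name,
        "labels": [{"name": label} for label in label_names],
    }

def create_annotation_task(host, user, password, spec, image_paths):
    """Create a task and upload local frames to it (requires a live server)."""
    from cvat_sdk import make_client  # third-party; imported here so the sketch stays importable

    with make_client(host=host, credentials=(user, password)) as client:
        return client.tasks.create_from_data(spec=spec, resources=image_paths)
```

In a retraining loop, a pipeline step would call `build_task_spec("street-scenes-batch-01", ["car", "pedestrian", "cyclist"])` and hand the result to `create_annotation_task` whenever new raw frames land, with no one clicking through the UI.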

Backblaze B2: The storage layer that compounds

Annotation tools get the attention, but the storage layer is where the architecture compounds over time. CVAT supports native cloud storage integration through S3-compatible buckets, and Backblaze B2 plugs directly into that connector. Configure your B2 endpoint, bucket name, and application key credentials, and CVAT treats the bucket as native cloud storage.
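That attachment can also be scripted. The sketch below builds the payload for CVAT's cloud storage endpoint and posts it with the standard library. The field names follow CVAT's /api/cloudstorages endpoint as commonly documented, but the exact payload shape should be checked against your CVAT version's API docs; the bucket, keys, endpoint, and token are placeholders.

```python
# Sketch: registering a B2 bucket with CVAT's cloud storage API.
# Payload fields are an assumption based on CVAT's /api/cloudstorages
# endpoint; verify against your deployment's API schema.
import json
import urllib.request

def b2_cloudstorage_payload(bucket, key_id, app_key, endpoint):
    """Payload describing an S3-compatible (B2) bucket for CVAT."""
    return {
        "provider_type": "AWS_S3_BUCKET",          # B2 attaches via the S3 connector
        "resource": bucket,                        # bucket name
        "display_name": bucket,
        "credentials_type": "KEY_SECRET_KEY_PAIR",
        "key": key_id,                             # B2 application key ID
        "secret_key": app_key,                     # B2 application key
        "specific_attributes": f"endpoint_url={endpoint}",
    }

def register_cloud_storage(cvat_host, token, payload):
    """POST the payload to a live CVAT server (requires network and auth)."""
    req = urllib.request.Request(
        f"{cvat_host}/api/cloudstorages",
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Token {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```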

This architecture decouples compute from storage. CVAT handles annotation logic while B2 owns durability and access, and you can scale, migrate, or replace either independently. Because B2 is S3-compatible, other stages of the ML pipeline, from training scripts to data validation to orchestration, access the same data directly from the same bucket CVAT writes to. No intermediate exports. No dataset copies drifting out of sync.
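To make the shared-bucket idea concrete, here is a minimal sketch of a downstream training script reading the same B2 bucket CVAT writes to, through B2's S3-compatible API. The endpoint URL, credentials, and object URIs are placeholders, and `boto3` is a third-party dependency assumed to be installed.

```python
# Sketch: a training or validation step reading directly from the B2 bucket
# CVAT writes to, via the S3-compatible endpoint. No intermediate export.

def split_s3_uri(uri):
    """Split 's3://bucket/prefix/key' into (bucket, key) for client calls."""
    if not uri.startswith("s3://"):
        raise ValueError(f"not an s3 uri: {uri}")
    bucket, _, key = uri[len("s3://"):].partition("/")
    return bucket, key

def fetch_object(uri, endpoint, key_id, app_key):
    """Download one object from B2 through the S3-compatible endpoint."""
    import boto3  # third-party; imported here so the sketch stays importable

    s3 = boto3.client(
        "s3",
        endpoint_url=endpoint,            # your B2 region's S3 endpoint
        aws_access_key_id=key_id,         # B2 application key ID
        aws_secret_access_key=app_key,    # B2 application key
    )
    bucket, key = split_s3_uri(uri)
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```

Because the only B2-specific detail is the `endpoint_url`, the same code would run against any S3-compatible store, which is exactly what makes the compute and storage layers independently replaceable.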

Retention is where the storage decision pays off most. Annotation datasets have long lifecycles, and footprints accumulate fast across concurrent projects. B2’s storage economics let teams hold large datasets across the full model development lifecycle without cost becoming the limiting factor. This is especially relevant for video-heavy projects and multi-sensor datasets where raw data runs into tens of terabytes.
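One way to manage that long-lived footprint is B2's bucket lifecycle rules: keep raw captures and final annotations indefinitely, and let superseded intermediate exports age out. The field names below follow B2's documented lifecycle rule shape; the prefix and day counts are illustrative assumptions for this sketch.

```python
# Sketch: a B2 lifecycle rule that cycles out intermediate exports while
# raw data and final annotations (no rule attached) are kept indefinitely.
# Field names follow B2's lifecycle rule schema; values are illustrative.

def export_cleanup_rule(prefix, days_to_hide, days_to_delete):
    """Lifecycle rule: hide files under `prefix` after upload, then delete them."""
    return {
        "fileNamePrefix": prefix,
        "daysFromUploadingToHiding": days_to_hide,
        "daysFromHidingToDeleting": days_to_delete,
    }

# Only intermediate exports get a rule; everything else stays untouched.
rules = [export_cleanup_rule("exports/intermediate/", 30, 7)]
```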

Access control matters too. B2 application keys can be scoped to individual buckets or file prefixes with granular permissions and optional expiration, so access boundaries stay clean across annotators, labeling services, and downstream training pipelines. The CVAT integration guide for Backblaze B2 walks through the full setup, and the CVAT cloud storage documentation covers access permissions, manifest files, and endpoint routing.
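As an illustration of that scoping, the sketch below builds a request body for B2's b2_create_key native API, granting an external labeling vendor read-only access to a single prefix with an expiration. The capability list, prefix, and duration are illustrative; check the b2_create_key reference for the authoritative field set.

```python
# Sketch: a scoped B2 application key request so an annotation vendor can
# read one prefix and nothing else. Field names follow B2's b2_create_key
# native API; capabilities, prefix, and duration are illustrative.

def scoped_key_request(account_id, key_name, bucket_id, prefix, valid_seconds):
    """Request body for b2_create_key limiting a key to one bucket prefix."""
    return {
        "accountId": account_id,
        "keyName": key_name,
        "capabilities": ["listFiles", "readFiles"],  # read-only for reviewers
        "bucketId": bucket_id,                       # scope: a single bucket
        "namePrefix": prefix,                        # scope: a single prefix
        "validDurationInSeconds": valid_seconds,     # optional expiration
    }
```

A training pipeline would get a separate key with its own scope, so revoking a vendor's access never touches the rest of the system.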

Building the pipeline that scales into what’s next

The teams building the best computer vision models are not just choosing better algorithms. They are investing in the annotation infrastructure that feeds those algorithms: the right labeling platform, the right storage architecture, and the right cost structure to sustain it all as data grows.

This becomes even more critical as the field moves toward world models. NVIDIA’s Cosmos platform has already been downloaded over two million times. World Labs launched Marble for commercial 3D world generation. DeepMind’s Genie 3 produces interactive 3D environments in real time. Yann LeCun left Meta to start AMI Labs with the explicit goal of building AI systems that understand physics, not just predict text. These systems need training data that goes far beyond today’s labeled images: synchronized multi-sensor captures, physics-aware video, dense 3D point cloud annotations. The data volumes and annotation complexity will dwarf what most teams work with today, and the infrastructure underneath will need to handle it.

The annotation pipeline you build now is the one that will carry you into that future. Both layers are worth getting right early.

You can get started with CVAT at cvat.ai and with Backblaze B2 at backblaze.com/cloud-storage.

About Jeronimo De Leon

Jeronimo De Leon is a seasoned product management leader with over 10 years of experience driving AI-driven innovation across enterprise and startup environments. Currently serving as Senior Product Manager, AI at Backblaze, he leads the development of AI/ML features, focuses on how Backblaze enhances the AI data lifecycle for customers' MLOps architectures, and implements AI tools and agents to optimize internal operations.