
Today’s AI models consume far more than text: product images, video from surveillance feeds, audio from customer calls, and metadata spread across an ever-expanding set of systems. These multimodal datasets drive everything from computer vision pipelines to customer service automation. But as they scale, the underlying infrastructure starts to creak.
Costs can become unpredictable. Data fragments across S3 buckets, HDFS clusters, and local drives. Maintaining cross-modal alignment, i.e., ensuring that media files stay linked to their labels, embeddings, and annotations, becomes a bottleneck that slows development to a crawl.

This article outlines a practical path forward: how to migrate multimodal training data using proven open-source tools, and how Pixeltable helps unify and index that data for training once it lands in Backblaze B2.
Moving multimodal training data: Practical open source software (OSS) tools that do the heavy lifting
Before you can train on consolidated data, you need to get it all into one place. These three open-source tools handle the migration work, each addressing a different piece of the puzzle.
Apache NiFi for moving large media reliably
When your dataset includes terabytes of video files, thousands of high-resolution images, or large binary assets like LIDAR scans, you need something more robust than a shell script. Apache NiFi is purpose-built for moving large media files at scale.
NiFi provides:
- Flow control and retry logic that handle network interruptions gracefully, which is essential when transferring terabytes of data over hours or days.
- Data provenance tracking that records exactly which files moved where and when, making it possible to debug issues without guessing.
- A visual workflow designer that lets you build and monitor data flows without writing custom code.
For multimodal datasets where media volume dominates, NiFi ensures files arrive intact and trackable. Check the Apache NiFi User Guide to get started with building your first data flow.
Airbyte for syncing structured and semi-structured metadata
Media files are only half the story. Annotations, labels, captions, transcripts, and database records provide the context that makes raw media useful for training. Airbyte excels at moving this structured and semi-structured metadata.
Airbyte handles:
- Schema consistency when pulling metadata from multiple sources, ensuring annotation formats don’t drift between your labeling platform, your CRM, and your feature store.
- Incremental syncs that only transfer changed records, avoiding unnecessary data movement as your datasets grow.
- Multiple data systems via a broad catalog of connectors for databases, SaaS platforms, file formats, and cloud storage services.
Unlike NiFi, which focuses on raw file movement, Airbyte understands data schemas and transformations. Use it to keep your metadata in sync across systems. The Airbyte documentation provides setup guides for most common data sources.
lakeFS for versioning and reproducible training
After moving media via NiFi and metadata via Airbyte, you need a way to snapshot the entire dataset so you can reproduce training runs six months later. lakeFS brings Git-like version control to object storage.
lakeFS enables:
- Branching and snapshots of entire datasets without copying data. You can create a branch, run an experiment, and merge or discard the results.
- Atomic commits that ensure media, metadata, and derived features stay aligned as your corpus evolves.
- Zero-copy clones that let multiple teams work on isolated versions of production data without storage overhead.
lakeFS acts as a version control layer on top of storage like Backblaze B2, tracking changes without duplicating objects. When a training run produces a new model, you can tag the exact dataset version that went into it. The lakeFS quickstart guide walks through creating your first repository and branch.
After migration, the hard part begins: Making the dataset usable
Moving data into object storage solves logistics, not usability. Even in B2, your media files, labels, and derived features remain scattered—images in one prefix, annotations in another, embeddings in a third. Training code becomes a tangle of custom loaders that stitch everything together, break when datasets change, and consume more engineering time than model tuning.
Where Pixeltable fits
Pixeltable provides the missing layer between migrated storage and training-ready data. It’s a declarative data infrastructure specifically designed for multimodal AI applications.
Here’s what Pixeltable does:
- Unifies media and metadata into a single table interface: images, video frames, audio clips, and their associated labels, embeddings, and annotations live in one queryable structure.
- Stores computed results automatically. Run OCR on documents, generate CLIP embeddings for images, or extract audio transcripts once, and Pixeltable caches the results for reuse.
- References Backblaze B2 objects directly without copying data. Files stay in Backblaze B2, and Pixeltable maintains pointers and metadata in a local Postgres instance. Pixeltable automatically caches the files locally on access, and can write media files back to B2 (see our project for examples: https://github.com/backblaze-b2-samples/b2-pixeltable-multimodal-data).
- Supports built-in transforms like embedding generation, image captioning, and OCR with lazy evaluation. Define transformations once, and they run incrementally as new data arrives.
Instead of maintaining custom loaders and indexing scripts, you define a schema once. Pixeltable handles orchestration, caching, and queries. The result is a training dataset you can slice, filter, and feed directly into PyTorch DataLoaders or Hugging Face Datasets.
Check the Pixeltable documentation to see how tables, computed columns, and queries work in practice.
A practical end-to-end workflow
Here’s how these tools fit together in a real-world pipeline:
1. Move media via NiFi → Backblaze B2
Set up an Apache NiFi flow to transfer images, video files, or other large binaries from your current storage (on-premises NAS, another cloud provider, or local drives) to a Backblaze B2 bucket. Configure retry logic and provenance tracking so you can verify every file arrived.
Use NiFi processors like GetFile, PutS3Object, and RouteOnAttribute to handle file movement and error routing. The Backblaze B2 Cloud Storage S3-compatible API works seamlessly with NiFi’s S3 processors.
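NiFi gives you these behaviors declaratively, but the two key ideas, B2’s S3-compatible endpoint and retry with backoff, are easy to see in a short sketch. This is a conceptual illustration, not a replacement for NiFi; the region name is a placeholder, and the endpoint string is what you’d put in PutS3Object’s Endpoint Override property:

```python
import time

def b2_s3_endpoint(region: str) -> str:
    """B2's S3-compatible endpoints follow s3.<region>.backblazeb2.com."""
    return f"https://s3.{region}.backblazeb2.com"

def with_retries(fn, attempts: int = 5, base_delay: float = 1.0):
    """Retry fn with exponential backoff -- the behavior NiFi's flow
    control and retry relationships give you without custom code."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Point any S3 client (boto3, or NiFi's PutS3Object "Endpoint Override")
# at the B2 endpoint for your bucket's region:
endpoint = b2_s3_endpoint("us-west-004")
```

For multi-hour transfers, this retry pattern (and NiFi’s equivalent) is what turns transient network failures into non-events instead of restarted migrations.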
2. Sync metadata via Airbyte
Configure Airbyte to pull annotations, labels, captions, and database records from your labeling tool, feature store, or other sources. Set up connections to sync metadata incrementally as it changes. If annotations live in Postgres and captions come from a cloud-based labeling platform, Airbyte normalizes both into a consistent schema in Backblaze B2 or a dedicated metadata store.
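Airbyte’s incremental mode works by tracking a cursor field such as `updated_at` and transferring only records past the cursor. The following is a conceptual sketch of that mechanism in plain Python (not Airbyte’s API; the record shapes are invented for illustration):

```python
from datetime import datetime, timezone

def incremental_sync(records, cursor):
    """Return records newer than the cursor, plus the advanced cursor.
    This mirrors Airbyte's cursor-based incremental sync mode."""
    changed = [r for r in records if r["updated_at"] > cursor]
    new_cursor = max((r["updated_at"] for r in changed), default=cursor)
    return changed, new_cursor

annotations = [
    {"id": 1, "label": "cat", "updated_at": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": 2, "label": "dog", "updated_at": datetime(2024, 3, 1, tzinfo=timezone.utc)},
]
# Only the record updated after the cursor moves on this sync.
changed, cursor = incremental_sync(
    annotations, datetime(2024, 2, 1, tzinfo=timezone.utc)
)
```

In practice Airbyte persists the cursor between runs for you, which is why re-syncing a growing annotation table stays cheap.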
3. Create a lakeFS branch to snapshot the dataset
Initialize a lakeFS repository pointing to your Backblaze B2 bucket. Create a branch to isolate this version of the dataset. If something goes wrong during training, you can roll back or compare versions. Use the lakeFS CLI or Python client to create branches and commits programmatically.
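A minimal sketch of the branch-and-commit step, assuming the high-level `lakefs` Python SDK and a running lakeFS server backed by your B2 bucket; the repository and branch names are placeholders:

```python
def lakefs_uri(repo: str, ref: str, path: str = "") -> str:
    """Build a lakefs:// URI -- the addressing scheme lakeFS uses for
    repositories, branches, and objects."""
    return f"lakefs://{repo}/{ref}/{path}".rstrip("/")

def snapshot_dataset(repo_name: str, branch_name: str) -> None:
    # Assumes `pip install lakefs` and a configured lakeFS server;
    # names are placeholders for your own repository layout.
    import lakefs

    repo = lakefs.repository(repo_name)
    branch = repo.branch(branch_name).create(source_reference="main")
    # ... point NiFi/Airbyte jobs at this branch's path, then commit ...
    branch.commit(message="Snapshot: media via NiFi + metadata via Airbyte")

if __name__ == "__main__":
    snapshot_dataset("training-data", "experiment-2024-q3")
```

Because the branch is zero-copy, the snapshot costs no extra storage; rolling back is just discarding the branch.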
4. Define a Pixeltable schema referencing B2 objects + synced metadata
In Pixeltable, create a table with columns for image paths (pointing to Backblaze B2), labels, captions, and any other metadata fields. Import your data so each row represents one training example: one image, its label, its caption, and any associated metadata. Pixeltable doesn’t copy image files; it stores references and metadata, automatically caching the files locally on access. The images stay in Backblaze B2. The Pixeltable Tables guide explains how to create tables with multimodal column types and import data from external sources.
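A minimal sketch of this step using the Pixeltable Python API. The bucket, table, and column names are placeholders, and the setup assumes `pip install pixeltable`:

```python
def b2_object_uri(bucket: str, key: str) -> str:
    """Media reference via B2's S3-compatible layer; Pixeltable stores
    this URI rather than copying the file."""
    return f"s3://{bucket}/{key}"

def build_table():
    # Assumes a Pixeltable installation; table name is a placeholder.
    import pixeltable as pxt

    t = pxt.create_table(
        "examples",
        {"image": pxt.Image, "label": pxt.String, "caption": pxt.String},
    )
    # One row per training example; the image column holds a B2 reference.
    t.insert([{
        "image": b2_object_uri("my-training-bucket", "images/cat_0001.jpg"),
        "label": "cat",
        "caption": "A cat sitting outdoors",
    }])
    return t

if __name__ == "__main__":
    build_table()
```

The key design point: each row ties a media reference to its labels and captions, so alignment is a property of the schema rather than of naming conventions across prefixes.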
5. Run transforms (embeddings, captions, OCR) inside Pixeltable
Define computed columns for embeddings, captions, or OCR results. Pixeltable’s computed columns run transformations lazily as data is queried or when you explicitly trigger computation.
For example, you can add CLIP embeddings using Pixeltable’s built-in Hugging Face integration, or generate AI captions using OpenAI’s vision API. Once defined, these columns compute incrementally—new images trigger automatic processing without reprocessing the entire dataset.
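To show the computed-column mechanics without the weight of a real model, here is a sketch that attaches a trivial word-count transform as a computed column. It assumes the table named `examples` from step 4 exists with a `caption` column (placeholder names); the Pixeltable docs cover swapping in CLIP embeddings or OCR the same way:

```python
def word_count(caption: str) -> int:
    """Trivial transform standing in for heavier ones such as embedding
    generation, captioning, or OCR."""
    return len(caption.split())

def add_computed_columns():
    # Assumes an existing Pixeltable table "examples" with a `caption`
    # column; names are placeholders.
    import pixeltable as pxt

    t = pxt.get_table("examples")
    # Register the plain function as a UDF and attach it as a computed
    # column: it runs once per row, then incrementally for new rows only.
    caption_words = pxt.udf(word_count)
    t.add_computed_column(caption_words=caption_words(t.caption))

if __name__ == "__main__":
    add_computed_columns()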
The Pixeltable API reference documents all available functions for common operations like embedding generation, image processing, and text analysis.
6. Query or filter the unified dataset
Use Pixeltable’s query interface to filter, sort, and slice your data. For example, find all images labeled “cat” with embeddings similar to a reference image. Or extract rows where captions mention “outdoor” and timestamps fall within a specific range.
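A sketch of that kind of query, again against the placeholder `examples` table from step 4. The `filter_rows` helper is the plain-Python equivalent of the predicate, included to make the semantics concrete:

```python
def filter_rows(rows, label, keyword):
    """Plain-Python equivalent of the predicate: keep rows with a given
    label whose caption mentions a keyword."""
    return [
        r for r in rows
        if r["label"] == label and keyword in r["caption"].lower()
    ]

def query_examples():
    # Assumes the Pixeltable table "examples" from step 4 (placeholder
    # name); the expression compiles into a query over the unified data.
    import pixeltable as pxt

    t = pxt.get_table("examples")
    return t.where(t.label == "cat").collect()

if __name__ == "__main__":
    print(query_examples())
```

Similarity search against a reference image works the same way once an embedding index is defined; see the Pixeltable docs for the index setup.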
7. Feed batches directly into PyTorch/Hugging Face
Export data from Pixeltable into PyTorch DataLoaders or Hugging Face Datasets format for training. Pixeltable handles batching, shuffling, and data access so your training loop stays clean.
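Pixeltable documents its framework integrations; as a fallback, a hand-rolled bridge is small. The sketch below wraps collected rows in a map-style dataset, the `__len__`/`__getitem__` protocol that `torch.utils.data.DataLoader` accepts (row shapes are placeholders matching the earlier schema):

```python
class RowsDataset:
    """Map-style dataset over a list of row dicts (e.g., the output of a
    Pixeltable .collect()). Implements the __len__/__getitem__ protocol
    that torch.utils.data.DataLoader expects."""

    def __init__(self, rows):
        self.rows = list(rows)

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, idx):
        row = self.rows[idx]
        return row["image"], row["label"]

rows = [
    {"image": "s3://my-training-bucket/images/cat_0001.jpg", "label": "cat"},
    {"image": "s3://my-training-bucket/images/dog_0001.jpg", "label": "dog"},
]
ds = RowsDataset(rows)
# In training code (assumes torch is installed):
#   from torch.utils.data import DataLoader
#   loader = DataLoader(ds, batch_size=32, shuffle=True)
```

Because the dataset yields B2 references plus metadata, the loading transform (decode, resize, tensorize) stays in one place in the training loop.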
The Pixeltable documentation covers various export formats and integrations with popular ML frameworks, allowing you to avoid intermediate export steps and maintain a streamlined workflow from data preparation to model training.
From fragmented storage to production-ready training data
Multimodal AI datasets don’t have to be a maintenance nightmare. By chaining together proven open-source tools—NiFi and Airbyte for migration, lakeFS for versioning, and Pixeltable for unified access—you can turn scattered files and metadata into queryable training assets.
Once data lands in Backblaze B2, this stack eliminates the custom glue code, brittle loaders, and alignment issues that typically slow down training workflows. Your team gets reproducible datasets, clean interfaces, and more time for model development instead of infrastructure firefighting.
Ready to get started? Check out the Backblaze B2 documentation to set up your object storage, and explore Pixeltable’s examples to see multimodal workflows in action.