Data Formats¶
Format classes standardize domain-specific datasets into column-oriented batches for ingestion. Use them when your data has a specific structure (COCO annotations, robotics episodes, etc.) that needs custom processing.
How formats work¶
A format class has up to four methods:
| Method | Required | Purpose |
|---|---|---|
| `normalize()` | Yes | Yields dict batches (`{column: [values]}`) |
| `schema()` | No | Declares Deeplake storage types (`"BINARY"`, `"TEXT"`) |
| `pg_schema()` | No | Declares PostgreSQL domain types (`"IMAGE"`, `"SEGMENT_MASK"`) |
| `image_columns()` | No | Lists columns for automatic thumbnail generation |
Pass a format instance to `client.ingest()`, as shown in the examples below.
Built-in formats¶
CocoPanoptic¶
Processes COCO panoptic segmentation datasets. Generates columns for image data, segmentation masks (PNG), dimensions, and segment metadata as JSON.
```python
from deeplake.managed.formats import CocoPanoptic

client.ingest("panoptic", {
    "path": ["coco_panoptic/"]
}, format=CocoPanoptic())
```
Coco¶
Handles COCO detection and caption annotations. Generates combined segmentation masks, bounding boxes, captions, and category definitions.
```python
from deeplake.managed.formats import Coco

client.ingest("detection", {
    "path": ["coco_detection/"]
}, format=Coco())
```
LeRobot¶
Three-table design for robotic learning datasets:
- Tasks: task indices and descriptions
- Frames: per-frame state/action vectors with episode and task references
- Episodes: episode-level metadata with video streams from multiple cameras
```python
from deeplake.managed.formats import LeRobot

client.ingest("robot_data", {
    "path": ["lerobot_dataset/"]
}, format=LeRobot())
```
Custom formats¶
Create your own format class for domain-specific data:
```python
import csv

from deeplake.managed.formats import Format

class CsvWithImages(Format):
    def __init__(self, csv_path, image_dir):
        self.csv_path = csv_path
        self.image_dir = image_dir

    def schema(self):
        return {"image": "BINARY"}

    def pg_schema(self):
        return {"image": "IMAGE"}

    def image_columns(self):
        return ["image"]

    def normalize(self):
        with open(self.csv_path) as f:
            reader = csv.DictReader(f)
            batch = {"label": [], "image": []}
            for row in reader:
                with open(f"{self.image_dir}/{row['filename']}", "rb") as img:
                    batch["image"].append(img.read())
                batch["label"].append(row["label"])
                if len(batch["label"]) >= 100:
                    yield batch
                    batch = {"label": [], "image": []}
            if batch["label"]:
                yield batch
```

```python
client.ingest("labeled_images", {}, format=CsvWithImages("labels.csv", "images/"))
```
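Because `normalize()` is an ordinary generator, you can sanity-check a custom format locally by iterating its output before ingesting. The `validate_batches` helper below is hypothetical, shown only to illustrate the batch shape the ingestion pipeline expects:

```python
def validate_batches(batches):
    """Check that every batch has the same columns and equal-length value lists."""
    columns = None
    total_rows = 0
    for batch in batches:
        if columns is None:
            columns = set(batch)
        # Column names must stay consistent from batch to batch
        assert set(batch) == columns, "inconsistent column names across batches"
        # Within a batch, every column must hold the same number of values
        lengths = {len(values) for values in batch.values()}
        assert len(lengths) == 1, "columns in a batch must have equal length"
        total_rows += lengths.pop()
    return total_rows

# Works with in-memory batches or with my_format.normalize()
batches = [
    {"label": ["cat", "dog"], "image": [b"\x89PNG", b"\x89PNG"]},
    {"label": ["bird"], "image": [b"\x89PNG"]},
]
print(validate_batches(batches))  # → 3
```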
Guidelines for normalize()¶
- Yield dicts with column names as keys and equal-length lists as values
- Keep consistent column names across all batches
- Batch 100-1000 rows per yield to manage memory
- Raise exceptions for critical errors, log warnings for recoverable ones
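The batching pattern these guidelines describe can be sketched as a small generator. The `batched` helper below is a hypothetical illustration, not part of the library:

```python
def batched(rows, columns, batch_size=500):
    """Yield {column: [values]} dicts of at most batch_size rows each."""
    batch = {c: [] for c in columns}
    for row in rows:
        for c in columns:
            batch[c].append(row[c])
        # Flush once the batch reaches the target size
        if len(batch[columns[0]]) >= batch_size:
            yield batch
            batch = {c: [] for c in columns}
    # Emit the final partial batch, if any
    if batch[columns[0]]:
        yield batch

rows = [{"label": i % 2, "value": i} for i in range(1200)]
batches = list(batched(rows, ["label", "value"], batch_size=500))
# three batches of 500, 500, and 200 rows
```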
Next steps¶
- Data Types: full type reference and schema inference
- Ingestion: `client.ingest()` API and chunking strategies