Data Formats

Format classes standardize domain-specific datasets into column-oriented batches for ingestion. Use them when your data has a specific structure (COCO annotations, robotics episodes, etc.) that needs custom processing.

How formats work

A format class has up to four methods:

Method           Required  Purpose
normalize()      Yes       Yields dict batches ({column: [values]})
schema()         No        Declares Deeplake storage types ("BINARY", "TEXT")
pg_schema()      No        Declares PostgreSQL domain types ("IMAGE", "SEGMENT_MASK")
image_columns()  No        Lists columns for automatic thumbnail generation

Pass a format instance to client.ingest():

client.ingest("my_table", data, format=MyFormat())
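As a minimal sketch of the normalize() contract, a format yields column-oriented dicts, one per batch; the column names here ("label", "score") are illustrative only, not part of any real schema:

```python
def normalize():
    # normalize() is a generator: each yielded dict maps column names
    # to equal-length lists, with the same columns in every batch.
    yield {"label": ["cat", "dog"], "score": [0.9, 0.4]}
    yield {"label": ["bird"], "score": [0.7]}

batches = list(normalize())
```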

Built-in formats

CocoPanoptic

Processes COCO panoptic segmentation datasets. Generates columns for image data, segmentation masks (PNG), dimensions, and segment metadata as JSON.

from deeplake.managed.formats import CocoPanoptic

client.ingest("panoptic", {
    "path": ["coco_panoptic/"]
}, format=CocoPanoptic())

Coco

Handles COCO detection and caption annotations. Generates combined segmentation masks, bounding boxes, captions, and category definitions.

from deeplake.managed.formats import Coco

client.ingest("detection", {
    "path": ["coco_detection/"]
}, format=Coco())

LeRobot

Three-table design for robotic learning datasets:

  • Tasks: task indices and descriptions
  • Frames: per-frame state/action vectors with episode and task references
  • Episodes: episode-level metadata with video streams from multiple cameras

from deeplake.managed.formats import LeRobot

client.ingest("robot_data", {
    "path": ["lerobot_dataset/"]
}, format=LeRobot())
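As a rough illustration of the three-table split, the batches below sketch what each table might carry; the column names are hypothetical assumptions, not the actual LeRobot schema:

```python
# Hypothetical column layouts for the three tables. The real LeRobot
# format defines its own columns, so treat every name here as an
# assumption for illustration only.
tasks_batch = {
    "task_index": [0, 1],
    "description": ["pick up cube", "open drawer"],
}
frames_batch = {
    "episode_index": [0, 0],
    "task_index": [0, 0],
    "state": [[0.1, 0.2], [0.15, 0.25]],  # per-frame state vectors
    "action": [[0.0, 1.0], [0.5, 0.5]],   # per-frame action vectors
}
episodes_batch = {
    "episode_index": [0],
    "num_frames": [2],
}
```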

Custom formats

Create your own format class for domain-specific data:

from deeplake.managed.formats import Format

class CsvWithImages(Format):
    def __init__(self, csv_path, image_dir):
        self.csv_path = csv_path
        self.image_dir = image_dir

    def schema(self):
        return {"image": "BINARY"}

    def pg_schema(self):
        return {"image": "IMAGE"}

    def image_columns(self):
        return ["image"]

    def normalize(self):
        import csv
        with open(self.csv_path) as f:
            reader = csv.DictReader(f)
            batch = {"label": [], "image": []}
            for row in reader:
                # Read the raw image bytes referenced by the CSV row
                with open(f"{self.image_dir}/{row['filename']}", "rb") as img:
                    batch["image"].append(img.read())
                batch["label"].append(row["label"])
                # Yield in fixed-size batches to bound memory use
                if len(batch["label"]) >= 100:
                    yield batch
                    batch = {"label": [], "image": []}
            # Flush any remaining rows in the final partial batch
            if batch["label"]:
                yield batch

client.ingest("labeled_images", {}, format=CsvWithImages("labels.csv", "images/"))

Guidelines for normalize()

  • Yield dicts with column names as keys and equal-length lists as values
  • Keep consistent column names across all batches
  • Batch 100-1000 rows per yield to manage memory
  • Raise exceptions for critical errors, log warnings for recoverable ones
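The batching guideline can be factored into a small helper that turns row dicts into column-oriented batches; columnar_batches is a hypothetical utility for illustration, not part of the library:

```python
def columnar_batches(rows, columns, batch_size=500):
    """Convert an iterable of row dicts into column-oriented batches.

    Every yielded dict has the same column names, each mapping to
    equal-length lists of at most batch_size values.
    """
    batch = {col: [] for col in columns}
    for row in rows:
        for col in columns:
            batch[col].append(row[col])  # a missing column is a critical error
        if len(batch[columns[0]]) >= batch_size:
            yield batch
            batch = {col: [] for col in columns}
    if batch[columns[0]]:  # flush the final partial batch
        yield batch

rows = [{"label": str(i), "value": i} for i in range(7)]
batches = list(columnar_batches(rows, ["label", "value"], batch_size=3))
```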

Next steps

  • Data Types: full type reference and schema inference
  • Ingestion: client.ingest() API and chunking strategies