Training with Data Lineage¶
Instead of downloading an entire dataset and writing scripts to filter it locally, you can use TQL queries directly on a managed table to create focused training subsets. The same query always produces the same view at a given commit, giving you reproducible data lineage for every experiment.
Objective¶
Ingest a COCO-style dataset into Deeplake, use TQL to select only the categories relevant to a driving scenario, and train a Faster R-CNN detector on that filtered subset, all without downloading or duplicating data.
Prerequisites¶
- Deeplake SDK: pip install deeplake
- Deep learning framework: pip install torch torchvision
- A Deeplake API token. Set your credentials before running the code below.
Complete Code¶
import torch
import torchvision
from torch.utils.data import DataLoader

from deeplake import Client

# 1. Setup
client = Client()

# 2. Ingest COCO from HuggingFace (skip if already ingested)
client.ingest("coco_train", {"_huggingface": "detection-datasets/coco"})

# 3. Open the table and filter with TQL
ds = client.open_table("coco_train")
view = ds.query(
    "SELECT * WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', 'traffic light', 'stop sign')"
)

# 4. Verify the filtered subset
result = client.query(
    "SELECT COUNT(*) as count FROM coco_train "
    "WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', "
    "'traffic light', 'stop sign')"
)
print(f"Training on {result} samples")

# 5. Create a DataLoader from the filtered view
train_loader = DataLoader(
    view.pytorch(),
    batch_size=16,
    shuffle=True,
    num_workers=4,
)

# 6. Train Faster R-CNN on the subset
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.to(device)
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

for epoch in range(3):
    for i, batch in enumerate(train_loader):
        images = [img.to(device) for img in batch["image"]]
        targets = [
            {"boxes": t["boxes"].to(device), "labels": t["labels"].to(device)}
            for t in batch["annotations"]
        ]
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        if i % 50 == 0:
            print(f"Epoch {epoch}, Batch {i}, Loss: {losses.item():.4f}")
Step-by-Step Breakdown¶
1. Ingest the Dataset¶
You can load COCO directly from HuggingFace with a single call. Deeplake stores the data in a managed table. No local download required.
If you already have COCO annotations locally, use the CocoPanoptic format instead:
from deeplake.managed.formats import CocoPanoptic

client.ingest("coco_train", format=CocoPanoptic(
    images_dir="path/to/images",
    masks_dir="path/to/masks",
    annotations="path/to/annotations.json",
))
2. Query and Filter with TQL¶
Open the table and run a TQL query to create a filtered view. This does not copy or move any data. It returns a lightweight view that points to the matching rows.
ds = client.open_table("coco_train")
view = ds.query(
    "SELECT * WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', 'traffic light', 'stop sign')"
)
You can verify the size of your subset with a count query:
result = client.query(
    "SELECT COUNT(*) as count FROM coco_train "
    "WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', "
    "'traffic light', 'stop sign')"
)
print(result)
3. Train on the Filtered Subset¶
Pass the filtered view directly to a PyTorch DataLoader. Deeplake streams only the matching rows, so the training loop never touches irrelevant data.
From here, the training loop is standard PyTorch; nothing about it is Deeplake-specific.
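One practical note: object-detection batches contain variable-size images and per-image target dicts, so PyTorch's default collate function (which stacks same-shaped tensors) can fail on them. If you hit shape errors, a list-preserving collate_fn is the usual fix. The sketch below assumes each sample yielded by view.pytorch() is a dict with "image" and "annotations" keys, matching the training loop above; adjust the keys to your schema:

```python
from torch.utils.data import DataLoader


def detection_collate(samples):
    """Keep images and targets as lists instead of stacking them into one tensor."""
    return {
        "image": [s["image"] for s in samples],
        "annotations": [s["annotations"] for s in samples],
    }


# Passed via the collate_fn argument, e.g.:
# train_loader = DataLoader(view.pytorch(), batch_size=16, shuffle=True,
#                           collate_fn=detection_collate)
```

Because each image stays a separate tensor, images of different resolutions can share a batch without resizing.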
4. Track Lineage¶
Every TQL query is deterministic at a given dataset commit. This means:
- Reproducibility: re-running the same query on the same commit returns exactly the same rows.
- Auditability: you can log the query string alongside your model checkpoint to record which data was used.
- Iteration: changing the filter (e.g., adding 'pedestrian') creates a new view instantly, without re-downloading anything.
No external metadata store or DVC-style tracking is needed. The query is the lineage.
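Logging the query alongside a checkpoint takes only a few lines of standard PyTorch. The helper below is an illustrative sketch, not part of the Deeplake API: the function name and checkpoint keys are hypothetical, and a tiny model stands in for the trained detector so the example is self-contained.

```python
import torch


def save_checkpoint(model, path, tql_query, table="coco_train"):
    """Save model weights together with the TQL query that selected the training data."""
    torch.save(
        {
            "model_state": model.state_dict(),
            "tql_query": tql_query,  # the query string is the lineage record
            "dataset": table,        # add a commit id here if your workflow pins one
        },
        path,
    )
```

Reloading the checkpoint later gives you both the weights and the exact filter that produced the training set, which is all you need to rebuild the same view at the same commit.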
What to try next¶
- GPU-Streaming Pipeline: optimize throughput for large-scale training.
- Massive Ingestion: prepare large-scale datasets for training.
- Image Classification Training: a simpler single-label training example.