Training with Data Lineage

Instead of downloading an entire dataset and writing scripts to filter it locally, you can use TQL queries directly on a managed table to create focused training subsets. The same query always produces the same view at a given commit, giving you reproducible data lineage for every experiment.

Objective

Ingest a COCO-style dataset into Deeplake, use TQL to select only the categories relevant to a driving scenario, and train a Faster R-CNN detector on that filtered subset, all without downloading or duplicating data.

Prerequisites

  • Deeplake SDK: pip install deeplake
  • Deep Learning framework: pip install torch torchvision
  • A Deeplake API token.

Set credentials first

export DEEPLAKE_API_KEY="your-token-here"
export DEEPLAKE_WORKSPACE="your-workspace"  # optional, defaults to "default"

Complete Code

import torch
import torchvision
from torch.utils.data import DataLoader
from deeplake import Client

# 1. Setup
client = Client()

# 2. Ingest COCO from HuggingFace (skip if already ingested)
client.ingest("coco_train", {"_huggingface": "detection-datasets/coco"})

# 3. Open the table and filter with TQL
ds = client.open_table("coco_train")
view = ds.query(
    "SELECT * WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', 'traffic light', 'stop sign')"
)

# 4. Verify the filtered subset
result = client.query("SELECT COUNT(*) as count FROM coco_train "
                      "WHERE category IN "
                      "('car', 'truck', 'bus', 'motorcycle', 'bicycle', "
                      "'traffic light', 'stop sign')")
print(f"Training on {result} samples")

# 5. Create a DataLoader from the filtered view
train_loader = DataLoader(
    view.pytorch(),
    batch_size=16,
    shuffle=True,
    num_workers=4,
)

# 6. Train Faster R-CNN on the subset
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.to(device)
model.train()

optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

for epoch in range(3):
    for i, batch in enumerate(train_loader):
        images = [img.to(device) for img in batch["image"]]
        targets = [
            {"boxes": t["boxes"].to(device), "labels": t["labels"].to(device)}
            for t in batch["annotations"]
        ]

        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        if i % 50 == 0:
            print(f"Epoch {epoch}, Batch {i}, Loss: {losses.item():.4f}")

Step-by-Step Breakdown

1. Ingest the Dataset

You can load COCO directly from HuggingFace with a single call. Deeplake stores the data in a managed table; no local download is required.

client.ingest("coco_train", {"_huggingface": "detection-datasets/coco"})

If you already have COCO annotations locally, use the CocoPanoptic format instead:

from deeplake.managed.formats import CocoPanoptic

client.ingest("coco_train", format=CocoPanoptic(
    images_dir="path/to/images",
    masks_dir="path/to/masks",
    annotations="path/to/annotations.json",
))

2. Query and Filter with TQL

Open the table and run a TQL query to create a filtered view. This does not copy or move any data. It returns a lightweight view that points to the matching rows.

ds = client.open_table("coco_train")
view = ds.query(
    "SELECT * WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', 'traffic light', 'stop sign')"
)

You can verify the size of your subset with a count query:

result = client.query("SELECT COUNT(*) as count FROM coco_train "
                      "WHERE category IN "
                      "('car', 'truck', 'bus', 'motorcycle', 'bicycle', "
                      "'traffic light', 'stop sign')")
print(result)
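The same category list appears verbatim in both queries above. If you iterate on the filter, keeping the list in one Python variable and building the IN clause from it avoids the two copies drifting apart. The helper below is a plain string utility, not part of the Deeplake SDK, and it assumes category names contain no single quotes (safe here, since these are fixed labels rather than user input):

```python
# One source of truth for the filter, shared by both queries.
DRIVING_CATEGORIES = [
    "car", "truck", "bus", "motorcycle",
    "bicycle", "traffic light", "stop sign",
]

def category_filter(categories):
    """Build a TQL IN clause from a list of category names.

    Assumes names contain no single quotes; no escaping is done.
    """
    quoted = ", ".join(f"'{c}'" for c in categories)
    return f"category IN ({quoted})"

select_query = f"SELECT * WHERE {category_filter(DRIVING_CATEGORIES)}"
count_query = (
    f"SELECT COUNT(*) as count FROM coco_train "
    f"WHERE {category_filter(DRIVING_CATEGORIES)}"
)
```

Adding a category (say, 'pedestrian') then means changing one list instead of editing two query strings.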

3. Train on the Filtered Subset

Pass the filtered view directly to a PyTorch DataLoader. Deeplake streams only the matching rows, so the training loop never touches irrelevant data.

train_loader = DataLoader(
    view.pytorch(),
    batch_size=16,
    shuffle=True,
    num_workers=4,
)
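One practical note: PyTorch's default collation stacks tensors, which fails when images in a batch have different sizes or rows carry different numbers of boxes. If view.pytorch() yields one sample per row rather than pre-collated batches, you may need to pass a list-preserving collate_fn. The sketch below is a plain Python function, not a Deeplake API, and it assumes each sample is a dict with "image" and "annotations" keys, matching how the training loop indexes the batch:

```python
# Keep each field as a Python list instead of stacking into a single
# tensor, so variable image sizes and variable box counts are fine.
# The training loop above already iterates over batch["image"] and
# batch["annotations"], so this batch shape drops straight in.
def detection_collate(samples):
    return {
        "image": [s["image"] for s in samples],
        "annotations": [s["annotations"] for s in samples],
    }

# Usage: DataLoader(view.pytorch(), batch_size=16, shuffle=True,
#                   num_workers=4, collate_fn=detection_collate)
```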

From here, the training loop is standard PyTorch; nothing in it is Deeplake-specific.

4. Track Lineage

Every TQL query is deterministic at a given dataset commit. This means:

  • Reproducibility: re-running the same query on the same commit returns exactly the same rows.
  • Auditability: you can log the query string alongside your model checkpoint to record which data was used.
  • Iteration: changing the filter (e.g., adding 'pedestrian') creates a new view instantly, without re-downloading anything.

No external metadata store or DVC-style tracking is needed. The query is the lineage.
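To make the audit trail concrete, one lightweight pattern is to write the query string (and, if your workflow exposes one, the dataset commit id) into a JSON sidecar file next to each checkpoint. This is a minimal sketch using only the standard library; the file layout and field names are illustrative, not a Deeplake convention:

```python
import json

def write_lineage(path, table, query, commit_id=None):
    """Record which data a checkpoint was trained on.

    commit_id is optional: pass it if your workflow tracks dataset
    commits, otherwise the query string alone is still reproducible
    against a pinned table state.
    """
    record = {"table": table, "query": query, "commit": commit_id}
    with open(path, "w") as f:
        json.dump(record, f, indent=2)
    return record

record = write_lineage(
    "fasterrcnn_epoch3.lineage.json",
    table="coco_train",
    query=(
        "SELECT * WHERE category IN ('car', 'truck', 'bus', "
        "'motorcycle', 'bicycle', 'traffic light', 'stop sign')"
    ),
)
```

Reloading the sidecar later tells you exactly which rows to re-materialize for a repeat run or an audit.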

What to try next