Training with Data Lineage¶
Instead of downloading an entire dataset and writing scripts to filter it locally, you can use TQL queries directly on a managed table to create focused training subsets. The same query always produces the same view at a given commit, giving you reproducible data lineage for every experiment.
Objective¶
Ingest a COCO-style dataset into Deeplake, use TQL to select only the categories relevant to a driving scenario, and train a Faster R-CNN detector on that filtered subset, all without downloading or duplicating data.
Prerequisites¶
- Deeplake SDK: pip install deeplake
- Deep learning framework: pip install torch torchvision
- A Deeplake API token. Set your credentials before running the code below.
Complete Code¶
import torch
import torchvision
from torch.utils.data import DataLoader

from deeplake import Client

# 1. Setup
client = Client()

# 2. Ingest COCO from HuggingFace (skip if already ingested)
client.ingest("coco_train", {"_huggingface": "detection-datasets/coco"})

# 3. Open the table and filter with TQL
ds = client.open_table("coco_train")
view = ds.query(
    "SELECT * WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', 'traffic light', 'stop sign')"
)

# 4. Verify the filtered subset
result = client.query(
    "SELECT COUNT(*) as count FROM coco_train "
    "WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', "
    "'traffic light', 'stop sign')"
)
print(f"Training on {result} samples")

# 5. Create a DataLoader from the filtered view
train_loader = DataLoader(
    view.pytorch(),
    batch_size=16,
    shuffle=True,
    num_workers=4,
)

# 6. Train Faster R-CNN on the subset
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.to(device)
model.train()
optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

for epoch in range(3):
    for i, batch in enumerate(train_loader):
        images = [img.to(device) for img in batch["image"]]
        targets = [
            {"boxes": t["boxes"].to(device), "labels": t["labels"].to(device)}
            for t in batch["annotations"]
        ]
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        if i % 50 == 0:
            print(f"Epoch {epoch}, Batch {i}, Loss: {losses.item():.4f}")
Step-by-Step Breakdown¶
1. Ingest the Dataset¶
You can load COCO directly from HuggingFace with a single call. Deeplake stores the data in a managed table. No local download required.
If you already have COCO annotations locally, use the CocoPanoptic format instead:
from deeplake.managed.formats import CocoPanoptic

client.ingest("coco_train", format=CocoPanoptic(
    images_dir="path/to/images",
    masks_dir="path/to/masks",
    annotations="path/to/annotations.json",
))
2. Query and Filter with TQL¶
Open the table and run a TQL query to create a filtered view. This does not copy or move any data. It returns a lightweight view that points to the matching rows.
ds = client.open_table("coco_train")
view = ds.query(
    "SELECT * WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', 'traffic light', 'stop sign')"
)
You can verify the size of your subset with a count query:
result = client.query(
    "SELECT COUNT(*) as count FROM coco_train "
    "WHERE category IN "
    "('car', 'truck', 'bus', 'motorcycle', 'bicycle', "
    "'traffic light', 'stop sign')"
)
print(result)
3. Train on the Filtered Subset¶
Pass the filtered view directly to a PyTorch DataLoader. Deeplake streams only the matching rows, so the training loop never touches irrelevant data.
From here, the training loop is standard PyTorch; nothing about it is Deeplake-specific.
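One practical note: object-detection batches contain variable-size images and per-image target dicts, so PyTorch's default collate function (which stacks same-shaped tensors) can fail on them. If you hit shape errors, a list-preserving collate_fn is the usual fix. The sketch below assumes each sample yielded by view.pytorch() is a dict with "image" and "annotations" keys, matching the training loop above; adjust the keys to your schema:

```python
from torch.utils.data import DataLoader


def detection_collate(samples):
    """Keep images and targets as lists instead of stacking them into one tensor."""
    return {
        "image": [s["image"] for s in samples],
        "annotations": [s["annotations"] for s in samples],
    }


# Passed via the collate_fn argument, e.g.:
# train_loader = DataLoader(view.pytorch(), batch_size=16, shuffle=True,
#                           collate_fn=detection_collate)
```

Because each image stays a separate tensor, images of different resolutions can share a batch without resizing.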
4. Track Lineage¶
Every TQL query is deterministic at a given dataset commit. This means:
- Reproducibility: re-running the same query on the same commit returns exactly the same rows.
- Auditability: you can log the query string alongside your model checkpoint to record which data was used.
- Iteration: changing the filter (e.g., adding 'pedestrian') creates a new view instantly, without re-downloading anything.
No external metadata store or DVC-style tracking is needed. The query is the lineage.
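Logging the query alongside a checkpoint takes only a few lines of standard PyTorch. The helper below is an illustrative sketch, not part of the Deeplake API: the function name and checkpoint keys are hypothetical, and a tiny model stands in for the trained detector so the example is self-contained.

```python
import torch


def save_checkpoint(model, path, tql_query, table="coco_train"):
    """Save model weights together with the TQL query that selected the training data."""
    torch.save(
        {
            "model_state": model.state_dict(),
            "tql_query": tql_query,  # the query string is the lineage record
            "dataset": table,        # add a commit id here if your workflow pins one
        },
        path,
    )
```

Reloading the checkpoint later gives you both the weights and the exact filter that produced the training set, which is all you need to rebuild the same view at the same commit.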
What to try next¶
- GPU-Streaming Pipeline: optimize throughput for large-scale training.
- Massive Ingestion: prepare large-scale datasets for training.
- Image Classification Training: a simpler single-label training example.