Training Object Detection in PyTorch¶
Deeplake makes it easy to train object detection models by streaming annotation-rich datasets directly into your PyTorch training loop. This tutorial shows how to ingest COCO, build a detection-ready DataLoader, and fine-tune a Faster R-CNN model.
Objective¶
Ingest the COCO dataset into a Deeplake managed table and train a Faster R-CNN ResNet-50 FPN model using PyTorch, streaming data directly from the cloud.
Prerequisites¶
- Deeplake SDK: pip install deeplake
- PyTorch with torchvision: pip install torch torchvision
- A Deeplake API token (set your credentials before running the code below).
- COCO 2017 dataset files (or use the HuggingFace shortcut below).
Complete Code¶
import torch
from torch.utils.data import DataLoader
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

from deeplake import Client

# 1. Setup
client = Client()

# Option A: From local COCO files
from deeplake.managed.formats import CocoPanoptic

client.ingest(
    "coco_train",
    format=CocoPanoptic(
        images_dir="coco/train2017",
        masks_dir="coco/panoptic_train2017",
        annotations="coco/annotations/panoptic_train2017.json",
    ),
)

# Option B: From HuggingFace (one-liner alternative)
# client.ingest("coco_train", {"_huggingface": "detection-datasets/coco"})

# 2. Open the table and create a DataLoader
ds = client.open_table("coco_train")

def collate_fn(batch):
    """Object detection requires variable-length targets per image,
    so we bypass the default stack-based collation."""
    return tuple(zip(*batch))

train_loader = DataLoader(
    ds.pytorch(),
    batch_size=8,
    shuffle=True,
    num_workers=4,
    collate_fn=collate_fn,
)

# 3. Define the model
num_classes = 91  # COCO has 80 categories + background (standard mapping uses ids up to 90)
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# 4. Training loop
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005
)

model.train()
for epoch in range(3):
    epoch_loss = 0.0
    for i, (images, targets) in enumerate(train_loader):
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        epoch_loss += losses.item()
        if i % 50 == 0:
            print(f"Epoch {epoch}, Batch {i}, Loss: {losses.item():.4f}")
    print(f"Epoch {epoch} complete - Avg Loss: {epoch_loss / (i + 1):.4f}")

torch.save(model.state_dict(), "fasterrcnn_coco.pth")
print("Model saved.")
Step-by-Step Breakdown¶
1. Ingest COCO Dataset¶
Deeplake supports ingesting COCO directly via the CocoPanoptic format, which handles images, masks, and annotations in one call:
from deeplake.managed.formats import CocoPanoptic

client.ingest(
    "coco_train",
    format=CocoPanoptic(
        images_dir="coco/train2017",
        masks_dir="coco/panoptic_train2017",
        annotations="coco/annotations/panoptic_train2017.json",
    ),
)
If you don't have COCO downloaded locally, you can ingest it from HuggingFace in a single line:
client.ingest("coco_train", {"_huggingface": "detection-datasets/coco"})
Both approaches create a managed table with image, bounding box, category, and mask columns ready for training.
2. Create a Detection-Ready DataLoader¶
Object detection models expect a list of images and a list of per-image target dictionaries (each containing boxes, labels, etc.). Since the number of objects varies per image, the default PyTorch collation (which tries to stack tensors) will fail. The standard fix is a custom collate_fn:
def collate_fn(batch):
    """Object detection requires variable-length targets per image,
    so we bypass the default stack-based collation."""
    return tuple(zip(*batch))
This returns (images_tuple, targets_tuple) instead of attempting to stack variable-length annotations into a single tensor.
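To see the transposition concretely, here is a minimal sketch on plain Python objects (strings and lists stand in for the image tensors and target dicts a real batch would hold):

```python
# Stand-ins for (image, target) pairs with different numbers of objects per image.
batch = [
    ("img0", {"boxes": [[0, 0, 10, 10]], "labels": [1]}),                    # 1 object
    ("img1", {"boxes": [[5, 5, 20, 20], [1, 2, 3, 4]], "labels": [2, 3]}),  # 2 objects
]

def collate_fn(batch):
    # Transpose the batch: a list of (image, target) pairs becomes
    # one tuple of images and one tuple of per-image target dicts.
    return tuple(zip(*batch))

images, targets = collate_fn(batch)
print(images)                     # ('img0', 'img1')
print(len(targets[1]["boxes"]))   # 2 -- each image keeps its own annotation count
```

No stacking happens, so images of different sizes and annotation counts pass through untouched; the model consumes the lists directly.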
3. Define the Model¶
We start from a pretrained Faster R-CNN with a ResNet-50 FPN backbone and replace the classification head to match the COCO category count:
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
The pretrained backbone provides strong feature extraction out of the box; only the box predictor head is re-initialized for your target classes.
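Why 91 and not 81? Torchvision's Faster R-CNN indexes classes by raw label id, with id 0 reserved for background, and COCO's 80 categories use non-contiguous ids from 1 to 90 (some ids are unused). A small sketch with a hypothetical subset of the category list illustrates the arithmetic:

```python
# Hypothetical subset of a COCO annotation file's "categories" entry;
# the real file lists all 80 categories with ids spread over 1..90.
categories = [
    {"id": 1, "name": "person"},
    {"id": 3, "name": "car"},
    {"id": 90, "name": "toothbrush"},  # the highest id in the standard mapping
]

# The head needs one output per possible id, plus id 0 for background.
max_id = max(c["id"] for c in categories)
num_classes = max_id + 1
print(num_classes)  # 91
```

Alternatively you can remap category ids to a contiguous 1..80 range and use num_classes = 81, as long as you apply the same mapping when interpreting predictions.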
4. Training Loop¶
Faster R-CNN computes its own loss internally when given both images and targets. The model returns a dictionary of losses (loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg) which are summed for backpropagation:
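The aggregation step can be sketched with placeholder values (plain floats stand in for the scalar tensors the model actually returns; the numbers are illustrative, not real losses):

```python
# A dictionary shaped like the one Faster R-CNN returns in training mode.
loss_dict = {
    "loss_classifier": 0.52,
    "loss_box_reg": 0.31,
    "loss_objectness": 0.08,
    "loss_rpn_box_reg": 0.04,
}

# Summing the components yields the single scalar used for backpropagation.
total = sum(loss_dict.values())
print(round(total, 2))  # 0.95
```

In the real loop, total is a tensor, so calling total.backward() propagates gradients through all four loss terms at once.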
What to try next¶
- GPU-Streaming Pipeline: learn more about Deeplake's streaming architecture.
- Massive Ingestion: prepare large-scale datasets for training.
- MMDetection Integration: use Deeplake with the MMDetection framework.