
Training Object Detection in PyTorch

Deeplake makes it easy to train object detection models by streaming annotation-rich datasets directly into your PyTorch training loop. This tutorial shows how to ingest COCO, build a detection-ready DataLoader, and fine-tune a Faster R-CNN model.

Objective

Ingest the COCO dataset into a Deeplake managed table and train a Faster R-CNN ResNet-50 FPN model using PyTorch, streaming data directly from the cloud.

Prerequisites

  • Deeplake SDK: pip install deeplake
  • PyTorch with torchvision: pip install torch torchvision
  • A Deeplake API token.
  • COCO 2017 dataset files (or use the HuggingFace shortcut below).

Set credentials first

export DEEPLAKE_API_KEY="your-token-here"
export DEEPLAKE_WORKSPACE="your-workspace"  # optional, defaults to "default"

Complete Code

import torch
from torch.utils.data import DataLoader
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from deeplake import Client

# 1. Setup
client = Client()

# Option A: From local COCO files
from deeplake.managed.formats import CocoPanoptic

client.ingest(
    "coco_train",
    format=CocoPanoptic(
        images_dir="coco/train2017",
        masks_dir="coco/panoptic_train2017",
        annotations="coco/annotations/panoptic_train2017.json",
    ),
)

# Option B: From HuggingFace (one-liner alternative)
# client.ingest("coco_train", {"_huggingface": "detection-datasets/coco"})

# 2. Open the table and create a DataLoader
ds = client.open_table("coco_train")

def collate_fn(batch):
    """Object detection requires variable-length targets per image,
    so we bypass the default stack-based collation."""
    return tuple(zip(*batch))

train_loader = DataLoader(
    ds.pytorch(),
    batch_size=8,
    shuffle=True,
    num_workers=4,
    collate_fn=collate_fn,
)

# 3. Define the model
num_classes = 91  # background + COCO category ids 1-90 (the 80 classes sit in a sparse id range)
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# 4. Training loop
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.005, momentum=0.9, weight_decay=0.0005
)

model.train()
for epoch in range(3):
    epoch_loss = 0.0
    for i, (images, targets) in enumerate(train_loader):
        images = [img.to(device) for img in images]
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]

        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())

        optimizer.zero_grad()
        losses.backward()
        optimizer.step()

        epoch_loss += losses.item()
        if i % 50 == 0:
            print(f"Epoch {epoch}, Batch {i}, Loss: {losses.item():.4f}")

    print(f"Epoch {epoch} complete - Avg Loss: {epoch_loss / (i + 1):.4f}")

torch.save(model.state_dict(), "fasterrcnn_coco.pth")
print("Model saved.")

Step-by-Step Breakdown

1. Ingest COCO Dataset

Deeplake supports ingesting COCO directly via the CocoPanoptic format, which handles images, masks, and annotations in one call:

from deeplake.managed.formats import CocoPanoptic

client.ingest(
    "coco_train",
    format=CocoPanoptic(
        images_dir="coco/train2017",
        masks_dir="coco/panoptic_train2017",
        annotations="coco/annotations/panoptic_train2017.json",
    ),
)

If you don't have COCO downloaded locally, you can ingest it from HuggingFace in a single line:

client.ingest("coco_train", {"_huggingface": "detection-datasets/coco"})

Both approaches create a managed table with image, bounding box, category, and mask columns ready for training.

2. Create a Detection-Ready DataLoader

Object detection models expect a list of images and a list of per-image target dictionaries (each containing boxes, labels, etc.). Since the number of objects varies per image, the default PyTorch collation (which tries to stack tensors) will fail. The standard fix is a custom collate_fn:

def collate_fn(batch):
    return tuple(zip(*batch))

This returns (images_tuple, targets_tuple) instead of attempting to stack variable-length annotations into a single tensor.
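To see the difference concretely, here is a self-contained sketch using dummy tensors (shapes are illustrative): two samples with different object counts pass through collate_fn without any stacking error.

```python
import torch

def collate_fn(batch):
    # Transpose a list of (image, target) pairs into (images, targets).
    return tuple(zip(*batch))

# Two samples with different numbers of annotated objects (2 vs. 5).
batch = [
    (torch.rand(3, 224, 224),
     {"boxes": torch.rand(2, 4), "labels": torch.tensor([1, 7])}),
    (torch.rand(3, 224, 224),
     {"boxes": torch.rand(5, 4), "labels": torch.tensor([3, 3, 18, 44, 90])}),
]

images, targets = collate_fn(batch)
print(len(images), len(targets))  # 2 2
print(targets[0]["boxes"].shape, targets[1]["boxes"].shape)
```

The default collation would raise an error trying to stack a (2, 4) and a (5, 4) boxes tensor; the transpose simply keeps them as separate per-image dictionaries.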

3. Define the Model

We start from a pretrained Faster R-CNN with a ResNet-50 FPN backbone and replace the classification head to match the COCO category count:

model = fasterrcnn_resnet50_fpn(weights="DEFAULT")
in_features = model.roi_heads.box_predictor.cls_score.in_features
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

The pretrained backbone provides strong feature extraction out of the box; only the box predictor head is re-initialized for your target classes.

4. Training Loop

Faster R-CNN computes its own loss internally when given both images and targets. The model returns a dictionary of losses (loss_classifier, loss_box_reg, loss_objectness, loss_rpn_box_reg) which are summed for backpropagation:

loss_dict = model(images, targets)
losses = sum(loss for loss in loss_dict.values())
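Logging the components separately is often more informative than the total, since a stuck loss_rpn_box_reg points to a different problem than a stuck loss_classifier. A minimal illustration with stand-in tensor values (real values come from the model's forward pass):

```python
import torch

# Stand-in for the dictionary returned by model(images, targets).
loss_dict = {
    "loss_classifier": torch.tensor(0.9),
    "loss_box_reg": torch.tensor(0.4),
    "loss_objectness": torch.tensor(0.2),
    "loss_rpn_box_reg": torch.tensor(0.1),
}

# Summing tensors keeps the result on the autograd graph, so .backward() works.
losses = sum(loss for loss in loss_dict.values())
print(f"total: {losses.item():.4f}")

for name, value in loss_dict.items():
    print(f"  {name}: {value.item():.4f}")
```

Note that the losses are summed as tensors, not floats: calling .item() before the sum would detach them from autograd and break backpropagation.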

What to try next