Retrieval to Training¶
Traditionally, building a training set involves complex ETL (Extract, Transform, Load) pipelines to move data from a database into a format a model can consume. Deeplake eliminates this step by letting you define your training set as a SQL query and stream the matching rows directly to your GPU.
Objective¶
Retrieve the top-k most relevant clips from a massive video lake based on a training intent (e.g., "robotic picking") and prepare them for a fine-tuning loop.
Prerequisites¶
- Deeplake SDK: pip install deeplake
- A deep learning framework (PyTorch/TensorFlow).
- A Deeplake API token (set your credentials first).
Complete Code¶
import torch
from deeplake import Client
# 1. Setup
client = Client()
# 2. Retrieve Training Candidates via SQL
# We use vector search to find specific actions across 1M+ clips
training_intent = "robot arm picking up a glass bottle"
# query_emb = model.encode(training_intent).tolist()
query_emb = [0.12, 0.42, -0.1] # Placeholder for intent embedding
# Define the training set dynamically as a query
emb_pg = "{" + ",".join(str(x) for x in query_emb) + "}"
matches = client.query("""
    SELECT id, file_id, start_time, end_time
    FROM video_catalog
    ORDER BY embedding <#> $1::float4[] DESC
    LIMIT 1000
""", (emb_pg,))
# Collect the unique IDs of the matching segments
matched_ids = [m['id'] for m in matches]
print(f"Defined training set with {len(matched_ids)} relevant clips.")
# 3. Direct Streaming to GPU Loop
# No need to download the clips to disk. We use the matches to 'view' the dataset.
ds = client.open_table("video_catalog")
# Create a training view: This is a pointer-based subset of the original table.
training_view = ds[matched_ids]
# Stream the view directly to PyTorch
from torch.utils.data import DataLoader
dataloader = DataLoader(training_view.pytorch(), batch_size=16, num_workers=4)
for batch in dataloader:
    # Train model with retrieved segments: batch['video_feed']
    # images = batch['video_feed'].to(device)
    print(f"Loading batch of {len(batch['id'])} retrieved video segments...")
    break
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
# Requires: export DEEPLAKE_WORKSPACE="your-workspace"
API_URL="https://api.deeplake.ai"
# Retrieve training candidate IDs via vector similarity search
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "SELECT id FROM \"'$DEEPLAKE_WORKSPACE'\".\"video_catalog\" ORDER BY embedding <#> $1::float4[] DESC LIMIT 100",
    "params": ["{0.1,0.2,0.3}"]
  }'
Step-by-Step Breakdown¶
1. Training Set as a Query¶
In the Deeplake architecture, your training set is dynamic. If your model's accuracy is low on night scenes, you can simply add a WHERE is_night = TRUE clause to your SQL query and immediately restart your training loop on the new slice of data.
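As a minimal sketch of this idea (the build_training_query helper and the is_night column are illustrative assumptions, not part of the Deeplake SDK), redefining the training slice amounts to recomposing the query string:

```python
def build_training_query(table, where=None, limit=1000):
    """Compose a vector-similarity training-set query with an optional filter."""
    filter_clause = f"WHERE {where} " if where else ""
    return (
        f"SELECT id, file_id, start_time, end_time "
        f"FROM {table} "
        f"{filter_clause}"
        f"ORDER BY embedding <#> $1::float4[] DESC "
        f"LIMIT {limit}"
    )

# Base training set: all clips ranked against the intent embedding
base_query = build_training_query("video_catalog")

# Refined slice: the same intent, restricted to night-time footage
night_query = build_training_query("video_catalog", where="is_night = TRUE")
print(night_query)
```

Restarting training on the refined slice is then just re-running the retrieval step with night_query instead of base_query.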
2. Eliminating Data Redundancy¶
Because the training loop streams data directly from the managed lake, you don't need to maintain multiple copies of the data (e.g., one in a database and one as a .tar file for training). The lake is the source of truth for both metadata and tensors.
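The pointer-based view from step 3 can be illustrated with a toy stand-in (LakeView is a hypothetical class for illustration only, not the SDK's implementation): the view stores only the matched row IDs and resolves them against the single source of truth on access, so no tensors are copied.

```python
class LakeView:
    """Minimal illustration of a pointer-based view: row IDs, not data copies."""

    def __init__(self, rows, ids):
        self._rows = rows        # the full table (the single source of truth)
        self._ids = list(ids)    # only the matched IDs are stored

    def __len__(self):
        return len(self._ids)

    def __getitem__(self, i):
        # Resolve the pointer against the source table at access time
        return self._rows[self._ids[i]]

# Toy "lake" of 1,000 rows; the view references just three of them
table = {n: {"id": n, "video_feed": f"clip-{n}"} for n in range(1_000)}
view = LakeView(table, ids=[3, 17, 256])
print(len(view), view[1]["video_feed"])  # → 3 clip-17
```

A DataLoader wrapped around such a view pulls rows lazily, which is why no second on-disk copy of the training data is ever materialized.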
3. High-Performance Filtering¶
Deeplake's USING deeplake_index clause ensures that the retrieval step takes milliseconds even across millions of rows. This enables interactive "human-in-the-loop" training, where developers refine the training-set query in real time.
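To make the human-in-the-loop pattern concrete, here is a self-contained sketch in which a stub retrieve() stands in for the index-backed client.query() call; the in-memory rows, the is_night predicate, and the timing are all synthetic assumptions for illustration.

```python
import time

# Toy in-memory stand-in for the indexed lake: 100k rows with a night flag.
rows = [{"id": i, "is_night": i % 4 == 0} for i in range(100_000)]

def retrieve(where=None, limit=1000):
    """Stub for client.query(): filter rows, then truncate, as an index would."""
    matched = (r for r in rows if where is None or where(r))
    return [r["id"] for r, _ in zip(matched, range(limit))]

# Developer refines the slice interactively: night-time clips only.
t0 = time.perf_counter()
night_ids = retrieve(where=lambda r: r["is_night"])
elapsed_ms = (time.perf_counter() - t0) * 1000
print(f"{len(night_ids)} night clips in {elapsed_ms:.1f} ms")
```

In the real system the predicate lives in the SQL WHERE clause and the index does the filtering, but the loop structure (edit query, re-retrieve, restart training) is the same.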
What to try next¶
- GPU-Streaming Guide: technical details of the streaming kernels.
- Massive Ingestion: how to populate the lake with millions of candidates.
- Physical AI & Robotics: use case for robotics imitation learning.