Video Retrieval¶
Searching inside video content requires splitting long videos into manageable clips, embedding those clips (often with multi-vector models like MaxSim), and retrieving the exact timestamp of a scene.
Objective¶
Build a video search engine that can find specific moments in long videos using natural language queries.
Prerequisites¶
- Deeplake SDK: `pip install deeplake`, plus the video AI stack: `pip install torch transformers accelerate` (Python SDK tab)
- `curl` and a terminal (REST API tab)
- System dependency: `ffmpeg` (`sudo apt-get install ffmpeg`) (Python SDK tab only)
- A Deeplake API token. Set credentials first.
Complete Code¶
import torch
from deeplake import Client
from transformers import AutoModel, AutoProcessor

# 1. Setup ColQwen3 Multi-Vector Encoder
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=device
).eval()

def get_video_embedding(video_path):
    """Encodes a video into a multi-vector (num_tokens, dim)."""
    feats = processor(videos=[video_path], return_tensors=None, videos_kwargs={"return_metadata": True})
    feats.pop("video_metadata", None)
    inputs = feats.convert_to_tensors(tensor_type="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()  # FLOAT4[][]
# 2. Setup and Ingest
client = Client()
video_paths = ["warehouse_cam1.mp4", "loading_dock.mp4"]
# Generate multi-vector embeddings for each video segment
embeddings = [get_video_embedding(v) for v in video_paths]
client.ingest("video_catalog", {
    "video": video_paths,
    "embedding": embeddings,  # Stored as FLOAT4[][] for MaxSim search
})
# 3. MaxSim Search: finding specific moments
query_text = "forklift moving pallets"
query_inputs = processor.process_texts(texts=[query_text])
query_inputs = {k: v.to(device) for k, v in query_inputs.items()}
with torch.inference_mode():
    query_emb = model(**query_inputs).embeddings[0].cpu().float().numpy().tolist()

# MaxSim matches query tokens against the best parts of the multi-vector video representation
# Note: multi-vector (FLOAT4[][]) params require inline ARRAY[] syntax
multi_pg = "ARRAY[" + ",".join("ARRAY[" + ",".join(str(x) for x in vec) + "]" for vec in query_emb) + "]::float4[][]"
results = client.query(f"""
    SELECT video, start_time, end_time,
           embedding <#> {multi_pg} AS score
    FROM video_catalog ORDER BY score DESC LIMIT 5
""")
for r in results:
    print(f"Moment: {r['start_time']}s - {r['end_time']}s | Score: {r['score']:.4f}")
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
API_URL="https://api.deeplake.ai"
# 1. Create table with multi-vector (FLOAT4[][]) support
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "CREATE TABLE IF NOT EXISTS \"'$DEEPLAKE_WORKSPACE'\".\"video_catalog\" (id BIGSERIAL PRIMARY KEY, start_time FLOAT4, end_time FLOAT4, embedding FLOAT4[][], file_id UUID) USING deeplake"
}'
# 2. Insert a video segment
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "INSERT INTO \"'$DEEPLAKE_WORKSPACE'\".\"video_catalog\" (start_time, end_time, embedding, file_id) VALUES (0.0, 10.0, $1::float4[][], $2::uuid)",
"params": ["{{0.1,0.2},{0.3,0.4}}", "550e8400-e29b-41d4-a716-446655440000"]
}'
# 3. MaxSim Multi-vector Search
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "SELECT start_time, end_time, embedding <#> $1::float4[][] AS score FROM \"'$DEEPLAKE_WORKSPACE'\".\"video_catalog\" ORDER BY score DESC LIMIT 5",
"params": ["{{0.12,0.22},{0.32,0.42}}"]
}'
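The REST calls above pass multi-vector parameters as Postgres array-literal strings (e.g. `{{0.1,0.2},{0.3,0.4}}`). As an illustrative helper (not part of the Deeplake SDK), a nested Python list can be serialized into that format before being placed in the `params` field:

```python
def to_pg_array(multi_vec):
    """Serialize a multi-vector (list of lists of floats) into a
    Postgres array-literal string, e.g. [[0.1, 0.2]] -> "{{0.1,0.2}}"."""
    rows = ",".join(
        "{" + ",".join(repr(float(x)) for x in row) + "}" for row in multi_vec
    )
    return "{" + rows + "}"

# Matches the literal used in the INSERT example above
print(to_pg_array([[0.1, 0.2], [0.3, 0.4]]))  # → {{0.1,0.2},{0.3,0.4}}
```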
Step-by-Step Breakdown¶
1. Automatic Segmentation¶
Deeplake's SDK simplifies video pipelines. When you pass video files to client.ingest(), it uses ffmpeg internally to:
- Detect scene changes, or split at fixed-time intervals (default 10 s).
- Extract a representative thumbnail for each segment.
- Extract audio or visual descriptions if an encoder is provided.
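As a rough sketch of the fixed-interval mode (the exact segmentation logic is internal to `client.ingest()`), the default 10 s windowing can be modeled as:

```python
def segment_intervals(duration_s, interval_s=10.0):
    """Split a video duration into fixed-length (start, end) windows,
    mirroring the default 10 s segmentation described above."""
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + interval_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

# A 25 s clip yields three windows; the last one is shorter
print(segment_intervals(25.0))  # → [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```

Each window becomes one row in `video_catalog`, which is why the search results later carry `start_time` and `end_time`.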
2. Multi-Vector Indexing (MaxSim)¶
Unlike images, which are often represented by a single vector, complex video scenes are better represented by multiple vectors (one per keyframe or object). Deeplake's FLOAT4[][] type and the <#> operator support MaxSim (Maximum Similarity) scoring, which matches each query token against the best-matching part of the video.
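Conceptually, the MaxSim score computed by `<#>` can be sketched in NumPy as follows; this is an illustration of late interaction, not the engine's actual implementation:

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score: for each query token vector,
    take its best match among the document's vectors, then sum."""
    sims = query_vecs @ doc_vecs.T   # (num_query_tokens, num_doc_tokens)
    return sims.max(axis=1).sum()    # best match per query token, summed

q = np.array([[1.0, 0.0], [0.0, 1.0]])               # two query token vectors
d = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])   # three video token vectors
print(maxsim(q, d))  # → 1.7 (0.9 for the first token + 0.8 for the second)
```

Because each query token independently finds its best-matching video token, a query like "forklift moving pallets" can match a frame showing the forklift and a different frame showing the pallets within the same segment.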
3. Temporal Retrieval¶
Because each row represents a segment, the search results directly provide the start_time and end_time. This allows your application to jump directly to the relevant part of the video stream.
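For example, assuming the videos are served over HTTP, a W3C media-fragment URL lets compliant players seek straight to the retrieved window (`seek_url` is a hypothetical helper, not an SDK function, and the CDN URL is illustrative):

```python
def seek_url(base_url, start_time, end_time):
    """Build a W3C media-fragment URL (#t=start,end) so compliant
    players begin playback at the retrieved segment."""
    return f"{base_url}#t={start_time:g},{end_time:g}"

print(seek_url("https://cdn.example.com/warehouse_cam1.mp4", 40.0, 50.0))
# → https://cdn.example.com/warehouse_cam1.mp4#t=40,50
```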
What to try next¶
- Multimodal Library: combine video search with text documents.
- Retrieval to Training: use retrieved clips to fine-tune a model.
- Search Guide: deep dive into MaxSim logic.