Video Retrieval¶
Searching inside video content requires splitting long videos into manageable clips, embedding those clips (often with multi-vector models like MaxSim), and retrieving the exact timestamp of a scene.
Objective¶
Build a video search engine that can find specific moments in long videos using natural language queries.
Prerequisites¶
- Deeplake SDK: `pip install deeplake`, plus the video AI stack: `pip install torch transformers accelerate` (Python SDK tab)
- `curl` and a terminal (REST API tab)
- System dependency: `ffmpeg` (`sudo apt-get install ffmpeg`) (Python SDK tab only)
- A Deeplake API token. Set credentials first.
Complete Code¶
import torch
from deeplake import Client
from transformers import AutoModel, AutoProcessor

# 1. Setup ColQwen3 Multi-Vector Encoder
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=device
).eval()

def get_video_embedding(video_path):
    """Encodes a video into a multi-vector (num_tokens, dim)."""
    feats = processor(videos=[video_path], return_tensors=None, videos_kwargs={"return_metadata": True})
    feats.pop("video_metadata", None)
    inputs = feats.convert_to_tensors(tensor_type="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()  # FLOAT4[][]
# 2. Setup and Ingest
client = Client()
video_paths = ["warehouse_cam1.mp4", "loading_dock.mp4"]
# Generate multi-vector embeddings for each video segment
embeddings = [get_video_embedding(v) for v in video_paths]
client.ingest("video_catalog", {
    "video": video_paths,
    "embedding": embeddings,  # Stored as FLOAT4[][] for MaxSim search
})
# 3. MaxSim Search: finding specific moments
query_text = "forklift moving pallets"
query_inputs = processor.process_texts(texts=[query_text])
query_inputs = {k: v.to(device) for k, v in query_inputs.items()}
with torch.inference_mode():
    query_emb = model(**query_inputs).embeddings[0].cpu().float().numpy().tolist()

# MaxSim matches query tokens against the best parts of the multi-vector video representation
# Note: multi-vector (FLOAT4[][]) params require inline ARRAY[] syntax
multi_pg = "ARRAY[" + ",".join("ARRAY[" + ",".join(str(x) for x in vec) + "]" for vec in query_emb) + "]::float4[][]"
results = client.query(f"""
    SELECT video, start_time, end_time,
           embedding <#> {multi_pg} AS score
    FROM video_catalog ORDER BY score DESC LIMIT 5
""")
for r in results:
    print(f"Moment: {r['start_time']}s - {r['end_time']}s | Score: {r['score']:.4f}")
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
API_URL="https://api.deeplake.ai"
# 1. Create table with multi-vector (FLOAT4[][]) support
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "CREATE TABLE IF NOT EXISTS \"'$DEEPLAKE_WORKSPACE'\".\"video_catalog\" (id BIGSERIAL PRIMARY KEY, start_time FLOAT4, end_time FLOAT4, embedding FLOAT4[][], file_id UUID) USING deeplake"
}'
# 2. Insert a video segment
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "INSERT INTO \"'$DEEPLAKE_WORKSPACE'\".\"video_catalog\" (start_time, end_time, embedding, file_id) VALUES (0.0, 10.0, $1::float4[][], $2::uuid)",
"params": ["{{0.1,0.2},{0.3,0.4}}", "550e8400-e29b-41d4-a716-446655440000"]
}'
# 3. MaxSim Multi-vector Search
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "SELECT start_time, end_time, embedding <#> $1::float4[][] AS score FROM \"'$DEEPLAKE_WORKSPACE'\".\"video_catalog\" ORDER BY score DESC LIMIT 5",
"params": ["{{0.12,0.22},{0.32,0.42}}"]
}'
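The REST calls above pass multi-vector parameters as Postgres array-literal strings (e.g. `{{0.1,0.2},{0.3,0.4}}`). As an illustrative helper (not part of the Deeplake SDK), a nested Python list can be serialized into that format before being placed in the `params` field:

```python
def to_pg_array(multi_vec):
    """Serialize a multi-vector (list of lists of floats) into a
    Postgres array-literal string, e.g. [[0.1, 0.2]] -> "{{0.1,0.2}}"."""
    rows = ",".join(
        "{" + ",".join(repr(float(x)) for x in row) + "}" for row in multi_vec
    )
    return "{" + rows + "}"

# Matches the literal used in the INSERT example above
print(to_pg_array([[0.1, 0.2], [0.3, 0.4]]))  # → {{0.1,0.2},{0.3,0.4}}
```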
Step-by-Step Breakdown¶
1. Automatic Segmentation¶
Deeplake's SDK simplifies video pipelines. When you pass video files to client.ingest(), it uses ffmpeg internally to:
- Detect scene changes, or split at fixed-time intervals (default 10 s).
- Extract a representative thumbnail for each segment.
- Extract audio or visual descriptions if an encoder is provided.
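As a rough sketch of the fixed-interval mode (the exact segmentation logic is internal to `client.ingest()`), the default 10 s windowing can be modeled as:

```python
def segment_intervals(duration_s, interval_s=10.0):
    """Split a video duration into fixed-length (start, end) windows,
    mirroring the default 10 s segmentation described above."""
    segments = []
    start = 0.0
    while start < duration_s:
        end = min(start + interval_s, duration_s)
        segments.append((start, end))
        start = end
    return segments

# A 25 s clip yields three windows; the last one is shorter
print(segment_intervals(25.0))  # → [(0.0, 10.0), (10.0, 20.0), (20.0, 25.0)]
```

Each window becomes one row in `video_catalog`, which is why the search results later carry `start_time` and `end_time`.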
2. Multi-Vector Indexing (MaxSim)¶
Unlike images, which are often represented by a single vector, complex video scenes are better represented by multiple vectors (one per keyframe or object). Deeplake's FLOAT4[][] type and the <#> operator support MaxSim (Maximum Similarity) scoring, which matches each query token against the best-matching part of the video.
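Conceptually, the MaxSim score computed by `<#>` can be sketched in NumPy as follows; this is an illustration of late interaction, not the engine's actual implementation:

```python
import numpy as np

def maxsim(query_vecs, doc_vecs):
    """Late-interaction (MaxSim) score: for each query token vector,
    take its best match among the document's vectors, then sum."""
    sims = query_vecs @ doc_vecs.T   # (num_query_tokens, num_doc_tokens)
    return sims.max(axis=1).sum()    # best match per query token, summed

q = np.array([[1.0, 0.0], [0.0, 1.0]])               # two query token vectors
d = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])   # three video token vectors
print(maxsim(q, d))  # → 1.7 (0.9 for the first token + 0.8 for the second)
```

Because each query token independently finds its best-matching video token, a query like "forklift moving pallets" can match a frame showing the forklift and a different frame showing the pallets within the same segment.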
3. Temporal Retrieval¶
Because each row represents a segment, the search results directly provide the start_time and end_time. This allows your application to jump directly to the relevant part of the video stream.
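For example, assuming the videos are served over HTTP, a W3C media-fragment URL lets compliant players seek straight to the retrieved window (`seek_url` is a hypothetical helper, not an SDK function, and the CDN URL is illustrative):

```python
def seek_url(base_url, start_time, end_time):
    """Build a W3C media-fragment URL (#t=start,end) so compliant
    players begin playback at the retrieved segment."""
    return f"{base_url}#t={start_time:g},{end_time:g}"

print(seek_url("https://cdn.example.com/warehouse_cam1.mp4", 40.0, 50.0))
# → https://cdn.example.com/warehouse_cam1.mp4#t=40,50
```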
What to try next¶
- Multimodal Library: combine video search with text documents.
- Retrieval to Training: use retrieved clips to fine-tune a model.
- Search Guide: deep dive into MaxSim logic.