Multimodal Asset Library

Modern AI applications rarely deal with just one type of data. Deeplake allows you to build a unified library where images, videos, audio, and text documents are stored in a single searchable table, enabling cross-modal discovery (e.g., finding a video clip by describing it in text).

Objective

Create a unified media library that automatically handles different file formats and allows for semantic search across all asset types using a single SQL query.

Prerequisites

  • Deeplake SDK (pip install deeplake) plus the AI stack (pip install torch transformers pillow accelerate) for the Python SDK tab
  • curl and a terminal (REST API tab)
  • A Deeplake API token.

Set credentials first

export DEEPLAKE_API_KEY="your-token-here"
export DEEPLAKE_WORKSPACE="your-workspace"  # optional, defaults to "default"
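In Python, these variables can be picked up with os.environ; a quick sanity check (assuming the variable names above) avoids confusing auth errors later:

```python
import os

# Read the credentials exported above; the workspace falls back to "default"
api_key = os.environ.get("DEEPLAKE_API_KEY", "")
workspace = os.environ.get("DEEPLAKE_WORKSPACE", "default")

if not api_key:
    print("Warning: DEEPLAKE_API_KEY is not set; SDK and REST calls will fail")
```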

Complete Code

import torch
from PIL import Image
from deeplake import Client
from transformers import AutoModel, AutoProcessor

# 1. Setup Unified Multimodal Encoder (ColQwen3)
# This model natively supports images, video frames, and text
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)

def get_embedding(source_path=None, text=None, is_video=False):
    """Unified embedding generation for any asset type."""
    if text:
        inputs = processor.process_texts(texts=[text])
    elif is_video:
        # Video frames are processed as a sequence
        feats = processor(videos=[source_path], return_tensors=None, videos_kwargs={"return_metadata": True})
        feats.pop("video_metadata", None)
        inputs = feats.convert_to_tensors(tensor_type="pt")
    else:
        if str(source_path).lower().endswith(".pdf"):
            # PIL cannot open PDFs; render the first page to an image
            # (requires pdf2image plus the poppler system package)
            from pdf2image import convert_from_path
            img = convert_from_path(source_path, first_page=1, last_page=1)[0].convert("RGB")
        else:
            img = Image.open(source_path).convert("RGB")
        inputs = processor.process_images(images=[img])

    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()

# 2. Setup Multimodal Library
client = Client()

assets = [
    {"path": "warehouse.jpg", "type": "image"},
    {"path": "assembly.mp4", "type": "video"},
    {"path": "report.pdf", "type": "pdf"}
]

# Generate semantic vectors for each asset type
embeddings = [get_embedding(source_path=a["path"], is_video=(a["type"]=="video")) for a in assets]

client.ingest("media_library", {
    "source": [a["path"] for a in assets],
    "media_type": [a["type"] for a in assets],
    "embedding": embeddings
})

# 3. Cross-Modal Semantic Search
# Find any asset (image, video, PDF) using a text description
query_text = "logistics operations in a warehouse"
query_emb = get_embedding(text=query_text)

emb_pg = "{" + ",".join(str(x) for x in query_emb) + "}"
results = client.query("""
    SELECT source, media_type, embedding <#> $1::float4[] AS score
    FROM media_library ORDER BY score DESC LIMIT 5
""", (emb_pg,))

for r in results:
    print(f"[{r['score']:.4f}] Found {r['media_type']}: {r['source']}")

REST API

# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
API_URL="https://api.deeplake.ai"

# 1. Create a unified media library table
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "CREATE TABLE IF NOT EXISTS \"'$DEEPLAKE_WORKSPACE'\".\"media_library\" (id BIGSERIAL PRIMARY KEY, filename TEXT, page_index INT, start_time FLOAT4, media_type TEXT, embedding FLOAT4[], file_id UUID) USING deeplake"
  }'

# 2. Insert media metadata
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "INSERT INTO \"'$DEEPLAKE_WORKSPACE'\".\"media_library\" (filename, media_type, embedding, file_id) VALUES ($1, $2, $3::float4[], $4::uuid)",
    "params": ["warehouse.jpg", "image", "{0.1,0.2,0.3}", "550e8400-e29b-41d4-a716-446655440000"]
  }'

# 3. Cross-modal search (find any asset by text description)
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "SELECT filename, page_index, start_time, embedding <#> $1::float4[] AS score FROM \"'$DEEPLAKE_WORKSPACE'\".\"media_library\" ORDER BY score DESC LIMIT 3",
    "params": ["{0.1,0.2,0.3}"]
  }'
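The same query endpoint can also be driven from Python. This sketch only assembles the request to mirror the curl calls above; sending it (commented out) would use the requests library:

```python
import os

API_URL = "https://api.deeplake.ai"
workspace = os.environ.get("DEEPLAKE_WORKSPACE", "default")

# Headers matching the curl examples above
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ.get('DEEPLAKE_API_KEY', '')}",
    "X-Activeloop-Org-Id": os.environ.get("DEEPLAKE_ORG_ID", ""),
}
payload = {
    "query": (
        f'SELECT filename, page_index, start_time, embedding <#> $1::float4[] AS score '
        f'FROM "{workspace}"."media_library" ORDER BY score DESC LIMIT 3'
    ),
    "params": ["{0.1,0.2,0.3}"],
}

# import requests
# resp = requests.post(f"{API_URL}/workspaces/{workspace}/tables/query",
#                      headers=headers, json=payload)
# print(resp.json())
```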

Step-by-Step Breakdown

1. Automatic Type Detection

The client.ingest() method uses file extensions and magic bytes to route files to the correct processing engine. Videos are chunked, PDFs are rendered, and images are stored as binary blobs, all within the same managed table.
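The routing logic can be sketched as follows. The detect_media_type helper and its signature table are illustrative, not part of the SDK: magic bytes are checked first, with the file extension as a fallback.

```python
import os

# Magic-byte signatures for a few common formats (illustrative subset)
SIGNATURES = {
    b"\xff\xd8\xff": "image",        # JPEG
    b"\x89PNG\r\n\x1a\n": "image",   # PNG
    b"%PDF-": "pdf",
}

# Extension fallback when no signature matches
EXTENSIONS = {".jpg": "image", ".jpeg": "image", ".png": "image",
              ".mp4": "video", ".mov": "video", ".pdf": "pdf"}

def detect_media_type(path: str, header: bytes = b"") -> str:
    """Route a file to a processing engine by magic bytes, then extension."""
    for sig, kind in SIGNATURES.items():
        if header.startswith(sig):
            return kind
    return EXTENSIONS.get(os.path.splitext(path)[1].lower(), "binary")
```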

2. Cross-Modal Embedding

By using a joint embedding space (like CLIP or ColQwen3), you can map text, images, and video frames into the same vector space. This allows you to find a specific frame in a video or a specific page in a PDF using a single text-based SQL query.
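Why this works: in a joint space, a relevant image vector sits closer to the text query than an irrelevant one, so a single similarity ranking covers all modalities. A toy sketch with made-up 3-d vectors (not real model output):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy vectors standing in for joint-space embeddings
text_query    = [0.9, 0.1, 0.0]  # "logistics operations in a warehouse"
warehouse_img = [0.8, 0.2, 0.1]  # embedding of warehouse.jpg
beach_clip    = [0.0, 0.1, 0.9]  # embedding of an unrelated video
```

Ranking assets of any type then reduces to sorting by this one score, which is exactly what the SQL query above does with the <#> operator.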

3. Unified Retrieval

Because all assets live in one table, you can run filtered searches across the entire dataset. For example, SELECT * FROM media_library WHERE source LIKE 'warehouse%' AND embedding <#> '{0.1,0.2,0.3}'::float4[] > 0.8 restricts the semantic search to files whose names start with "warehouse".
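A minimal sketch of issuing such a filtered search through the same client.query pattern used in the Python example above. The to_pg_array helper is illustrative (it mirrors the emb_pg formatting earlier), and the source column comes from that example's schema:

```python
def to_pg_array(vec):
    """Format a Python list as a Postgres float4[] literal, e.g. '{0.1,0.2}'."""
    return "{" + ",".join(str(float(x)) for x in vec) + "}"

# Restrict the semantic search to files whose names start with "warehouse"
query = """
    SELECT source, media_type, embedding <#> $1::float4[] AS score
    FROM media_library
    WHERE source LIKE $2
    ORDER BY score DESC LIMIT 5
"""
params = (to_pg_array([0.1, 0.2, 0.3]), "warehouse%")

# results = client.query(query, params)  # client as constructed in the example above
```

Passing the pattern as a parameter rather than interpolating it into the SQL string keeps the query safe and reusable.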

What to try next