Multimodal Asset Library

Modern AI applications rarely deal with just one type of data. Deeplake allows you to build a unified library where images, videos, audio, and text documents are stored in a single searchable table, enabling cross-modal discovery (e.g., finding a video clip by describing it in text).

Objective

Create a unified media library that automatically handles different file formats and allows for semantic search across all asset types using a single SQL query.

Prerequisites

  • Deeplake SDK (pip install deeplake) plus the AI stack (pip install torch transformers pillow accelerate) for the Python SDK tab
  • curl and a terminal (REST API tab)
  • A Deeplake API token.

Set credentials first

export DEEPLAKE_API_KEY="your-token-here"
export DEEPLAKE_WORKSPACE="your-workspace"  # optional, defaults to "default"
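In Python, these variables can be picked up with os.environ; a quick sanity check (assuming the variable names above) avoids confusing auth errors later:

```python
import os

# Read the credentials exported above; the workspace falls back to "default"
api_key = os.environ.get("DEEPLAKE_API_KEY", "")
workspace = os.environ.get("DEEPLAKE_WORKSPACE", "default")

if not api_key:
    print("Warning: DEEPLAKE_API_KEY is not set; SDK and REST calls will fail")
```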

Complete Code

import torch
from PIL import Image
from deeplake import Client
from transformers import AutoModel, AutoProcessor

# 1. Setup Unified Multimodal Encoder (ColQwen3)
# This model natively supports images, video frames, and text
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)

def get_embedding(source_path=None, text=None, is_video=False):
    """Unified embedding generation for any asset type."""
    if text:
        inputs = processor.process_texts(texts=[text])
    elif is_video:
        # Video frames are processed as a sequence
        feats = processor(videos=[source_path], return_tensors=None, videos_kwargs={"return_metadata": True})
        feats.pop("video_metadata", None)
        inputs = feats.convert_to_tensors(tensor_type="pt")
    else:
        if str(source_path).lower().endswith(".pdf"):
            # PIL cannot open PDFs; render the first page to an image
            # (requires pdf2image plus the poppler system package)
            from pdf2image import convert_from_path
            img = convert_from_path(source_path, first_page=1, last_page=1)[0].convert("RGB")
        else:
            img = Image.open(source_path).convert("RGB")
        inputs = processor.process_images(images=[img])

    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()

# 2. Setup Multimodal Library
client = Client()

assets = [
    {"path": "warehouse.jpg", "type": "image"},
    {"path": "assembly.mp4", "type": "video"},
    {"path": "report.pdf", "type": "pdf"}
]

# Generate semantic vectors for each asset type
embeddings = [get_embedding(source_path=a["path"], is_video=(a["type"]=="video")) for a in assets]

client.ingest("media_library", {
    "source": [a["path"] for a in assets],
    "media_type": [a["type"] for a in assets],
    "embedding": embeddings
})

# 3. Cross-Modal Semantic Search
# Find any asset (image, video, PDF) using a text description
query_text = "logistics operations in a warehouse"
query_emb = get_embedding(text=query_text)

emb_pg = "{" + ",".join(str(x) for x in query_emb) + "}"
results = client.query("""
    SELECT source, media_type, embedding <#> $1::float4[] AS score
    FROM media_library ORDER BY score DESC LIMIT 5
""", (emb_pg,))

for r in results:
    print(f"[{r['score']:.4f}] Found {r['media_type']}: {r['source']}")

REST API

# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
API_URL="https://api.deeplake.ai"

# 1. Create a unified media library table
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "CREATE TABLE IF NOT EXISTS \"'$DEEPLAKE_WORKSPACE'\".\"media_library\" (id BIGSERIAL PRIMARY KEY, filename TEXT, page_index INT, start_time FLOAT4, media_type TEXT, embedding FLOAT4[], file_id UUID) USING deeplake"
  }'

# 2. Insert media metadata
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "INSERT INTO \"'$DEEPLAKE_WORKSPACE'\".\"media_library\" (filename, media_type, embedding, file_id) VALUES ($1, $2, $3::float4[], $4::uuid)",
    "params": ["warehouse.jpg", "image", "{0.1,0.2,0.3}", "550e8400-e29b-41d4-a716-446655440000"]
  }'

# 3. Cross-modal search (find any asset by text description)
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "SELECT filename, page_index, start_time, embedding <#> $1::float4[] AS score FROM \"'$DEEPLAKE_WORKSPACE'\".\"media_library\" ORDER BY score DESC LIMIT 3",
    "params": ["{0.1,0.2,0.3}"]
  }'
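The same query endpoint can also be driven from Python. This sketch only assembles the request to mirror the curl calls above; sending it (commented out) would use the requests library:

```python
import os

API_URL = "https://api.deeplake.ai"
workspace = os.environ.get("DEEPLAKE_WORKSPACE", "default")

# Headers matching the curl examples above
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ.get('DEEPLAKE_API_KEY', '')}",
    "X-Activeloop-Org-Id": os.environ.get("DEEPLAKE_ORG_ID", ""),
}
payload = {
    "query": (
        f'SELECT filename, page_index, start_time, embedding <#> $1::float4[] AS score '
        f'FROM "{workspace}"."media_library" ORDER BY score DESC LIMIT 3'
    ),
    "params": ["{0.1,0.2,0.3}"],
}

# import requests
# resp = requests.post(f"{API_URL}/workspaces/{workspace}/tables/query",
#                      headers=headers, json=payload)
# print(resp.json())
```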

Step-by-Step Breakdown

1. Automatic Type Detection

The client.ingest() method uses file extensions and magic bytes to route files to the correct processing engine. Videos are chunked, PDFs are rendered, and images are stored as binary blobs, all within the same managed table.
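The routing logic can be sketched as follows. The detect_media_type helper and its signature table are illustrative, not part of the SDK: magic bytes are checked first, with the file extension as a fallback.

```python
import os

# Magic-byte signatures for a few common formats (illustrative subset)
SIGNATURES = {
    b"\xff\xd8\xff": "image",        # JPEG
    b"\x89PNG\r\n\x1a\n": "image",   # PNG
    b"%PDF-": "pdf",
}

# Extension fallback when no signature matches
EXTENSIONS = {".jpg": "image", ".jpeg": "image", ".png": "image",
              ".mp4": "video", ".mov": "video", ".pdf": "pdf"}

def detect_media_type(path: str, header: bytes = b"") -> str:
    """Route a file to a processing engine by magic bytes, then extension."""
    for sig, kind in SIGNATURES.items():
        if header.startswith(sig):
            return kind
    return EXTENSIONS.get(os.path.splitext(path)[1].lower(), "binary")
```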

2. Cross-Modal Embedding

By using a joint embedding space (like CLIP or ColQwen3), you can map text, images, and video frames into the same vector space. This allows you to find a specific frame in a video or a specific page in a PDF using a single text-based SQL query.
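Why this works: in a joint space, a relevant image vector sits closer to the text query than an irrelevant one, so a single similarity ranking covers all modalities. A toy sketch with made-up 3-d vectors (not real model output):

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy vectors standing in for joint-space embeddings
text_query    = [0.9, 0.1, 0.0]  # "logistics operations in a warehouse"
warehouse_img = [0.8, 0.2, 0.1]  # embedding of warehouse.jpg
beach_clip    = [0.0, 0.1, 0.9]  # embedding of an unrelated video
```

Ranking assets of any type then reduces to sorting by this one score, which is exactly what the SQL query above does with the <#> operator.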

3. Unified Retrieval

Because all assets live in one table, you can run filtered searches across the entire dataset. For example, SELECT * FROM media_library WHERE source LIKE 'warehouse%' AND embedding <#> '{0.1,0.2,0.3}'::float4[] > 0.8 restricts the semantic search to files whose names start with "warehouse".
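A minimal sketch of issuing such a filtered search through the same client.query pattern used in the Python example above. The to_pg_array helper is illustrative (it mirrors the emb_pg formatting earlier), and the source column comes from that example's schema:

```python
def to_pg_array(vec):
    """Format a Python list as a Postgres float4[] literal, e.g. '{0.1,0.2}'."""
    return "{" + ",".join(str(float(x)) for x in vec) + "}"

# Restrict the semantic search to files whose names start with "warehouse"
query = """
    SELECT source, media_type, embedding <#> $1::float4[] AS score
    FROM media_library
    WHERE source LIKE $2
    ORDER BY score DESC LIMIT 5
"""
params = (to_pg_array([0.1, 0.2, 0.3]), "warehouse%")

# results = client.query(query, params)  # client as constructed in the example above
```

Passing the pattern as a parameter rather than interpolating it into the SQL string keeps the query safe and reusable.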

What to try next