Multimodal Asset Library¶
Modern AI applications rarely deal with just one type of data. Deeplake allows you to build a unified library where images, videos, audio, and text documents are stored in a single searchable table, enabling cross-modal discovery (e.g., finding a video clip by describing it in text).
Objective¶
Create a unified media library that automatically handles different file formats and allows for semantic search across all asset types using a single SQL query.
Prerequisites¶
- Deeplake SDK: pip install deeplake and AI stack: pip install torch transformers pillow accelerate (Python SDK tab)
- curl and a terminal (REST API tab)
- A Deeplake API token
Set credentials first: export DEEPLAKE_API_KEY (and, for the REST API tab, DEEPLAKE_ORG_ID and DEEPLAKE_WORKSPACE) before running the examples below.
Complete Code¶
import torch
from PIL import Image
from deeplake import Client
from transformers import AutoModel, AutoProcessor
# 1. Setup Unified Multimodal Encoder (ColQwen3)
# This model natively supports images, video frames, and text
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
def get_embedding(source_path=None, text=None, is_video=False):
    """Unified embedding generation for any asset type."""
    if text:
        inputs = processor.process_texts(texts=[text])
    elif is_video:
        # Video frames are processed as a sequence
        feats = processor(videos=[source_path], return_tensors=None, videos_kwargs={"return_metadata": True})
        feats.pop("video_metadata", None)
        inputs = feats.convert_to_tensors(tensor_type="pt")
    else:
        img = Image.open(source_path).convert("RGB")
        inputs = processor.process_images(images=[img])
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()
# 2. Setup Multimodal Library
client = Client()
assets = [
    {"path": "warehouse.jpg", "type": "image"},
    {"path": "assembly.mp4", "type": "video"},
    {"path": "report.pdf", "type": "pdf"}
]
# Generate semantic vectors for each asset type
embeddings = [get_embedding(source_path=a["path"], is_video=(a["type"]=="video")) for a in assets]
client.ingest("media_library", {
    "source": [a["path"] for a in assets],
    "media_type": [a["type"] for a in assets],
    "embedding": embeddings
})
# 3. Cross-Modal Semantic Search
# Find any asset (image, video, PDF) using a text description
query_text = "logistics operations in a warehouse"
query_emb = get_embedding(text=query_text)
emb_pg = "{" + ",".join(str(x) for x in query_emb) + "}"
results = client.query("""
    SELECT source, media_type, embedding <#> $1::float4[] AS score
    FROM media_library ORDER BY score DESC LIMIT 5
""", (emb_pg,))
for r in results:
    print(f"[{r['score']:.4f}] Found {r['media_type']}: {r['source']}")
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
# Requires: export DEEPLAKE_WORKSPACE="your-workspace"
API_URL="https://api.deeplake.ai"
# 1. Create a unified media library table
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "CREATE TABLE IF NOT EXISTS \"'$DEEPLAKE_WORKSPACE'\".\"media_library\" (id BIGSERIAL PRIMARY KEY, filename TEXT, page_index INT, start_time FLOAT4, media_type TEXT, embedding FLOAT4[], file_id UUID) USING deeplake"
  }'
# 2. Insert media metadata
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "INSERT INTO \"'$DEEPLAKE_WORKSPACE'\".\"media_library\" (filename, media_type, embedding, file_id) VALUES ($1, $2, $3::float4[], $4::uuid)",
    "params": ["warehouse.jpg", "image", "{0.1,0.2,0.3}", "550e8400-e29b-41d4-a716-446655440000"]
  }'
# 3. Cross-modal search (find any asset by text description)
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "SELECT filename, page_index, start_time, embedding <#> $1::float4[] AS score FROM \"'$DEEPLAKE_WORKSPACE'\".\"media_library\" ORDER BY score DESC LIMIT 3",
    "params": ["{0.1,0.2,0.3}"]
  }'
Step-by-Step Breakdown¶
1. Automatic Type Detection¶
The client.ingest() method uses file extensions and magic bytes to route files to the correct processing engine. Videos are chunked, PDFs are rendered, and images are stored as binary blobs, all within the same managed table.
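The routing idea can be sketched as follows. This is a toy illustration only: MEDIA_ROUTES and detect_media_type are hypothetical names, and Deeplake's actual detection happens inside client.ingest() and also inspects magic bytes, not just extensions.

```python
from pathlib import Path

# Hypothetical extension-to-engine routing table (illustrative only).
MEDIA_ROUTES = {
    ".jpg": "image", ".jpeg": "image", ".png": "image",
    ".mp4": "video", ".mov": "video",
    ".pdf": "pdf",
}

def detect_media_type(path: str) -> str:
    """Map a file path to a processing engine by its extension."""
    return MEDIA_ROUTES.get(Path(path).suffix.lower(), "binary")

print(detect_media_type("assembly.mp4"))  # video
```

A real implementation would fall back to magic-byte sniffing when the extension is missing or misleading, which is why relying on the managed ingest path is preferable to rolling your own.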
2. Cross-Modal Embedding¶
By using a joint embedding space (like CLIP or ColQwen3), you can map text, images, and video frames into the same vector space. This allows you to find a specific frame in a video or a specific page in a PDF using a single text-based SQL query.
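The effect of a joint embedding space can be shown with a minimal sketch: toy 3-dimensional vectors stand in for real model outputs (which have hundreds or thousands of dimensions), but the comparison logic is the same.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for real, much higher-dimensional embeddings.
text_vec  = np.array([0.9, 0.1, 0.0])   # text query about a warehouse
image_vec = np.array([0.8, 0.2, 0.1])   # photo of a warehouse floor
frame_vec = np.array([0.1, 0.9, 0.2])   # unrelated video frame

# In a joint space, the text query lands closer to the matching image
# than to the unrelated frame.
print(cosine(text_vec, image_vec) > cosine(text_vec, frame_vec))  # True
```

The vector search operator in the SQL queries above performs the same kind of comparison at scale, across every row regardless of media type.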
3. Unified Retrieval¶
Because all assets live in one table, you can combine semantic ranking with ordinary SQL filters across the entire dataset. For example: SELECT * FROM media_library WHERE filename LIKE 'warehouse%' AND embedding <#> '{0.1,0.2,0.3}'::float4[] > 0.8 restricts the concept search to files whose names start with warehouse.
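One way to package such filtered searches is a small helper that builds the SQL string and its parameters together. This is a sketch: build_filtered_query is a hypothetical helper name, and it assumes the client.query(sql, params) call and the source column used in the Complete Code above.

```python
def build_filtered_query(query_emb, filename_prefix, limit=5):
    """Build a cross-modal search restricted to files matching a prefix.

    Returns (sql, params) suitable for a parameterized query call.
    """
    # Serialize the embedding as a Postgres-style float array literal.
    emb_pg = "{" + ",".join(str(x) for x in query_emb) + "}"
    sql = (
        "SELECT source, media_type, embedding <#> $1::float4[] AS score "
        "FROM media_library "
        "WHERE source LIKE $2 "
        f"ORDER BY score DESC LIMIT {int(limit)}"
    )
    return sql, (emb_pg, filename_prefix + "%")

sql, params = build_filtered_query([0.1, 0.2, 0.3], "warehouse")
# results = client.query(sql, params)
```

Passing the embedding and prefix as bound parameters (rather than interpolating them into the SQL string) keeps the query safe against malformed input.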
What to try next¶
- Image Search: deeper focus on visual-only similarity.
- Video Retrieval: advanced temporal search for video.
- Advanced Multimodal RAG: build a RAG system using this library.