Advanced Multimodal RAG¶
Enterprise RAG (Retrieval-Augmented Generation) in 2026 goes beyond simple vector lookup. To achieve production-grade reliability, you need to combine keyword matches (BM25) for precision, vector search for meaning, and visual cues from images or diagrams.
Objective¶
Build a RAG system for a technical manual that contains both text descriptions and circuit diagrams (images). The system retrieves the best context using hybrid search and passes it to an LLM.
Prerequisites¶
- Deeplake SDK: pip install deeplake
- Multimodal AI stack: pip install torch transformers pillow accelerate
- A Deeplake API token (set your credentials first; see the quickstart).
Complete Code¶
import torch
from PIL import Image
from deeplake import Client
from transformers import AutoModel, AutoProcessor
# 1. Setup Enterprise Multimodal Encoder (ColQwen3)
# This model creates a joint space for text, images, and documents
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
def get_joint_embedding(text=None, image_path=None):
    """Generates a joint embedding for multimodal context."""
    if image_path:
        img = Image.open(image_path).convert("RGB")
        inputs = processor.process_images(images=[img])
    else:
        inputs = processor.process_texts(texts=[text])
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()
# 2. Setup and Ingest Manual Pages with Diagrams
client = Client()
pages = [
{"id": 1, "text": "Voltage regulator pinout", "img": "diag_01.png"},
{"id": 2, "text": "Cooling fan placement", "img": "diag_02.png"}
]
# Use the visual diagrams to generate search vectors
embeddings = [get_joint_embedding(image_path=p["img"]) for p in pages]
client.ingest("technical_manual", {
"page_id": [p["id"] for p in pages],
"content": [p["text"] for p in pages],
"image": [p["img"] for p in pages],
"embedding": embeddings
})
# 3. Hybrid Multimodal Retrieval
query_text = "how to fix the voltage regulator?"
query_emb = get_joint_embedding(text=query_text)
# Combine the query embedding (vector similarity), the query text (BM25), and a metadata filter
emb_pg = "{" + ",".join(str(x) for x in query_emb) + "}"
context_rows = client.query("""
    SELECT content, image,
           (embedding, content)::deeplake_hybrid_record <#>
           deeplake_hybrid_record($1::float4[], $2, 0.6, 0.4) AS score
    FROM technical_manual
    WHERE page_id < 10
    ORDER BY score DESC LIMIT 3
""", (emb_pg, query_text))
# 4. Pass Context (Text + Image reference) to Multimodal LLM
for row in context_rows:
    print(f"Retrieved: {row['content']} | Image Reference: {row['image']}")
REST Equivalent¶
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
# Requires: export DEEPLAKE_WORKSPACE="your-workspace"
API_URL="https://api.deeplake.ai"
# Hybrid search with metadata filtering via REST SQL
# Combines vector similarity (0.6 weight) with BM25 text match (0.4 weight)
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "SELECT content, image, (embedding, content) <#> deeplake_hybrid_record($1::float4[], $2, 0.6, 0.4) AS score FROM \"'$DEEPLAKE_WORKSPACE'\".\"technical_manual\" WHERE page_id < 10 ORDER BY score DESC LIMIT 3",
"params": ["{0.12,0.22}", "voltage regulator"]
}'
Step-by-Step Breakdown¶
1. The "Retriever Intelligence"¶
Simple vector stores often suffer from "semantic drift" where they return documents that are conceptually similar but lack the exact keywords required. By using Deeplake's Hybrid Search (deeplake_hybrid_record), we ensure that exact terms (like "voltage regulator") are prioritized via BM25, while the vector search captures the overall meaning.
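The weighted blend behind deeplake_hybrid_record can be sketched in plain Python. This is a toy illustration only: the 0.6/0.4 weights mirror the query above, but cosine_similarity and keyword_overlap are simplified stand-ins for the engine's vector and BM25 scoring, not Deeplake's implementation.

```python
import math

def cosine_similarity(a, b):
    """Vector-side signal: semantic closeness of two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def keyword_overlap(query, doc):
    """Keyword-side signal: a crude stand-in for a normalized BM25 score."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def hybrid_score(q_emb, d_emb, q_text, d_text, w_vec=0.6, w_kw=0.4):
    """Weighted sum of the two signals, as in deeplake_hybrid_record(..., 0.6, 0.4)."""
    return w_vec * cosine_similarity(q_emb, d_emb) + w_kw * keyword_overlap(q_text, d_text)

# A page with the exact keywords outranks one that is only vector-similar
score_exact = hybrid_score([1.0, 0.0], [0.9, 0.1],
                           "voltage regulator", "voltage regulator pinout")
score_drift = hybrid_score([1.0, 0.0], [0.95, 0.05],
                           "voltage regulator", "power supply layout")
```

Even though the "drifted" document is slightly closer in vector space, the keyword term pulls the exact match ahead, which is precisely the failure mode hybrid search corrects.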
2. SQL-First Filtering¶
The true power of Deeplake is combining similarity search with structured SQL. In the example above, the WHERE page_id < 10 clause filters the data before the expensive similarity calculation occurs, significantly improving latency and precision.
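The latency win from pre-filtering can be seen in a toy model: applying the WHERE predicate first shrinks the set of rows that ever reach the (comparatively expensive) similarity computation. This illustrates the principle only, not Deeplake's actual query planner.

```python
# 1,000 rows, but the WHERE clause admits only page_id < 10
rows = [{"page_id": i, "embedding": [float(i), 1.0]} for i in range(1000)]

similarity_calls = 0

def expensive_similarity(emb, query):
    """Stand-in for the costly vector distance; counts how often it runs."""
    global similarity_calls
    similarity_calls += 1
    return -abs(emb[0] - query[0])  # toy score: closer first dimension = higher

query_emb = [5.0, 1.0]

# Filter first (as the SQL WHERE clause does), then rank only the survivors
candidates = [r for r in rows if r["page_id"] < 10]
ranked = sorted(candidates,
                key=lambda r: expensive_similarity(r["embedding"], query_emb),
                reverse=True)

print(similarity_calls)  # similarity ran for 10 rows, not 1000
```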
3. Multimodal Context¶
Because Deeplake stores the raw images alongside the text, your RAG pipeline can pass both the text and the visual diagram to models like GPT-4o. This allows the model to "see" the circuit diagram while "reading" the troubleshooting steps.
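The hand-off to the LLM can be sketched as below, assuming an OpenAI-style chat payload (the format GPT-4o accepts); the exact shape depends on your provider, and the image_url values here are hypothetical — in practice you would derive a signed URL or base64 data URL from the stored image column.

```python
def build_multimodal_prompt(question, retrieved):
    """Interleave retrieved text snippets with their diagram images
    in a single user message (OpenAI-style content parts)."""
    parts = [{"type": "text",
              "text": f"Question: {question}\nUse the manual excerpts and diagrams below."}]
    for row in retrieved:
        parts.append({"type": "text", "text": row["content"]})
        parts.append({"type": "image_url", "image_url": {"url": row["image_url"]}})
    return [{"role": "user", "content": parts}]

messages = build_multimodal_prompt(
    "How do I fix the voltage regulator?",
    [{"content": "Voltage regulator pinout",
      "image_url": "https://example.com/diag_01.png"}],  # hypothetical URL
)
```

The resulting messages list can be passed directly as the messages argument of a chat-completion call, so the model reads the troubleshooting text and sees the pinout diagram in one turn.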
What to try next¶
- Search Guide: tune your Hybrid Search weights (e.g., 0.8 vector vs 0.2 text).
- PDF Processing: how to auto-ingest PDF manuals into this schema.
- Image Search: focus purely on visual similarity.
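Tuning the hybrid weights from the first suggestion amounts to changing the last two arguments of deeplake_hybrid_record. A small sketch, using a hypothetical helper that builds the same query string as the walkthrough with the weights parameterized:

```python
def hybrid_query(w_vec=0.6, w_text=0.4, limit=3):
    """Build the hybrid-search SQL with configurable vector/text weights.
    Hypothetical helper; mirrors the query used in the walkthrough."""
    assert abs(w_vec + w_text - 1.0) < 1e-9, "weights should sum to 1"
    return f"""
        SELECT content, image,
               (embedding, content)::deeplake_hybrid_record <#>
               deeplake_hybrid_record($1::float4[], $2, {w_vec}, {w_text}) AS score
        FROM technical_manual
        ORDER BY score DESC LIMIT {limit}
    """

# Lean harder on semantic similarity: 0.8 vector vs 0.2 text
sql = hybrid_query(w_vec=0.8, w_text=0.2)
```

Higher vector weight favors paraphrased queries; higher text weight favors exact part names and error codes, so tune against a handful of real queries from your domain.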