Advanced Multimodal RAG¶
Enterprise RAG (Retrieval-Augmented Generation) in 2026 goes beyond simple vector lookup. To achieve production-grade reliability, you need to combine keyword matches (BM25) for precision, vector search for meaning, and visual cues from images or diagrams.
Objective¶
Build a RAG system for a technical manual that contains both text descriptions and circuit diagrams (images). The system retrieves the best context using hybrid search and passes it to an LLM.
Prerequisites¶
- Deeplake SDK: pip install deeplake
- Multimodal AI stack: pip install torch transformers pillow accelerate
- A Deeplake API token (set your credentials first; see the quickstart).
Complete Code¶
import torch
from PIL import Image
from deeplake import Client
from transformers import AutoModel, AutoProcessor
# 1. Setup Enterprise Multimodal Encoder (ColQwen3)
# This model creates a joint space for text, images, and documents
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16).to(device)
def get_joint_embedding(text=None, image_path=None):
    """Generates a joint embedding for multimodal context."""
    if image_path:
        img = Image.open(image_path).convert("RGB")
        inputs = processor.process_images(images=[img])
    else:
        inputs = processor.process_texts(texts=[text])
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()
# 2. Setup and Ingest Manual Pages with Diagrams
client = Client()
pages = [
{"id": 1, "text": "Voltage regulator pinout", "img": "diag_01.png"},
{"id": 2, "text": "Cooling fan placement", "img": "diag_02.png"}
]
# Use the visual diagrams to generate search vectors
embeddings = [get_joint_embedding(image_path=p["img"]) for p in pages]
client.ingest("technical_manual", {
"page_id": [p["id"] for p in pages],
"content": [p["text"] for p in pages],
"image": [p["img"] for p in pages],
"embedding": embeddings
})
# 3. Hybrid Multimodal Retrieval
query_text = "how to fix the voltage regulator?"
query_emb = get_joint_embedding(text=query_text)
# Combine the query embedding (vector similarity), the query text (BM25), and a metadata filter
emb_pg = "{" + ",".join(str(x) for x in query_emb) + "}"
context_rows = client.query("""
    SELECT content, image,
           (embedding, content)::deeplake_hybrid_record <#>
           deeplake_hybrid_record($1::float4[], $2, 0.6, 0.4) AS score
    FROM technical_manual
    WHERE page_id < 10
    ORDER BY score DESC LIMIT 3
""", (emb_pg, query_text))
# 4. Pass Context (Text + Image reference) to Multimodal LLM
for row in context_rows:
    print(f"Retrieved: {row['content']} | Image Reference: {row['image']}")
REST Equivalent¶
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
# Requires: export DEEPLAKE_WORKSPACE="your-workspace"
API_URL="https://api.deeplake.ai"
# Hybrid search with metadata filtering via REST SQL
# Combines vector similarity (0.6 weight) with BM25 text match (0.4 weight)
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "SELECT content, image, (embedding, content) <#> deeplake_hybrid_record($1::float4[], $2, 0.6, 0.4) AS score FROM \"'$DEEPLAKE_WORKSPACE'\".\"technical_manual\" WHERE page_id < 10 ORDER BY score DESC LIMIT 3",
"params": ["{0.12,0.22}", "voltage regulator"]
}'
Step-by-Step Breakdown¶
1. The "Retriever Intelligence"¶
Simple vector stores often suffer from "semantic drift" where they return documents that are conceptually similar but lack the exact keywords required. By using Deeplake's Hybrid Search (deeplake_hybrid_record), we ensure that exact terms (like "voltage regulator") are prioritized via BM25, while the vector search captures the overall meaning.
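The weighted blend behind deeplake_hybrid_record can be sketched in plain Python. This is a toy illustration only: the 0.6/0.4 weights mirror the query above, but cosine_similarity and keyword_overlap are simplified stand-ins for the engine's vector and BM25 scoring, not Deeplake's implementation.

```python
import math

def cosine_similarity(a, b):
    """Vector-side signal: semantic closeness of two embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def keyword_overlap(query, doc):
    """Keyword-side signal: a crude stand-in for a normalized BM25 score."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q)

def hybrid_score(q_emb, d_emb, q_text, d_text, w_vec=0.6, w_kw=0.4):
    """Weighted sum of the two signals, as in deeplake_hybrid_record(..., 0.6, 0.4)."""
    return w_vec * cosine_similarity(q_emb, d_emb) + w_kw * keyword_overlap(q_text, d_text)

# A page with the exact keywords outranks one that is only vector-similar
score_exact = hybrid_score([1.0, 0.0], [0.9, 0.1],
                           "voltage regulator", "voltage regulator pinout")
score_drift = hybrid_score([1.0, 0.0], [0.95, 0.05],
                           "voltage regulator", "power supply layout")
```

Even though the "drifted" document is slightly closer in vector space, the keyword term pulls the exact match ahead, which is precisely the failure mode hybrid search corrects.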
2. SQL-First Filtering¶
The true power of Deeplake is combining similarity search with structured SQL. In the example above, the WHERE page_id < 10 clause filters the data before the expensive similarity calculation occurs, significantly improving latency and precision.
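The latency win from pre-filtering can be seen in a toy model: applying the WHERE predicate first shrinks the set of rows that ever reach the (comparatively expensive) similarity computation. This illustrates the principle only, not Deeplake's actual query planner.

```python
# 1,000 rows, but the WHERE clause admits only page_id < 10
rows = [{"page_id": i, "embedding": [float(i), 1.0]} for i in range(1000)]

similarity_calls = 0

def expensive_similarity(emb, query):
    """Stand-in for the costly vector distance; counts how often it runs."""
    global similarity_calls
    similarity_calls += 1
    return -abs(emb[0] - query[0])  # toy score: closer first dimension = higher

query_emb = [5.0, 1.0]

# Filter first (as the SQL WHERE clause does), then rank only the survivors
candidates = [r for r in rows if r["page_id"] < 10]
ranked = sorted(candidates,
                key=lambda r: expensive_similarity(r["embedding"], query_emb),
                reverse=True)

print(similarity_calls)  # similarity ran for 10 rows, not 1000
```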
3. Multimodal Context¶
Because Deeplake stores the raw images alongside the text, your RAG pipeline can pass both the text and the visual diagram to models like GPT-4o. This allows the model to "see" the circuit diagram while "reading" the troubleshooting steps.
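The hand-off to the LLM can be sketched as below, assuming an OpenAI-style chat payload (the format GPT-4o accepts); the exact shape depends on your provider, and the image_url values here are hypothetical — in practice you would derive a signed URL or base64 data URL from the stored image column.

```python
def build_multimodal_prompt(question, retrieved):
    """Interleave retrieved text snippets with their diagram images
    in a single user message (OpenAI-style content parts)."""
    parts = [{"type": "text",
              "text": f"Question: {question}\nUse the manual excerpts and diagrams below."}]
    for row in retrieved:
        parts.append({"type": "text", "text": row["content"]})
        parts.append({"type": "image_url", "image_url": {"url": row["image_url"]}})
    return [{"role": "user", "content": parts}]

messages = build_multimodal_prompt(
    "How do I fix the voltage regulator?",
    [{"content": "Voltage regulator pinout",
      "image_url": "https://example.com/diag_01.png"}],  # hypothetical URL
)
```

The resulting messages list can be passed directly as the messages argument of a chat-completion call, so the model reads the troubleshooting text and sees the pinout diagram in one turn.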
What to try next¶
- Search Guide: tune your Hybrid Search weights (e.g., 0.8 vector vs 0.2 text).
- PDF Processing: how to auto-ingest PDF manuals into this schema.
- Image Search: focus purely on visual similarity.
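Tuning the hybrid weights from the first suggestion amounts to changing the last two arguments of deeplake_hybrid_record. A small sketch, using a hypothetical helper that builds the same query string as the walkthrough with the weights parameterized:

```python
def hybrid_query(w_vec=0.6, w_text=0.4, limit=3):
    """Build the hybrid-search SQL with configurable vector/text weights.
    Hypothetical helper; mirrors the query used in the walkthrough."""
    assert abs(w_vec + w_text - 1.0) < 1e-9, "weights should sum to 1"
    return f"""
        SELECT content, image,
               (embedding, content)::deeplake_hybrid_record <#>
               deeplake_hybrid_record($1::float4[], $2, {w_vec}, {w_text}) AS score
        FROM technical_manual
        ORDER BY score DESC LIMIT {limit}
    """

# Lean harder on semantic similarity: 0.8 vector vs 0.2 text
sql = hybrid_query(w_vec=0.8, w_text=0.2)
```

Higher vector weight favors paraphrased queries; higher text weight favors exact part names and error codes, so tune against a handful of real queries from your domain.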