PDF Processing¶
Turn a collection of PDFs into a visually searchable library. Each page is rendered as a high-resolution image, visual embeddings are stored alongside the page data in Deeplake, and you can search with natural language queries like "chart showing quarterly revenue".
Objective¶
Ingest PDFs into Deeplake, generate visual embeddings for each page, and retrieve pages by semantic visual similarity.
Prerequisites¶
- Deeplake SDK: `pip install deeplake`
- PDF + AI stack: `pip install pymupdf torch transformers pillow accelerate`
- A Deeplake API token (set credentials first).
Complete Code¶
```python
import os
import io
import uuid

import torch
import fitz  # PyMuPDF
from PIL import Image
from deeplake import Client
from transformers import AutoModel, AutoProcessor

# ── 1. Setup visual encoder (same as Image Search) ────────
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=device
).eval()


def embed_image(pil_image):
    inputs = processor.process_images(images=[pil_image])
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()


def embed_text(text):
    inputs = processor.process_texts(texts=[text])
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()


# ── 2. Extract page images from PDFs ──────────────────────
pdf_files = ["user_manual_v1.pdf", "safety_guide.pdf"]

client = Client()

# Create table with multi-vector embedding column
client.query("""
    CREATE TABLE IF NOT EXISTS "pdf_library" (
        _id UUID PRIMARY KEY,
        image IMAGE,
        embedding FLOAT4[][],
        filename TEXT,
        page_index INT8
    ) USING deeplake
""")
ds = client.open_table("pdf_library")

for pdf_path in pdf_files:
    doc = fitz.open(pdf_path)
    batch = {
        "_id": [], "image": [], "embedding": [],
        "filename": [], "page_index": [],
    }
    for page_num in range(len(doc)):
        page = doc[page_num]
        # Render page as a 300 DPI JPEG
        pix = page.get_pixmap(dpi=300)
        img_bytes = pix.tobytes("jpeg")
        pil_img = Image.open(io.BytesIO(img_bytes)).convert("RGB")

        batch["_id"].append(str(uuid.uuid4()))
        batch["image"].append(img_bytes)
        batch["embedding"].append(embed_image(pil_img))
        batch["filename"].append(os.path.basename(pdf_path))
        batch["page_index"].append(page_num)
    doc.close()

    ds.append(batch)
    print(f"Appended {len(batch['_id'])} pages from {pdf_path}")

ds.commit()

# Create vector index on embeddings
client.create_index("pdf_library", "embedding")

# ── 3. Search pages by visual similarity ──────────────────
query = "chart showing quarterly revenue"
query_emb = embed_text(query)

# Format multi-vector as PG array literal: {{v1},{v2},...}
emb_pg = "{" + ",".join(
    "{" + ",".join(str(v) for v in row) + "}" for row in query_emb
) + "}"

results = client.query(f"""
    SELECT filename, page_index,
           embedding <#> '{emb_pg}'::float4[][] AS score
    FROM pdf_library ORDER BY score DESC LIMIT 5
""")
for r in results:
    print(f"  {r['filename']} page {r['page_index']} (score: {r['score']:.4f})")
```
The same flow over the HTTP API (compute embeddings locally, then insert):

```shell
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
# Requires: export DEEPLAKE_WORKSPACE="your-workspace"
API_URL="https://api.deeplake.ai"
TABLE="pdf_library"

# 1. Create table
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "CREATE TABLE IF NOT EXISTS \"'$DEEPLAKE_WORKSPACE'\".\"'$TABLE'\" (id SERIAL PRIMARY KEY, filename TEXT, page_index INT4, embedding FLOAT4[][]) USING deeplake"
  }'

# 2. Insert a page with its embedding (compute embeddings locally;
#    a multi-vector literal is nested: {{...},{...}})
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "INSERT INTO \"'$DEEPLAKE_WORKSPACE'\".\"'$TABLE'\" (filename, page_index, embedding) VALUES ($1, 0, $2::float4[][])",
    "params": ["user_manual_v1.pdf", "{{0.1,0.2,0.3}}"]
  }'

# 3. Search by embedding
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "SELECT filename, page_index, embedding <#> $1::float4[][] AS score FROM \"'$DEEPLAKE_WORKSPACE'\".\"'$TABLE'\" ORDER BY score DESC LIMIT 5",
    "params": ["{{0.1,0.2,0.3}}"]
  }'
```
Step-by-Step Breakdown¶
1. Page Rendering¶
PyMuPDF renders each PDF page as a 300 DPI JPEG. This captures text, diagrams, tables, and figures as a single image, preserving the visual layout that text extraction would lose.
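PDF page geometry is specified in points (1 pt = 1/72 inch), so the rendered bitmap size follows directly from the DPI. A quick sketch of that arithmetic; the `render_size` helper is illustrative, not part of PyMuPDF's API:

```python
def render_size(width_pt, height_pt, dpi=300):
    """Pixel dimensions of a page rendered at a given DPI.

    PDF coordinates are in points (1 pt = 1/72 inch), so each side
    is (points / 72) * dpi pixels.
    """
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)

# A US Letter page (612 x 792 pt, i.e. 8.5 x 11 in) at 300 DPI:
print(render_size(612, 792))  # (2550, 3300)
```

At 300 DPI a Letter page is roughly an 8-megapixel image, which is worth keeping in mind when budgeting storage and embedding time for large collections; lowering `dpi` shrinks both quadratically.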
2. Visual Embeddings¶
Each page image is encoded into a multi-vector embedding using ColQwen3 (the same model as Image Search). ColQwen3 emits one vector per visual token (stored as `FLOAT4[][]`), enabling late-interaction retrieval. A page with a bar chart will match queries like "revenue chart", even if the text on the page doesn't contain those exact words.
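Late interaction scores a query against a page by taking, for each query token vector, its best dot product against any page token vector, and summing those maxima (the MaxSim rule from ColBERT-style retrieval). A minimal pure-Python sketch with toy 2-dimensional vectors, assuming both embeddings are lists of equal-length token vectors:

```python
def maxsim(query_vecs, page_vecs):
    """Late-interaction (MaxSim) score: for each query token vector,
    keep its best dot product against any page token vector, then sum."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]               # 2 query token vectors
page = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]    # 3 page token vectors
print(round(maxsim(query, page), 4))  # 1.7  (= 0.9 + 0.8)
```

Because each query token independently picks its best-matching region of the page, a query like "revenue chart" can latch onto the bar-chart tokens even when the rest of the page is unrelated text.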
3. Schema¶
| Column | Type | Description |
|---|---|---|
| `image` | `IMAGE` | JPEG-encoded page rendering (300 DPI) |
| `embedding` | `FLOAT4[][]` | Multi-vector visual embedding from ColQwen3 |
| `filename` | `TEXT` | Source PDF filename |
| `page_index` | `INT8` | Page number (0-indexed) |
4. Visual Search¶
The <#> operator computes cosine similarity between the query text embedding and each page's visual embedding. This lets you search PDFs the way you'd flip through them: by what the page looks like, not just what text it contains.
Use cases:
- "page with the network architecture diagram"
- "table comparing model accuracy"
- "flowchart of the approval process"
- "photo of the circuit board"
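The query-side array literal from the complete code can be factored into a small helper, which also makes it easy to scope a search to a single PDF. A sketch: the `WHERE filename = …` clause assumes Deeplake's SQL dialect supports standard filtering alongside `<#>`, and both helper names are illustrative:

```python
def to_pg_multivector(rows):
    """Format a multi-vector embedding as a Postgres-style 2-D array
    literal: {{v1,v2,...},{...}}."""
    return "{" + ",".join(
        "{" + ",".join(str(v) for v in row) + "}" for row in rows
    ) + "}"

def search_sql(emb, table="pdf_library", filename=None, limit=5):
    """Build the similarity query; optionally restrict to one source PDF."""
    where = f"WHERE filename = '{filename}' " if filename else ""
    return (
        f"SELECT filename, page_index, "
        f"embedding <#> '{to_pg_multivector(emb)}'::float4[][] AS score "
        f"FROM {table} {where}ORDER BY score DESC LIMIT {limit}"
    )

print(to_pg_multivector([[0.1, 0.2], [0.3, 0.4]]))  # {{0.1,0.2},{0.3,0.4}}
```

A filtered call would then look like `client.query(search_sql(embed_text("warning label diagram"), filename="safety_guide.pdf"))`, answering "which page of *this* manual shows X" rather than searching the whole library.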
What to try next¶
- Image Search: the same visual embedding approach for standalone images.
- Advanced Multimodal RAG: combine visual and text retrieval.
- Massive Ingestion: tune performance for large PDF collections.