PDF Processing

Turn a collection of PDFs into a visually searchable library. This recipe renders each page as a high-resolution image with PyMuPDF, stores visual embeddings alongside the page data in Deeplake, and lets you search with natural language queries like "chart showing quarterly revenue".

Objective

Ingest PDFs into Deeplake, generate visual embeddings for each page, and retrieve pages by semantic visual similarity.

Prerequisites

  • Deeplake SDK: pip install deeplake
  • PDF + AI stack: pip install pymupdf torch transformers pillow accelerate
  • A Deeplake API token.

Set credentials first

export DEEPLAKE_API_KEY="your-token-here"
export DEEPLAKE_WORKSPACE="your-workspace"  # optional, defaults to "default"

Complete Code

import os
import io
import uuid
import torch
import fitz  # pymupdf
from PIL import Image
from deeplake import Client
from transformers import AutoModel, AutoProcessor

# ── 1. Setup visual encoder (same as Image Search) ────────
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-4b"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.bfloat16, device_map=device
).eval()

def embed_image(pil_image):
    inputs = processor.process_images(images=[pil_image])
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()

def embed_text(text):
    inputs = processor.process_texts(texts=[text])
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.inference_mode():
        out = model(**inputs)
    return out.embeddings[0].cpu().float().numpy().tolist()

# ── 2. Extract page images from PDFs ──────────────────────
pdf_files = ["user_manual_v1.pdf", "safety_guide.pdf"]

client = Client()

# Create table with multi-vector embedding column
client.query("""
    CREATE TABLE IF NOT EXISTS "pdf_library" (
        _id UUID PRIMARY KEY,
        image IMAGE,
        embedding FLOAT4[][],
        filename TEXT,
        page_index INT8
    ) USING deeplake
""")

ds = client.open_table("pdf_library")

for pdf_path in pdf_files:
    doc = fitz.open(pdf_path)
    batch = {
        "_id": [], "image": [], "embedding": [],
        "filename": [], "page_index": [],
    }
    for page_num in range(len(doc)):
        page = doc[page_num]

        # Render page as 300 DPI image
        pix = page.get_pixmap(dpi=300)
        img_bytes = pix.tobytes("jpeg")
        pil_img = Image.open(io.BytesIO(img_bytes)).convert("RGB")

        batch["_id"].append(str(uuid.uuid4()))
        batch["image"].append(img_bytes)
        batch["embedding"].append(embed_image(pil_img))
        batch["filename"].append(os.path.basename(pdf_path))
        batch["page_index"].append(page_num)

    doc.close()
    ds.append(batch)
    print(f"Appended {len(batch['_id'])} pages from {pdf_path}")

ds.commit()

# Create vector index on embeddings
client.create_index("pdf_library", "embedding")

# ── 3. Search pages by visual similarity ──────────────────
query = "chart showing quarterly revenue"
query_emb = embed_text(query)

# Format multi-vector as PG array literal: {{v1},{v2},...}
emb_pg = "{" + ",".join(
    "{" + ",".join(str(v) for v in row) + "}" for row in query_emb
) + "}"

results = client.query(f"""
    SELECT filename, page_index,
           embedding <#> '{emb_pg}'::float4[][] AS score
    FROM pdf_library ORDER BY score DESC LIMIT 5
""")

for r in results:
    print(f"  {r['filename']} page {r['page_index']} (score: {r['score']:.4f})")
cURL

# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
# Requires: export DEEPLAKE_WORKSPACE="your-workspace"
API_URL="https://api.deeplake.ai"
TABLE="pdf_library"

# 1. Create table
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "CREATE TABLE IF NOT EXISTS \"'$DEEPLAKE_WORKSPACE'\".\"'$TABLE'\" (id SERIAL PRIMARY KEY, filename TEXT, page_index INT4, embedding FLOAT4[][]) USING deeplake"
  }'

# 2. Insert page with embedding (compute embeddings locally)
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "INSERT INTO \"'$DEEPLAKE_WORKSPACE'\".\"'$TABLE'\" (filename, page_index, embedding) VALUES ($1, 0, $2::float4[][])",
    "params": ["user_manual_v1.pdf", "{0.1,0.2,0.3}"]
  }'

# 3. Search by embedding
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "SELECT filename, page_index, embedding <#> $1::float4[][] AS score FROM \"'$DEEPLAKE_WORKSPACE'\".\"'$TABLE'\" ORDER BY score DESC LIMIT 5",
    "params": ["{0.1,0.2,0.3}"]
  }'

Step-by-Step Breakdown

1. Page Rendering

PyMuPDF renders each PDF page as a 300 DPI JPEG. This captures text, diagrams, tables, and figures as a single image, preserving the visual layout that text extraction would lose.

page = doc[page_num]
pix = page.get_pixmap(dpi=300)
img_bytes = pix.tobytes("jpeg")

2. Visual Embeddings

Each page image is encoded into a multi-vector embedding using ColQwen3 (same model as Image Search). ColQwen3 produces float4[][] (one vector per visual token), enabling late-interaction retrieval. A page with a bar chart will match queries like "revenue chart", even if the text on the page doesn't contain those exact words.
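Conceptually, late interaction matches each query-token vector against its best-matching page-token vector and sums those maxima (the ColBERT-style MaxSim score). A minimal sketch with plain lists — Deeplake's actual scoring happens server-side via the <#> operator, so this is only to illustrate the idea:

```python
def maxsim(query_vecs, page_vecs):
    """ColBERT-style late-interaction score over two multi-vectors."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    # For each query token, keep its best match among page tokens, then sum.
    return sum(max(dot(q, p) for p in page_vecs) for q in query_vecs)

query = [[1.0, 0.0], [0.0, 1.0]]          # 2 query-token vectors
page = [[0.9, 0.1], [0.2, 0.8]]           # 2 page-token vectors
score = maxsim(query, page)               # best matches: 0.9 and 0.8
print(round(score, 4))
```

Because each query token only needs one good match, a page containing a revenue chart can score highly for "revenue chart" even when no surrounding text uses those words.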

3. Schema

Column      Type        Description
image       IMAGE       JPEG-encoded page rendering (300 DPI)
embedding   FLOAT4[][]  Multi-vector visual embedding from ColQwen3
filename    TEXT        Source PDF filename
page_index  INT8        Page number (0-indexed)

The <#> operator scores the query's multi-vector text embedding against each page's multi-vector visual embedding; higher scores mean stronger matches, hence ORDER BY score DESC. This lets you search PDFs the way you'd flip through them: by what each page looks like, not just what text it contains.
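The multi-vector query embedding has to be serialized as a nested Postgres-style array literal before it can be cast to float4[][] — this is what the emb_pg expression in the complete code builds inline. As a standalone helper (the function name is illustrative):

```python
def to_pg_multivector(emb):
    """Serialize a list of vectors as a nested PG array literal.

    [[0.1, 0.2], [0.3, 0.4]] -> "{{0.1,0.2},{0.3,0.4}}"
    """
    rows = ("{" + ",".join(str(v) for v in row) + "}" for row in emb)
    return "{" + ",".join(rows) + "}"

print(to_pg_multivector([[0.1, 0.2], [0.3, 0.4]]))
```

Note the outer braces wrap per-row inner braces: a flat "{0.1,0.2}" literal would describe a 1-D array, not the float4[][] the embedding column expects.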

Use cases:

  • "page with the network architecture diagram"
  • "table comparing model accuracy"
  • "flowchart of the approval process"
  • "photo of the circuit board"

What to try next