Indexes¶
Indexes make search fast. You create them once on a column. Queries use them automatically.
Deeplake supports three index types, all created via USING deeplake_index.
Prerequisite: USING deeplake
Tables must be created with USING deeplake for indexes to work. Without the USING deeplake engine clause on the table, index creation will fail.
Setup¶
Set DEEPLAKE_API_KEY and DEEPLAKE_WORKSPACE as environment variables before running the examples (see Quickstart). The curl examples below also read API_URL and DEEPLAKE_ORG_ID from the environment.
Vector index¶
For similarity search on embedding columns. Deeplake supports two embedding formats:
| Column type | Algorithm | How it works |
|---|---|---|
| FLOAT4[] | Cosine similarity | Single vector per row. Computes the cosine similarity between the query vector and each row's embedding. Best for text embeddings, image embeddings, or any model that produces one vector per item. |
| FLOAT4[][] | MaxSim | Bag of vectors per row (e.g. one vector per token/patch). For each query vector, finds the best-matching vector in the row, then sums those scores. Used by ColBERT-style late-interaction models for higher-quality retrieval. |
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "CREATE INDEX IF NOT EXISTS idx_docs_vec ON \"YOUR_WORKSPACE\".\"documents\" USING deeplake_index (embedding DESC)"
}'
Enables the <#> operator for vector similarity:
SELECT *, embedding <#> ARRAY[0.1, 0.2, ...]::float4[] AS score
FROM "my_workspace"."documents" ORDER BY score DESC LIMIT 10
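To make the two scoring schemes from the table above concrete, here is a minimal pure-Python sketch. It is illustrative only and not Deeplake's implementation: the function names are hypothetical, and using cosine as the per-vector similarity inside MaxSim is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def maxsim(query_vectors, row_vectors):
    """For each query vector, take its best match in the row, then sum."""
    return sum(
        max(cosine_similarity(q, d) for d in row_vectors)
        for q in query_vectors
    )

# Single-vector column (FLOAT4[]): one query vector against one row vector.
single_score = cosine_similarity([1.0, 0.0], [1.0, 0.0])

# Multi-vector column (FLOAT4[][]): a bag of query vectors against a bag of row vectors.
multi_score = maxsim(
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
)
```

In both cases a larger score means a closer match, which is why the queries on this page sort by score DESC.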
BM25 index¶
For keyword search on text columns.
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "CREATE INDEX IF NOT EXISTS idx_docs_bm25 ON \"YOUR_WORKSPACE\".\"documents\" USING deeplake_index (content) WITH (index_type = '\''bm25'\'')"
}'
Enables the <#> operator for text ranking:
SELECT *, content <#> 'authentication error' AS score
FROM "my_workspace"."documents" ORDER BY score DESC LIMIT 10
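When the search text comes from user input, it may contain single quotes that would break the SQL string. The sketch below builds the BM25 query shown above with the text quoted as a SQL literal. The helper names are hypothetical, and it assumes the standard SQL rule of doubling embedded single quotes applies here; identifiers are interpolated as-is, so they should come from trusted code only.

```python
def sql_literal(text):
    """Quote a string as a SQL literal by doubling embedded single quotes."""
    return "'" + text.replace("'", "''") + "'"

def bm25_query(workspace, table, column, search_text, limit=10):
    """Build the BM25 ranking query for arbitrary search text."""
    return (
        f"SELECT *, {column} <#> {sql_literal(search_text)} AS score "
        f'FROM "{workspace}"."{table}" ORDER BY score DESC LIMIT {int(limit)}'
    )

# Search text containing an apostrophe.
query = bm25_query("my_workspace", "documents", "content", "can't authenticate")
```

If the client you use supports parameterized queries, prefer those over string building.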
Exact text index¶
For fast exact string filtering.
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "CREATE INDEX IF NOT EXISTS idx_docs_category ON \"YOUR_WORKSPACE\".\"documents\" USING deeplake_index (category) WITH (index_type = '\''exact_text'\'')"
}'
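The three curl calls above differ only in the SQL string, and the shell quoting around the index type is easy to get wrong. As a sketch, the same request can be prepared in Python with the standard library, letting json.dumps handle the escaping. The helper names and the placeholder defaults are assumptions for illustration; the endpoint path and headers are copied from the curl examples.

```python
import json
import os
import urllib.request

def create_index_sql(name, workspace, table, column, index_type=None):
    """Build a CREATE INDEX statement in the shape used on this page."""
    if index_type is None:
        # Vector index: no WITH clause, column listed with DESC.
        target, options = f"({column} DESC)", ""
    else:
        target, options = f"({column})", f" WITH (index_type = '{index_type}')"
    return (
        f'CREATE INDEX IF NOT EXISTS {name} ON "{workspace}"."{table}" '
        f"USING deeplake_index {target}{options}"
    )

def build_request(sql):
    """Prepare (but do not send) the POST request that the curl examples make."""
    # Placeholder defaults are assumptions so the sketch runs without credentials.
    api_url = os.environ.get("API_URL", "https://api.example.invalid")
    workspace = os.environ.get("DEEPLAKE_WORKSPACE", "YOUR_WORKSPACE")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {os.environ.get('DEEPLAKE_API_KEY', '')}",
        "X-Activeloop-Org-Id": os.environ.get("DEEPLAKE_ORG_ID", ""),
    }
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        f"{api_url}/workspaces/{workspace}/tables/query",
        data=body, headers=headers, method="POST",
    )

sql = create_index_sql("idx_docs_bm25", "YOUR_WORKSPACE", "documents", "content", "bm25")
request = build_request(sql)
# urllib.request.urlopen(request) would send it; that step is left out of this sketch.
```

Passing no index_type produces the vector form, for example create_index_sql("idx_docs_vec", "YOUR_WORKSPACE", "documents", "embedding").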
When to use each¶
| Index type | Column type | Use case |
|---|---|---|
| Vector (cosine) | FLOAT4[] | "Find similar items": semantic search with single-vector embeddings |
| Vector (MaxSim) | FLOAT4[][] | "Find similar items": late-interaction retrieval (ColBERT-style) with multi-vector embeddings |
| BM25 | TEXT | "Find exact keywords": error codes, function names, IDs |
| Exact text | TEXT | "Filter by category": fast equality checks |
Multiple indexes on one table¶
You can have all three on the same table:
# client and WORKSPACE are assumed to be defined already (see Quickstart)
TICKETS_TABLE = "tickets"
# Vector index for semantic search
client.query(f"""
CREATE INDEX idx_vec ON "{WORKSPACE}"."{TICKETS_TABLE}"
USING deeplake_index (embedding DESC)
""")
# BM25 index for keyword search
client.query(f"""
CREATE INDEX idx_bm25 ON "{WORKSPACE}"."{TICKETS_TABLE}"
USING deeplake_index (description)
WITH (index_type = 'bm25')
""")
# Exact text index for filtering
client.query(f"""
CREATE INDEX idx_status ON "{WORKSPACE}"."{TICKETS_TABLE}"
USING deeplake_index (status)
WITH (index_type = 'exact_text')
""")
# Vector index for semantic search
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "CREATE INDEX idx_vec ON \"YOUR_WORKSPACE\".\"tickets\" USING deeplake_index (embedding DESC)"
}'
# BM25 index for keyword search
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "CREATE INDEX idx_bm25 ON \"YOUR_WORKSPACE\".\"tickets\" USING deeplake_index (description) WITH (index_type = '\''bm25'\'')"
}'
# Exact text index for filtering
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "CREATE INDEX idx_status ON \"YOUR_WORKSPACE\".\"tickets\" USING deeplake_index (status) WITH (index_type = '\''exact_text'\'')"
}'
This combination enables hybrid search. See Search.
Notes¶
- Index creation is a one-time cost. It pays back on every query.
- DESC on vector indexes means higher scores are better matches.
- Indexes are built asynchronously for large tables. The query returns immediately.
Next steps¶
- Search: use your indexes with vector, BM25, hybrid, and multi-vector search
- Semantic Search: end-to-end semantic search example
- Image Search: visual similarity search with embeddings
- Hybrid RAG: combine vector + BM25 indexes for RAG