Migrating to Deeplake v4¶
Deeplake v4 replaces the v3 Python SDK with a managed service based on SQL. Key changes:
| | v3 | v4 |
|---|---|---|
| Interface | `deeplake.load()`, tensor API | `Client()`, SQL, `ingest()` |
| Storage | `hub://org/dataset` | Managed tables (no bucket config) |
| Search | `ds.search()`, TQL | `<#>` operator (vector, BM25, hybrid) |
| Schema | Tensors with `htype` | SQL columns with types |
| Install | `pip install deeplake` (v3) | `pip install deeplake` (v4) |
Step 1: Install v4¶
Install v4 with `pip install deeplake`. v3 and v4 cannot coexist in the same Python environment; if you still need v3 access, keep it in a separate environment (for example, a dedicated virtualenv).
Step 2: Set up credentials¶
See Authentication for details.
Step 3: Migrate your data¶
From v3 datasets (automatic)¶
Use deeplake.convert() to convert a v3 dataset to v4 format, then ingest into a managed table:
```python
import deeplake

# Convert v3 dataset to v4 format
deeplake.convert(
    src="hub://org_name/v3_dataset",
    dst="al://org_name/v4_dataset"
)

# Open the converted dataset and ingest into a managed table
ds = deeplake.open("al://org_name/v4_dataset")
```
From v3 datasets (manual)¶
Read v3 data in a separate v3 environment, export to files or dicts, then ingest in v4:
```python
# In v4 environment: ingest from exported data
client.ingest("my_table", {
    "text": ["doc 1 content", "doc 2 content", "doc 3 content"],
    "embedding": [embedding_1, embedding_2, embedding_3],
})
```
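If your v3 export produced row-oriented records (one dict per sample), they need to be pivoted into the column-dictionary shape that `ingest()` takes. A minimal, deeplake-independent sketch (the helper name is ours, not part of the SDK):

```python
def rows_to_columns(rows: list[dict]) -> dict:
    """Pivot row-oriented records into a column dict for ingest().

    Assumes every row has the same keys; a missing key raises KeyError.
    """
    if not rows:
        return {}
    keys = rows[0].keys()
    return {key: [row[key] for row in rows] for key in keys}

rows = [
    {"text": "doc 1 content", "embedding": [0.1, 0.2]},
    {"text": "doc 2 content", "embedding": [0.3, 0.4]},
]
columns = rows_to_columns(rows)
# columns == {"text": ["doc 1 content", "doc 2 content"],
#             "embedding": [[0.1, 0.2], [0.3, 0.4]]}
```

The resulting `columns` dict can then be passed to `client.ingest(...)` as shown above.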
From files¶
```python
# Ingest files directly: video, images, PDFs, text
client.ingest("documents", {
    "path": ["report.pdf", "notes.txt", "slides.pdf"]
}, schema={"path": "FILE"})
```
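When ingesting many files, you can collect the paths programmatically before building the column dict. A small deeplake-independent sketch using `pathlib` (the helper name and extension filter are just examples):

```python
from pathlib import Path

def collect_file_paths(root: str, extensions: set[str]) -> dict:
    """Gather matching file paths under `root` into an ingest-style column dict."""
    paths = sorted(
        str(p) for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
    return {"path": paths}

columns = collect_file_paths("docs/", {".pdf", ".txt"})
# columns is e.g. {"path": ["docs/notes.txt", "docs/report.pdf", ...]}
```

The result can be passed to `client.ingest(...)` with `schema={"path": "FILE"}` as shown above.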
From other databases¶
Export your data as column dictionaries and ingest:
```python
# Example: from a pandas DataFrame
import pandas as pd

df = pd.read_sql("SELECT text, metadata FROM old_table", connection)
client.ingest("migrated_table", {
    "text": df["text"].tolist(),
    "metadata": df["metadata"].tolist(),
})
```
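For large exports, it may help to ingest in batches rather than in one giant call. A deeplake-independent sketch of a column-dict chunker (the helper name is ours):

```python
def chunk_columns(columns: dict, batch_size: int):
    """Yield successive column-dict batches of at most batch_size rows.

    Assumes a non-empty dict whose lists all have equal length.
    """
    n = len(next(iter(columns.values())))
    for start in range(0, n, batch_size):
        yield {key: values[start:start + batch_size]
               for key, values in columns.items()}

data = {"text": ["a", "b", "c", "d", "e"], "label": [1, 2, 3, 4, 5]}
batches = list(chunk_columns(data, 2))
# 3 batches: rows 0-1, rows 2-3, row 4
```

Each batch could then be passed to a separate `client.ingest("migrated_table", batch)` call.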
Step 4: Create indexes¶
v3 search required no index setup. In v4, create indexes for the columns you want to search:
```python
# Vector search on embeddings
client.create_index("my_table", "embedding")

# BM25 keyword search on text
client.create_index("my_table", "text")
```
Or via SQL:
```python
client.query("""
    CREATE INDEX ON "YOUR_WORKSPACE"."my_table"
    USING deeplake_index (embedding DESC)
""")
```
See Indexes for all index types.
Step 5: Update your queries¶
v3 → v4 query mapping¶
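As an illustration, a v3 call like `ds.search(embedding, k=10)` becomes a v4 SQL query using the `<#>` operator. A minimal sketch of a query builder (the helper name and exact SQL shape are our assumptions; check the Search docs for the form your deployment expects):

```python
def vector_search_sql(table: str, column: str, k: int) -> str:
    """Build a v4-style vector search query string (hypothetical helper).

    The `<#>` operator and $query_embedding placeholder are illustrative.
    """
    return (
        f'SELECT * FROM "{table}" '
        f"ORDER BY {column} <#> $query_embedding "
        f"LIMIT {k}"
    )

sql = vector_search_sql("my_table", "embedding", 10)
# 'SELECT * FROM "my_table" ORDER BY embedding <#> $query_embedding LIMIT 10'
```

The resulting string would be passed to `client.query(...)`.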
Step 6: Update training pipelines¶
```python
# v4: open_table returns a deeplake.Dataset with full PyTorch/TF support
ds = client.open_table("training_data")

# PyTorch
from torch.utils.data import DataLoader
loader = DataLoader(ds.pytorch(), batch_size=32, shuffle=True)

# TensorFlow
tf_ds = ds.tensorflow().batch(32)
```
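Per the API mapping below, v3's `ds.pytorch(transform=fn)` has no direct v4 equivalent; per-sample transforms move into the loading pipeline instead. Conceptually, this is what a DataLoader over a transforming Dataset does (a torch-free sketch, not the actual DataLoader implementation):

```python
def transformed_batches(samples, transform, batch_size):
    """Apply a per-sample transform, then group samples into batches
    (the shape of work a DataLoader with a transforming Dataset performs)."""
    batch = []
    for sample in samples:
        batch.append(transform(sample))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch

batches = list(transformed_batches(range(5), lambda x: x * 10, batch_size=2))
# [[0, 10], [20, 30], [40]]
```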
API mapping reference¶
| v3 | v4 |
|---|---|
| `deeplake.load("hub://org/ds")` | `client.open_table("table")` |
| `deeplake.dataset("s3://...")` | `client.ingest("table", data)` |
| `ds.text[0:10].numpy()` | `ds["text"][0:10]` |
| `ds.search(embedding, k=10)` | `client.query("... <#> ... LIMIT 10")` |
| `ds.filter("label == 'cat'")` | `client.query("... WHERE label = 'cat'")` |
| `ds.create_tensor("col", htype="image")` | Column typed as `IMAGE` in schema |
| `ds.pytorch(transform=fn)` | `ds.pytorch()` with DataLoader |
| TQL `COSINE_SIMILARITY(...)` | `<#>` operator |
Next steps¶
- Quickstart: get started with v4 in 2 minutes
- Search: vector, BM25, and hybrid search in v4
- Semantic Search: end-to-end search example with v4 API
- Hybrid RAG: RAG pipeline using v4 SQL queries