Migrating to Deeplake v4

Deeplake v4 replaces the v3 Python SDK with a SQL-based managed service. Key changes:

            v3                           v4
Interface   deeplake.load(), tensor API  Client(), SQL, ingest()
Storage     hub://org/dataset            Managed tables (no bucket config)
Search      ds.search(), TQL             <#> operator (vector, BM25, hybrid)
Schema      Tensors with htype           SQL columns with types
Install     pip install deeplake         pip install deeplake

Step 1: Install v4

pip install --upgrade deeplake

v3 and v4 cannot coexist in the same environment. If you need v3 access, keep a separate environment.
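For example, a minimal sketch using venv (the environment names are arbitrary; the <4 specifier pins the last v3 release):

python -m venv v3-env && v3-env/bin/pip install "deeplake<4"
python -m venv v4-env && v4-env/bin/pip install --upgrade deeplake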

Step 2: Set up credentials

export DEEPLAKE_API_KEY="your-token-here"
export DEEPLAKE_WORKSPACE="your-workspace"

from deeplake import Client

client = Client()

See Authentication for details.

Step 3: Migrate your data

From v3 datasets (automatic)

Use deeplake.convert() to convert a v3 dataset to v4 format, then ingest into a managed table:

import deeplake

# Convert v3 dataset to v4 format
deeplake.convert(
    src="hub://org_name/v3_dataset",
    dst="al://org_name/v4_dataset"
)

# Open the converted dataset
ds = deeplake.open("al://org_name/v4_dataset")
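
To finish moving the converted data into a managed table, read its columns and pass them to ingest(). A minimal sketch, assuming the converted dataset has text and embedding columns (adjust names to your schema):

from deeplake import Client

client = Client()

# Column names "text" and "embedding" are assumptions; use your dataset's columns.
# Assumes full-column slicing returns an ingestible sequence, per the access
# pattern shown in Step 5.
client.ingest("v4_table", {
    "text": ds["text"][:],
    "embedding": ds["embedding"][:],
})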

From v3 datasets (manual)

Read v3 data in a separate v3 environment, export it to files or dicts, then ingest it in v4 (a v3-side export sketch follows the ingest example):

# In v4 environment: ingest from exported data
client.ingest("my_table", {
    "text": ["doc 1 content", "doc 2 content", "doc 3 content"],
    "embedding": [embedding_1, embedding_2, embedding_3],
})
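
On the v3 side, a minimal export sketch using the v3 API (run this in the v3 environment; the tensor names are assumptions, and the exact accessors may vary by v3 version):

# v3 environment only: export tensors to portable files
import json
import numpy as np
import deeplake  # v3 install

ds = deeplake.load("hub://org_name/v3_dataset")

# Tensor names "text" and "embedding" are assumptions; list yours with ds.tensors
texts = ds.text.data()["value"]                  # list of strings
embeddings = np.stack(ds.embedding.numpy(aslist=True))

np.save("embeddings.npy", embeddings)
with open("texts.json", "w") as f:
    json.dump(texts, f)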

From files

# Ingest files directly: video, images, PDFs, text
client.ingest("documents", {
    "path": ["report.pdf", "notes.txt", "slides.pdf"]
}, schema={"path": "FILE"})
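
File paths can sit alongside ordinary columns; a sketch following the same pattern, assuming columns not listed in the schema get inferred types (the title column and its values are hypothetical):

client.ingest("documents", {
    "path": ["report.pdf", "notes.txt", "slides.pdf"],
    "title": ["Q3 report", "Meeting notes", "Launch deck"],  # hypothetical column
}, schema={"path": "FILE"})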

From other databases

Export your data as column dictionaries and ingest:

# Example: from a pandas DataFrame
import pandas as pd

df = pd.read_sql("SELECT text, metadata FROM old_table", connection)
client.ingest("migrated_table", {
    "text": df["text"].tolist(),
    "metadata": df["metadata"].tolist(),
})
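
For large tables, avoid loading everything into memory at once. A sketch using pandas chunked reads, assuming repeated ingest() calls append to the same table:

# Assumption: calling ingest() repeatedly appends rows to an existing table
for chunk in pd.read_sql("SELECT text, metadata FROM old_table",
                         connection, chunksize=10_000):
    client.ingest("migrated_table", {
        "text": chunk["text"].tolist(),
        "metadata": chunk["metadata"].tolist(),
    })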

Step 4: Create indexes

v3 search required no index setup. In v4, create indexes for the columns you want to search:

# Vector search on embeddings
client.create_index("my_table", "embedding")

# BM25 keyword search on text
client.create_index("my_table", "text")

Or via SQL:

client.query("""
    CREATE INDEX ON "YOUR_WORKSPACE"."my_table"
    USING deeplake_index (embedding DESC)
""")

See Indexes for all index types.

Step 5: Update your queries

v3 → v4 query mapping

# v3
results = ds.search(query_embedding, k=10)

# v4
emb_pg = "{" + ",".join(str(x) for x in query_embedding) + "}"
results = client.query(f"""
    SELECT *, embedding <#> '{emb_pg}'::float4[] AS score
    FROM my_table ORDER BY score DESC LIMIT 10
""")

# v3 (TQL)
view = ds.filter("text CONTAINS 'error'")

# v4 (BM25)
results = client.query("""
    SELECT *, text <#> 'error' AS score
    FROM my_table ORDER BY score DESC LIMIT 10
""")

# v3
ds = deeplake.load("hub://org/dataset")
text = ds.text[0:10].numpy()

# v4 (fluent API)
results = client.table("my_table").select("text").limit(10)()

# v4 (dataset access for training)
ds = client.open_table("my_table")
text = ds["text"][0:10]
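
The comparison table above also lists hybrid search. Since <#> covers both vector and BM25 scoring, one way to combine them is a weighted sum in SQL; a sketch, where the 0.5 weights and column names are assumptions:

# Sketch: weights and column names are assumptions; tune for your data
emb_pg = "{" + ",".join(str(x) for x in query_embedding) + "}"
results = client.query(f"""
    SELECT *,
           0.5 * (embedding <#> '{emb_pg}'::float4[])
         + 0.5 * (text <#> 'error') AS score
    FROM my_table ORDER BY score DESC LIMIT 10
""")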

Step 6: Update training pipelines

# v4: open_table returns a deeplake.Dataset with full PyTorch/TF support
ds = client.open_table("training_data")

# PyTorch
from torch.utils.data import DataLoader
loader = DataLoader(ds.pytorch(), batch_size=32, shuffle=True)

# TensorFlow
tf_ds = ds.tensorflow().batch(32)
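
v3's ds.pytorch(transform=fn) argument has no direct v4 equivalent; per the mapping below, transforms move into standard PyTorch code instead. A sketch applying one in a collate function (the per-sample transform body is a placeholder):

from torch.utils.data import DataLoader

def collate(batch):
    # Placeholder transform: assumes samples are dict-like with a "text" key;
    # substitute whatever fn you passed to ds.pytorch(transform=fn) in v3
    return [sample["text"] for sample in batch]

loader = DataLoader(ds.pytorch(), batch_size=32, collate_fn=collate)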

API mapping reference

v3                                       v4
deeplake.load("hub://org/ds")            client.open_table("table")
deeplake.dataset("s3://...")             client.ingest("table", data)
ds.text[0:10].numpy()                    ds["text"][0:10]
ds.search(embedding, k=10)               client.query("... <#> ... LIMIT 10")
ds.filter("label == 'cat'")              client.query("... WHERE label = 'cat'")
ds.create_tensor("col", htype="image")   Column typed as IMAGE in schema
ds.pytorch(transform=fn)                 ds.pytorch() with DataLoader
TQL COSINE_SIMILARITY(...)               <#> operator
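
Putting a few of these mappings together, a sketch of a typical v3 filter-and-read loop rewritten for v4 (table and column names are placeholders):

# v3: view = ds.filter("label == 'cat'"); texts = view.text.numpy()
results = client.query("""
    SELECT text FROM my_table WHERE label = 'cat' LIMIT 100
""")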

Next steps