
Quickstart Guide

Get started with Deep Lake by following these examples.

Installation

Deep Lake can be installed using pip:

pip install deeplake

Creating a Dataset

import deeplake

# Create a local dataset
ds = deeplake.create("path/to/dataset")

# Or create in cloud storage
ds = deeplake.create("s3://my-bucket/dataset")
ds = deeplake.create("gcs://my-bucket/dataset")
ds = deeplake.create("azure://container/dataset")

Adding Data

Add columns to store different types of data:

# Add basic data types
ds.add_column("ids", "int32")
ds.add_column("labels", "text")

# Add specialized data types
ds.add_column("images", deeplake.types.Image())
ds.add_column("embeddings", deeplake.types.Embedding(768))
ds.add_column("masks", deeplake.types.BinaryMask())

Insert data into the dataset:

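# Add a single sample: a list containing one dict per row
# (image_array, embedding_vector, and mask_array are placeholders for your own data)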
ds.append([{
    "ids": 1,
    "labels": "cat",
    "images": image_array,
    "embeddings": embedding_vector,
    "masks": mask_array
}])

# Add a batch: a dict mapping each column name to a list of values
ds.append({
    "ids": [1, 2, 3],
    "labels": ["cat", "dog", "bird"],
    "images": batch_of_images,
    "embeddings": batch_of_embeddings,
    "masks": batch_of_masks
})
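The two snippets above show the same data in two shapes: a list of row dicts and a dict of column lists. If your records arrive row by row, a small helper can turn them into the columnar shape used by the batch example. This is a minimal sketch in plain Python. `rows_to_columns` is a hypothetical helper name, not part of Deep Lake, and the example rows carry only the `ids` and `labels` columns to keep it short.

```python
from typing import Any, Dict, List


def rows_to_columns(rows: List[Dict[str, Any]]) -> Dict[str, List[Any]]:
    """Convert a list of row dicts into a dict of column lists."""
    if not rows:
        return {}
    # Use the first row to fix the set of expected columns.
    columns: Dict[str, List[Any]] = {name: [] for name in rows[0]}
    for i, row in enumerate(rows):
        # Every row must provide exactly the same columns.
        if row.keys() != columns.keys():
            raise ValueError(
                f"row {i} has columns {sorted(row)}, expected {sorted(columns)}"
            )
        for name, value in row.items():
            columns[name].append(value)
    return columns


rows = [
    {"ids": 1, "labels": "cat"},
    {"ids": 2, "labels": "dog"},
    {"ids": 3, "labels": "bird"},
]
batch = rows_to_columns(rows)
print(batch)
```

The resulting `batch` dict has the same layout as the argument passed to `ds.append` in the batch example, so it could be passed along in the same way.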

Accessing Data

Access individual samples, ranges, or specific indices:

# Get single items
image = ds["images"][0]
label = ds["labels"][0]
embedding = ds["embeddings"][0]

# Get ranges
images = ds["images"][0:100]
labels = ds["labels"][0:100]

# Get specific indices
selected_images = ds["images"][[0, 2, 3]]

Search by embedding similarity:

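# Find similar items (search_vector is a placeholder for your query embedding)
# Serialize the vector as comma-separated values for the ARRAY[...] literal
text_vector = ','.join(str(x) for x in search_vector)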
results = ds.query(f"""
    SELECT *
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[{text_vector}]) DESC
    LIMIT 100
""")

# Process results - Method 1: iterate through items
for item in results:
    image = item["images"]
    label = item["labels"]

# Process results - Method 2: direct column access (recommended for better performance)
images = results["images"][:]
labels = results["labels"][:]
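To make the similarity query above more concrete, the sketch below shows what a cosine similarity score is and how the query string is assembled from a vector. It is plain Python and does not call Deep Lake. `cosine_similarity` and `build_similarity_query` are hypothetical helper names introduced for illustration, and the query text simply mirrors the one shown above.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def build_similarity_query(
    vector: Sequence[float], column: str = "embeddings", limit: int = 100
) -> str:
    """Build the query text used above from a query vector."""
    values = [float(x) for x in vector]
    if not values:
        raise ValueError("query vector must not be empty")
    if not all(math.isfinite(v) for v in values):
        raise ValueError("query vector must contain only finite numbers")
    literal = ",".join(str(v) for v in values)
    return (
        f"SELECT * "
        f"ORDER BY COSINE_SIMILARITY({column}, ARRAY[{literal}]) DESC "
        f"LIMIT {int(limit)}"
    )


same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
query = build_similarity_query([0.1, 0.2, 0.3], limit=5)
print(query)
```

Vectors pointing the same way score 1 and orthogonal vectors score 0, which is why the query sorts in `DESC` order to return the closest matches first.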

Data Versioning

# Commit changes
ds.commit("Added initial data")

# Create version tag
ds.tag("v1.0")

# View history
for version in ds.history:
    print(version.id, version.message)

Async Operations

Use async operations to start loads and queries without blocking, then call result() when you need the data:

# Async data loading
future = ds["images"].get_async(slice(0, 1000))
images = future.result()

# Async query
future = ds.query_async(
    "SELECT * WHERE labels = 'cat'"
)
cats = future.result()
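The benefit of the pattern above comes from starting several requests before waiting on any of them. The sketch below illustrates that idea with Python's standard `concurrent.futures` module as a stand-in. The futures returned by Deep Lake are its own type and may behave differently, and `load_range` is a hypothetical placeholder for a slow read.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_range(start: int, stop: int) -> list:
    """Hypothetical stand-in for a slow read, such as a remote column fetch."""
    time.sleep(0.05)
    return list(range(start, stop))


with ThreadPoolExecutor(max_workers=4) as pool:
    # Submit every request first so they all run concurrently.
    futures = [pool.submit(load_range, i, i + 250) for i in range(0, 1000, 250)]
    # result() blocks until that particular request has finished.
    chunks = [f.result() for f in futures]

# Flatten the chunks back into one list, preserving request order.
items = [x for chunk in chunks for x in chunk]
print(len(items))
```

Calling `result()` right after each request would serialize the work, so the gain appears only when requests are issued together and collected afterwards.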

Next Steps

Support

If you encounter any issues:

  1. Check our GitHub Issues
  2. Join our Slack Community