Quickstart Guide
Get started with Deep Lake by following these examples.
Installation
Deep Lake can be installed using pip:
pip install deeplake
Creating a Dataset
import deeplake
# Create a local dataset
ds = deeplake.create("path/to/dataset")
# Or create in cloud storage
ds = deeplake.create("s3://my-bucket/dataset")
ds = deeplake.create("gcs://my-bucket/dataset")
ds = deeplake.create("azure://container/dataset")
Adding Data
Add columns to store different types of data:
# Add basic data types
ds.add_column("ids", "int32")
ds.add_column("labels", "text")
# Add specialized data types
ds.add_column("images", deeplake.types.Image())
ds.add_column("embeddings", deeplake.types.Embedding(768))
ds.add_column("masks", deeplake.types.BinaryMask())
Insert data into the dataset:
# Add single samples
ds.append([{
    "ids": 1,
    "labels": "cat",
    "images": image_array,
    "embeddings": embedding_vector,
    "masks": mask_array
}])
# Add batches of data
ds.append({
    "ids": [1, 2, 3],
    "labels": ["cat", "dog", "bird"],
    "images": batch_of_images,
    "embeddings": batch_of_embeddings,
    "masks": batch_of_masks
})
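The two calls above take different shapes: a list of per-sample dictionaries and a single dictionary of per-column lists. As a minimal sketch of how the two layouts relate, the hypothetical helper below (`rows_to_columns` is not part of Deep Lake) converts the first shape into the second using plain Python. The specialized columns are left out so the example runs without any image or embedding data.

```python
def rows_to_columns(rows):
    """Convert a list of per-sample dicts into one dict of per-column lists.

    Hypothetical helper for illustration only; it is not a Deep Lake function.
    """
    if not rows:
        return {}
    columns = {name: [] for name in rows[0]}
    for row in rows:
        # Every sample must provide exactly the same set of columns.
        if row.keys() != columns.keys():
            raise ValueError("every row must have the same columns")
        for name, value in row.items():
            columns[name].append(value)
    return columns


rows = [
    {"ids": 1, "labels": "cat"},
    {"ids": 2, "labels": "dog"},
    {"ids": 3, "labels": "bird"},
]
batch = rows_to_columns(rows)
print(batch)
```

The resulting `batch` has the same layout as the batched `ds.append({...})` call shown above, so data collected row by row can be regrouped before insertion.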
Accessing Data
Access individual samples, ranges, or specific indices by indexing a column:
# Get single items
image = ds["images"][0]
label = ds["labels"][0]
embedding = ds["embeddings"][0]
# Get ranges
images = ds["images"][0:100]
labels = ds["labels"][0:100]
# Get specific indices
selected_images = ds["images"][[0, 2, 3]]
Vector Search
Search by embedding similarity:
# Find similar items
text_vector = ','.join(str(x) for x in search_vector)
results = ds.query(f"""
    SELECT *
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[{text_vector}]) DESC
    LIMIT 100
""")
# Process results - Method 1: iterate through items
for item in results:
    image = item["images"]
    label = item["labels"]
# Process results - Method 2: direct column access (recommended for better performance)
images = results["images"][:]
labels = results["labels"][:]
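To make the ranking in the query above concrete, here is a minimal pure-Python sketch. It assumes the standard definition of cosine similarity; the helper names (`cosine_similarity`, `build_similarity_query`) and the toy vectors are hypothetical and are not part of Deep Lake. The first helper reproduces the ordering locally, and the second builds the same query text as the f-string above.

```python
import math


def cosine_similarity(a, b):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def build_similarity_query(vector, column="embeddings", limit=100):
    """Build the query text used above (hypothetical helper, not a Deep Lake API)."""
    text_vector = ",".join(str(float(x)) for x in vector)
    return (
        f"SELECT * "
        f"ORDER BY COSINE_SIMILARITY({column}, ARRAY[{text_vector}]) DESC "
        f"LIMIT {int(limit)}"
    )


# Toy two-dimensional embeddings standing in for a real embeddings column.
search_vector = [1.0, 0.0]
stored = {"cat": [0.9, 0.1], "dog": [0.0, 1.0], "bird": [0.5, 0.5]}

# Highest similarity first, mirroring ORDER BY ... DESC.
ranked = sorted(
    stored,
    key=lambda label: cosine_similarity(stored[label], search_vector),
    reverse=True,
)
query = build_similarity_query(search_vector, limit=2)
print(ranked)
print(query)
```

Sorting in descending order puts the vectors that point in nearly the same direction as `search_vector` first, which is what `ORDER BY ... DESC` followed by `LIMIT` selects in the query.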
Data Versioning
# Commit changes
ds.commit("Added initial data")
# Create version tag
ds.tag("v1.0")
# View history
for version in ds.history:
    print(version.id, version.message)
Async Operations
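Async operations return a future immediately, so data loading and queries can overlap with other work. Call result() to block until the value is ready: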
# Async data loading
future = ds["images"].get_async(slice(0, 1000))
images = future.result()
# Async query
future = ds.query_async(
    "SELECT * WHERE labels = 'cat'"
)
cats = future.result()
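The benefit of the future pattern comes from starting several operations before waiting on any of them. The sketch below illustrates that idea with Python's standard `concurrent.futures` module instead of Deep Lake itself; `slow_fetch` is a hypothetical stand-in for a slow load or query, and the analogy to Deep Lake's futures is an assumption based only on the shared `result()` call.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_fetch(name, delay=0.2):
    """Hypothetical stand-in for a slow data load or query."""
    time.sleep(delay)
    return f"{name} ready"


with ThreadPoolExecutor(max_workers=2) as pool:
    start = time.perf_counter()
    # Start both operations first; neither call blocks.
    images_future = pool.submit(slow_fetch, "images")
    cats_future = pool.submit(slow_fetch, "cats")
    # Block only when the values are actually needed.
    images = images_future.result()
    cats = cats_future.result()
    elapsed = time.perf_counter() - start

print(images, cats, f"{elapsed:.2f}s")
```

Because both tasks run while the main thread waits, the total time is close to one delay, not the sum of both. Calling `result()` immediately after each submission would serialize the work and lose that overlap.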
Next Steps
- Explore RAG applications
- Check out Deep Learning integration
Support
If you encounter any issues:
- Check our GitHub Issues
- Join our Slack Community