Query¶
Deep Lake provides powerful query capabilities through its Tensor Query Language (TQL), with special focus on vector similarity search, text search, and operations on multidimensional arrays.
Query Functions¶
deeplake.query
¶
query(query: str, token: str | None = None) -> DatasetView
Executes TQL queries optimized for ML data filtering and search.
TQL is a SQL-like query language designed for ML datasets, supporting:
- Vector similarity search
- Text semantic search
- Complex data filtering
- Joining across datasets
- Efficient sorting and pagination
Parameters:

Name | Type | Description | Default |
---|---|---|---|
query | str | TQL query string supporting: vector similarity (COSINE_SIMILARITY, EUCLIDEAN_DISTANCE), text search (BM25_SIMILARITY, CONTAINS), filtering (WHERE clauses), sorting (ORDER BY), and joins across datasets (JOIN) | required |
token | str \| None | Optional Activeloop authentication token | None |
Returns:

Name | Type | Description |
---|---|---|
DatasetView | DatasetView | Query results that can be: used directly in ML training, further filtered with additional queries, converted to PyTorch/TensorFlow dataloaders, or materialized into a new dataset |
Examples:
Vector similarity search:
```python
# Find similar embeddings
similar = deeplake.query('''
    SELECT * FROM "mem://embeddings"
    ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
''')

# Use results in training
dataloader = similar.pytorch()
```
Text semantic search:
```python
# Search documents using BM25
relevant = deeplake.query('''
    SELECT * FROM "mem://documents"
    ORDER BY BM25_SIMILARITY(text, 'machine learning') DESC
    LIMIT 10
''')
```
Complex filtering:
```python
# Filter training data
train = deeplake.query('''
    SELECT * FROM "mem://dataset"
    WHERE "split" = 'train'
    AND confidence > 0.9
    AND label IN ('cat', 'dog')
''')
```
Joins for feature engineering:
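A join like this can assemble training features from several datasets. The sketch below is illustrative only (the dataset paths and the `id`/`image_id` key columns are hypothetical), following the same shape as the examples in the Joining Datasets section:

```python
# Hedged sketch: composing a feature-engineering join in TQL.
# Paths and join keys are illustrative, not a real dataset layout.
join_query = """
    SELECT i.image, e.embedding, m.labels
    FROM "s3://bucket/images" AS i
    JOIN "s3://bucket/embeddings" AS e ON i.id = e.image_id
    JOIN "s3://bucket/metadata" AS m ON i.id = m.image_id
"""

# Running it requires deeplake and live datasets:
# features = deeplake.query(join_query)
# dataloader = features.pytorch()
```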
deeplake.query_async
¶
query_async(query: str, token: str | None = None) -> Future
Asynchronously executes TQL queries optimized for ML data filtering and search.
Non-blocking version of query() for better performance with large datasets. Supports the same TQL features, including vector similarity search, text search, filtering, and joins.
Parameters:

Name | Type | Description | Default |
---|---|---|---|
query | str | TQL query string supporting: vector similarity (COSINE_SIMILARITY, EUCLIDEAN_DISTANCE), text search (BM25_SIMILARITY, CONTAINS), filtering (WHERE clauses), sorting (ORDER BY), and joins across datasets (JOIN) | required |
token | str \| None | Optional Activeloop authentication token | None |
Returns:

Name | Type | Description |
---|---|---|
Future | Future | Resolves to a DatasetView that can be: used directly in ML training, further filtered with additional queries, converted to PyTorch/TensorFlow dataloaders, or materialized into a new dataset |
Examples:
Basic async query:
```python
# Run query asynchronously
future = deeplake.query_async('''
    SELECT * FROM "mem://embeddings"
    ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
''')

# Do other work while the query runs
prepare_training()

# Get results when needed
results = future.result()
```
With async/await:
```python
async def search_similar():
    results = await deeplake.query_async('''
        SELECT * FROM "mem://images"
        ORDER BY COSINE_SIMILARITY(embedding, ARRAY[0.1, 0.2, 0.3]) DESC
        LIMIT 100
    ''')
    return results

async def main():
    similar = await search_similar()
```
Non-blocking check:
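The completion-polling pattern can be sketched with the standard library; deeplake's Future exposes `is_completed()` (shown in the comments), while the stdlib stand-in below uses `done()`:

```python
from concurrent.futures import ThreadPoolExecutor
import time

# With deeplake this would be:
#   future = deeplake.query_async('SELECT * FROM "mem://dataset"')
#   if future.is_completed():
#       results = future.result()
# Stdlib stand-in demonstrating the same non-blocking check:
with ThreadPoolExecutor() as pool:
    future = pool.submit(sum, range(1000))  # stand-in for query_async
    while not future.done():                # deeplake: future.is_completed()
        time.sleep(0.01)
    results = future.result()
```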
Vector Search¶
Search by vector similarity:
```python
# Cosine similarity search
text_vector = ','.join(str(x) for x in search_vector)
results = deeplake.query(f"""
    SELECT *
    FROM "s3://bucket/embeddings"
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[{text_vector}]) DESC
    LIMIT 100
""")
```
Text Search¶
Text search using BM25 or keyword matching:
```python
# Semantic search using BM25
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/documents"
    ORDER BY BM25_SIMILARITY(text, 'search query') DESC
    LIMIT 10
""")

# Keyword search using CONTAINS
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/metadata"
    WHERE CONTAINS(keywords, 'specific term')
""")
```
Array Operations¶
Operate on multidimensional arrays:
```python
# Select specific array dimensions
results = deeplake.query("""
    SELECT features[:, 0:10]
    FROM "s3://bucket/features"
""")

# Filter by array values
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/features"
    WHERE features[0] > 0.5
""")

# Aggregate array operations
results = deeplake.query("""
    SELECT AVG(features, axis=0)
    FROM "s3://bucket/features"
""")
```
Joining Datasets¶
Join data across datasets, even when they are stored in different clouds:
```python
# Join datasets from different storage
results = deeplake.query("""
    SELECT i.image, i.embedding, m.labels, m.metadata
    FROM "s3://bucket1/images" AS i
    JOIN "s3://bucket2/metadata" AS m
    ON i.id = m.image_id
    WHERE m.verified = true
""")

# Complex join with filtering
results = deeplake.query("""
    SELECT
        i.image,
        e.embedding,
        l.label
    FROM "s3://bucket1/images" AS i
    JOIN "gcs://bucket2/embeddings" AS e ON i.id = e.image_id
    JOIN "azure://container/labels" AS l ON i.id = l.image_id
    WHERE l.confidence > 0.9
    ORDER BY COSINE_SIMILARITY(e.embedding, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
""")
```
Filtering¶
Filter data using WHERE clauses:
```python
# Simple filters
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/dataset"
    WHERE label = 'cat'
    AND confidence > 0.9
""")

# Combine with vector search
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/dataset"
    WHERE label IN ('cat', 'dog')
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
""")
```
Query Results¶
Process query results:
```python
# Iterate through results
for item in results:
    image = item["images"]
    label = item["labels"]

# Direct column access (recommended for performance)
images = results["images"][:]
labels = results["labels"][:]
```
Async Queries¶
Execute queries asynchronously:
```python
# Run query asynchronously
future = deeplake.query_async("""
    SELECT *
    FROM "s3://bucket/dataset"
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, 0.3]) DESC
""")

# Get results when ready
results = future.result()

# Check completion
if future.is_completed():
    results = future.result()
else:
    print("Query still running")
```
Querying Views¶
Chain queries on views:
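A minimal sketch of chaining, assuming the DatasetView returned by `query()` can itself be queried again (the Returns notes above state that results can be further filtered with additional queries); the FROM-less second stage is an assumption, not confirmed syntax:

```python
# The two TQL stages; the dataset path is illustrative.
first_stage = """SELECT * FROM "s3://bucket/dataset" WHERE label = 'cat'"""
second_stage = "SELECT * WHERE confidence > 0.9"

# With deeplake and a real dataset the chain would be:
# cats = deeplake.query(first_stage)         # returns a DatasetView
# confident_cats = cats.query(second_stage)  # query the view again
```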