Querying Data
Deeplake provides a powerful query language called TQL (Table Query Language) that allows you to query datasets in a SQL-like manner.
Full documentation on TQL syntax can be found here.
Single-Dataset Query
deeplake.Dataset.query
query(query: str) -> DatasetView
Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.
Examples:
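A minimal sketch of a single-dataset query, reusing the query string from the query_async example in this section. The helper name `active_ids` and the `category` and `id` columns are illustrative assumptions, and `ds` stands for any already opened dataset.

```python
def active_ids(ds):
    """Return the `id` value of every row whose `category` is 'active'.

    `ds` is assumed to expose query() as described above; the column
    names are hypothetical and should be replaced with real ones.
    """
    # No FROM clause: the query runs against the dataset it is called on.
    view = ds.query("select * where category == 'active'")
    # The returned view is iterated row by row, with columns read by name.
    return [row["id"] for row in view]
```

Because the helper only relies on `query()` and row indexing, it can be exercised with a small stand-in object before pointing it at a real dataset.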
deeplake.Dataset.query_async
query_async(query: str) -> Future
Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.
Examples:
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
Cross-Dataset Query
deeplake.query
query(query: str, token: str | None = None) -> DatasetView
Executes TQL queries optimized for ML data filtering and search.
TQL is a SQL-like query language designed for ML datasets, supporting:
- Vector similarity search
- Text semantic search
- Complex data filtering
- Joining across datasets
- Efficient sorting and pagination
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | TQL query string supporting: vector similarity (COSINE_SIMILARITY, EUCLIDEAN_DISTANCE); text search (BM25_SIMILARITY, CONTAINS); filtering (WHERE clauses); sorting (ORDER BY); joins (JOIN across datasets) | required |
token | str \| None | Optional Activeloop authentication token | None |
Returns:
Name | Type | Description |
---|---|---|
DatasetView | DatasetView | Query results that can be used directly in ML training, further filtered with additional queries, converted to PyTorch/TensorFlow dataloaders, or materialized into a new dataset |
Examples:
Vector similarity search:
# Find similar embeddings
similar = deeplake.query('''
SELECT * FROM "mem://embeddings"
ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
LIMIT 100
''')
# Use results in training
dataloader = similar.pytorch()
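In practice the query vector usually comes from an embedding model rather than being typed by hand. The sketch below shows one way to render a Python sequence as the `ARRAY[...]` literal used above. The helper name `similarity_query` and its parameters are hypothetical and not part of Deeplake.

```python
from typing import Sequence


def similarity_query(source: str, column: str,
                     embedding: Sequence[float], limit: int = 100) -> str:
    """Build a TQL string that ranks rows of `source` by cosine similarity.

    Hypothetical helper: it only formats text and does no escaping,
    so `source` and `column` must come from trusted input.
    """
    # Render the vector as a TQL ARRAY literal, e.g. ARRAY[0.1, 0.2, 0.3]
    array_literal = "ARRAY[" + ", ".join(repr(float(x)) for x in embedding) + "]"
    return (
        f'SELECT * FROM "{source}"\n'
        f"ORDER BY COSINE_SIMILARITY({column}, {array_literal}) DESC\n"
        f"LIMIT {int(limit)}"
    )


query = similarity_query("mem://embeddings", "vector", [0.1, 0.2, 0.3])
```

The resulting string reproduces the query shown above and can be passed to `deeplake.query(query)`.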
Text semantic search:
# Search documents using BM25
relevant = deeplake.query('''
SELECT * FROM "mem://documents"
ORDER BY BM25_SIMILARITY(text, 'machine learning') DESC
LIMIT 10
''')
Complex filtering:
# Filter training data
train = deeplake.query('''
SELECT * FROM "mem://dataset"
WHERE "split" = 'train'
AND confidence > 0.9
AND label IN ('cat', 'dog')
''')
Joins for feature engineering:
deeplake.query_async
query_async(query: str, token: str | None = None) -> Future
Asynchronously executes TQL queries optimized for ML data filtering and search.
Non-blocking version of query(): the call returns a Future immediately, so other work can proceed while a query over a large dataset runs. Supports the same TQL features, including vector similarity search, text search, filtering, and joins.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | TQL query string supporting: vector similarity (COSINE_SIMILARITY, EUCLIDEAN_DISTANCE); text search (BM25_SIMILARITY, CONTAINS); filtering (WHERE clauses); sorting (ORDER BY); joins (JOIN across datasets) | required |
token | str \| None | Optional Activeloop authentication token | None |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | Resolves to a DatasetView that can be used directly in ML training, further filtered with additional queries, converted to PyTorch/TensorFlow dataloaders, or materialized into a new dataset |
Examples:
Basic async query:
# Run query asynchronously
future = deeplake.query_async('''
SELECT * FROM "mem://embeddings"
ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
''')
# Do other work while query runs
prepare_training()
# Get results when needed
results = future.result()
With async/await:
async def search_similar():
    results = await deeplake.query_async('''
    SELECT * FROM "mem://images"
    ORDER BY COSINE_SIMILARITY(embedding, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
    ''')
    return results

async def main():
    similar = await search_similar()
Non-blocking check:
Custom TQL Functions
deeplake.tql.register_function
Registers the given function in TQL, to be used in queries.
TQL interacts with Python functions through numpy.ndarray. A Python function used in TQL should accept its input arguments as numpy arrays and return a numpy array.
Examples:
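A minimal sketch of a function that follows the numpy-in, numpy-out contract described above. The function `next_number` and the wrapper `register` are illustrative names, and the sketch assumes `register_function` takes the Python callable as its only argument.

```python
import numpy as np


def next_number(a: np.ndarray) -> np.ndarray:
    """Illustrative function: add one to every element of the input array."""
    # Accept a numpy array and return a numpy array, as TQL expects.
    return np.asarray(a) + 1


def register() -> None:
    """Register next_number with TQL (assumed single-argument call)."""
    # Imported lazily so next_number can be tried without deeplake installed.
    import deeplake

    deeplake.tql.register_function(next_number)
```

After `register()` has run, the function would presumably be referenced by its Python name inside a query, for example `ds.query("SELECT * WHERE next_number(column_name) > 10")`, where `column_name` is a placeholder for a real column.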