
Querying Data

Deeplake provides a powerful query language called TQL (Table Query Language) that allows you to query datasets in a SQL-like manner.

Full documentation on TQL syntax is available in the TQL syntax reference.

Single-Dataset Query

deeplake.Dataset.query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

result = ds.query("select * where category == 'active'")
for row in result:
    print("Id is: ", row["id"])

deeplake.Dataset.query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a future that resolves to a deeplake.DatasetView.

Examples:

future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])

Cross-Dataset Query

deeplake.query

query(query: str, token: str | None = None) -> DatasetView

Executes TQL queries optimized for ML data filtering and search.

TQL is a SQL-like query language designed for ML datasets, supporting:
- Vector similarity search
- Text semantic search
- Complex data filtering
- Joining across datasets
- Efficient sorting and pagination

Parameters:

query (str, required): TQL query string supporting:
- Vector similarity: COSINE_SIMILARITY, EUCLIDEAN_DISTANCE
- Text search: BM25_SIMILARITY, CONTAINS
- Filtering: WHERE clauses
- Sorting: ORDER BY
- Joins: JOIN across datasets

token (str | None, default None): Optional Activeloop authentication token.

Returns:

DatasetView: Query results that can be:
- Used directly in ML training
- Further filtered with additional queries
- Converted to PyTorch/TensorFlow dataloaders
- Materialized into a new dataset

Examples:

Vector similarity search:

# Find similar embeddings
similar = deeplake.query('''
    SELECT * FROM "mem://embeddings" 
    ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
''')

# Use results in training
dataloader = similar.pytorch()
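
The embedding in the example above is hard-coded. In practice the query vector usually comes from a model, so the ARRAY literal has to be built from Python values. A minimal sketch, assuming the vector is a plain sequence of numbers; `build_similarity_query` and its parameters are hypothetical and not part of the Deeplake API:

```python
from typing import Sequence


def build_similarity_query(
    dataset_url: str,
    column: str,
    vector: Sequence[float],
    limit: int = 100,
) -> str:
    """Return a TQL string that ranks rows by cosine similarity to `vector`.

    Hypothetical helper: it only formats text in the shape used by the
    example above and does not talk to Deeplake itself.
    """
    if len(vector) == 0:
        raise ValueError("vector must contain at least one value")
    # float() rejects non-numeric items; repr() keeps full float precision.
    array_literal = ", ".join(repr(float(x)) for x in vector)
    return (
        f'SELECT * FROM "{dataset_url}" '
        f"ORDER BY COSINE_SIMILARITY({column}, ARRAY[{array_literal}]) DESC "
        f"LIMIT {int(limit)}"
    )


tql = build_similarity_query("mem://embeddings", "vector", [0.1, 0.2, 0.3])
print(tql)
```

The resulting string can then be passed to deeplake.query(tql).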

Text semantic search:

# Search documents using BM25
relevant = deeplake.query('''
    SELECT * FROM "mem://documents"
    ORDER BY BM25_SIMILARITY(text, 'machine learning') DESC
    LIMIT 10
''')

Complex filtering:

# Filter training data
train = deeplake.query('''
    SELECT * FROM "mem://dataset"
    WHERE "split" = 'train' 
    AND confidence > 0.9
    AND label IN ('cat', 'dog')
''')
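
When the label list comes from Python code rather than being typed by hand, the WHERE clause can be assembled programmatically. A sketch under stated assumptions: `in_clause` is a hypothetical helper, and it rejects values containing a single quote instead of escaping them, because nothing here states how TQL escapes string literals:

```python
from typing import Iterable


def in_clause(column: str, values: Iterable[str]) -> str:
    """Format `column IN ('a', 'b')` from Python strings.

    Hypothetical helper. Values containing a single quote are rejected
    rather than escaped, since this sketch assumes nothing about TQL's
    escaping rules.
    """
    items = list(values)
    if not items:
        raise ValueError("values must not be empty")
    for value in items:
        if "'" in value:
            raise ValueError(f"unsupported character in value: {value!r}")
    quoted = ", ".join(f"'{value}'" for value in items)
    return f"{column} IN ({quoted})"


# Rebuild the filter from the example above out of separate conditions.
where = " AND ".join([
    "\"split\" = 'train'",
    "confidence > 0.9",
    in_clause("label", ["cat", "dog"]),
])
query = f'SELECT * FROM "mem://dataset" WHERE {where}'
print(query)
```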

Joins for feature engineering:

# Combine image features with metadata
features = deeplake.query('''
    SELECT i.image, i.embedding, m.labels, m.metadata
    FROM "mem://images" AS i
    JOIN "mem://metadata" AS m ON i.id = m.image_id
    WHERE m.verified = true
''')

deeplake.query_async

query_async(query: str, token: str | None = None) -> Future

Asynchronously executes TQL queries optimized for ML data filtering and search.

Non-blocking version of query(). The call returns a future without waiting for the query to finish, so other work can proceed while a query over a large dataset runs. Supports the same TQL features, including vector similarity search, text search, filtering, and joins.

Parameters:

query (str, required): TQL query string supporting:
- Vector similarity: COSINE_SIMILARITY, EUCLIDEAN_DISTANCE
- Text search: BM25_SIMILARITY, CONTAINS
- Filtering: WHERE clauses
- Sorting: ORDER BY
- Joins: JOIN across datasets

token (str | None, default None): Optional Activeloop authentication token.

Returns:

Future: Resolves to a DatasetView that can be:
- Used directly in ML training
- Further filtered with additional queries
- Converted to PyTorch/TensorFlow dataloaders
- Materialized into a new dataset

Examples:

Basic async query:

# Run query asynchronously
future = deeplake.query_async('''
    SELECT * FROM "mem://embeddings"
    ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
''')

# Do other work while query runs
prepare_training()

# Get results when needed
results = future.result()

With async/await:

async def search_similar():
    results = await deeplake.query_async('''
        SELECT * FROM "mem://images"
        ORDER BY COSINE_SIMILARITY(embedding, ARRAY[0.1, 0.2, 0.3]) DESC
        LIMIT 100
    ''')
    return results

async def main():
    similar = await search_similar()
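
Because a query keeps running after query_async returns, several queries can be started first and awaited afterwards so that their execution overlaps. A sketch of that pattern: `await_all` is a hypothetical helper, and `fake_query` is a stand-in for real deeplake.query_async(...) calls so the snippet runs without a dataset:

```python
import asyncio


async def await_all(futures):
    """Await already-started futures one by one; results keep input order."""
    return [await f for f in futures]


async def fake_query(name: str) -> str:
    # Stand-in for deeplake.query_async(...); returns a label, not a view.
    await asyncio.sleep(0.01)
    return f"view:{name}"


async def main():
    # Start every query first, then wait for all of them.
    started = [
        asyncio.ensure_future(fake_query(name))
        for name in ("images", "documents")
    ]
    return await await_all(started)


results = asyncio.run(main())
print(results)
```

With real queries, the list would hold the futures returned by deeplake.query_async, assuming they can be awaited as shown in the example above.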

Non-blocking check:

future = deeplake.query_async(
    "SELECT * FROM dataset WHERE \"split\" = 'train'"
)

if future.is_completed():
    train_data = future.result()
else:
    print("Query still running...")
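
The single check above can be extended into a polling loop with a timeout. A minimal sketch that relies only on the two methods used in the example, is_completed() and result(); `wait_for_result` is a hypothetical helper and `FakeFuture` is a stand-in for the object returned by query_async:

```python
import time


def wait_for_result(future, timeout: float = 30.0, poll_interval: float = 0.1):
    """Poll `future.is_completed()` until done or `timeout` seconds pass.

    Hypothetical helper built only on `is_completed()` and `result()`.
    """
    deadline = time.monotonic() + timeout
    while not future.is_completed():
        if time.monotonic() >= deadline:
            raise TimeoutError(f"query did not finish within {timeout} s")
        time.sleep(poll_interval)
    return future.result()


class FakeFuture:
    """Stand-in for a query future, for demonstration only."""

    def __init__(self, value, ready_after_polls: int):
        self._value = value
        self._remaining = ready_after_polls

    def is_completed(self) -> bool:
        # Report "not done" for the first `ready_after_polls` checks.
        self._remaining -= 1
        return self._remaining < 0

    def result(self):
        return self._value


train_data = wait_for_result(
    FakeFuture("train rows", ready_after_polls=3), poll_interval=0.01
)
print(train_data)
```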

Custom TQL Functions

deeplake.tql.register_function

register_function(function: Callable) -> None

Registers the given function in TQL so that it can be used in queries. TQL interacts with Python functions through numpy.ndarray: a function used in TQL should accept its input arguments as numpy arrays and return a numpy array.

Examples:

def next_number(a):
    return a + 1

deeplake.tql.register_function(next_number)

r = ds.query("SELECT * WHERE next_number(column_name) > 10")
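
Since registered functions exchange data with TQL as numpy arrays, it helps to write them with vectorized numpy operations. A sketch: `clamp01` is a hypothetical function, and because it is not stated here whether TQL passes one value or a batch per call, it is written to work elementwise in both cases. The registration and query lines are commented out so the snippet runs without Deeplake:

```python
import numpy as np


def clamp01(a: np.ndarray) -> np.ndarray:
    """Clamp every element of `a` to the range [0, 1]."""
    # np.asarray accepts scalars and arrays alike; np.clip works elementwise.
    return np.clip(np.asarray(a, dtype=np.float64), 0.0, 1.0)


# Registration and use, as in the example above (requires Deeplake):
# deeplake.tql.register_function(clamp01)
# r = ds.query("SELECT * WHERE clamp01(column_name) > 0.5")

print(clamp01(np.array([-0.5, 0.25, 1.5])))
```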