Query

Deep Lake provides powerful query capabilities through its Tensor Query Language (TQL), with special focus on vector similarity search, text search, and operations on multidimensional arrays.

Query Functions

deeplake.query

query(
    query: str,
    token: str | None = None,
    creds: dict[str, str] | None = None,
) -> DatasetView

Executes TQL queries optimized for ML data filtering and search.

TQL is a SQL-like query language designed for ML datasets, supporting:

  • Vector similarity search
  • Text semantic search
  • Complex data filtering
  • Joining across datasets
  • Efficient sorting and pagination

Parameters:

query (str, required)

    TQL query string supporting:

      • Vector similarity: COSINE_SIMILARITY, L2_NORM
      • Text search: BM25_SIMILARITY, CONTAINS
      • Multi-vector similarity for ColPali-style embeddings: MAXSIM
      • Filtering: WHERE clauses
      • Sorting: ORDER BY
      • Joins: JOIN across datasets

token (str | None, default None)

    Optional Activeloop authentication token.

creds (dict[str, str] | None, default None)

    Dictionary containing credentials used to access the dataset at the path.

      • Supported keys: 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
      • If 'aws_access_key_id', 'aws_secret_access_key', or 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with S3 paths.
      • If nothing is given, credentials are fetched from environment variables. The same applies when creds is not passed for cloud datasets.

Returns:

DatasetView

    Query results that can be:

      • Used directly in ML training
      • Further filtered with additional queries
      • Converted to PyTorch/TensorFlow dataloaders
      • Materialized into a new dataset

Examples:

Vector similarity search:

# Find similar embeddings
similar = deeplake.query('''
    SELECT * FROM "mem://embeddings"
    ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
''')

# Use results in training
dataloader = similar.pytorch()

Text semantic search:

# Search documents using BM25
relevant = deeplake.query('''
    SELECT * FROM "mem://documents"
    ORDER BY BM25_SIMILARITY(text, 'machine learning') DESC
    LIMIT 10
''')

Complex filtering:

# Filter training data
train = deeplake.query('''
    SELECT * FROM "mem://dataset"
    WHERE "train_split" = 'train'
    AND confidence > 0.9
    AND label IN ('cat', 'dog')
''')

Joins for feature engineering:

# Combine image features with metadata
features = deeplake.query('''
    SELECT i.image, i.embedding, m.labels, m.metadata
    FROM "mem://images" AS i
    JOIN "mem://metadata" AS m ON i.id = m.image_id
    WHERE m.verified = true
''')
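
Multi-vector search with MAXSIM (a sketch for ColPali-style embeddings; the dataset path, column name, and query vectors below are illustrative):

# Hypothetical multi-vector query: MAXSIM scores each query vector against
# every document vector and ranks by the sum of per-query-vector maxima
pages = deeplake.query('''
    SELECT * FROM "mem://colpali_pages"
    ORDER BY MAXSIM(page_embeddings, ARRAY[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]) DESC
    LIMIT 10
''')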

deeplake.query_async

query_async(
    query: str,
    token: str | None = None,
    creds: dict[str, str] | None = None,
) -> Future[DatasetView]

Asynchronously executes TQL queries optimized for ML data filtering and search.

Non-blocking version of query() for better performance with large datasets. Supports the same TQL features including vector similarity search, text search, filtering, and joins.

Parameters:

query (str, required)

    TQL query string supporting:

      • Vector similarity: COSINE_SIMILARITY, L2_NORM
      • Text search: BM25_SIMILARITY, CONTAINS
      • Filtering: WHERE clauses
      • Sorting: ORDER BY
      • Joins: JOIN across datasets

token (str | None, default None)

    Optional Activeloop authentication token.

creds (dict[str, str] | None, default None)

    Dictionary containing credentials used to access the dataset at the path.

      • Supported keys: 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
      • If 'aws_access_key_id', 'aws_secret_access_key', or 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with S3 paths.
      • If nothing is given, credentials are fetched from environment variables. The same applies when creds is not passed for cloud datasets.

Returns:

Future[DatasetView]

    Resolves to a DatasetView that can be:

      • Used directly in ML training
      • Further filtered with additional queries
      • Converted to PyTorch/TensorFlow dataloaders
      • Materialized into a new dataset

Examples:

Basic async query:

# Run query asynchronously
future = deeplake.query_async('''
    SELECT * FROM "mem://embeddings"
    ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
''')

# Do other work while query runs
prepare_training()

# Get results when needed
results = future.result()

With async/await:

async def search_similar():
    results = await deeplake.query_async('''
        SELECT * FROM "mem://images"
        ORDER BY COSINE_SIMILARITY(embedding, ARRAY[0.1, 0.2, 0.3]) DESC
        LIMIT 100
    ''')
    return results

async def main():
    similar = await search_similar()

Non-blocking check:

future = deeplake.query_async(
    """SELECT * FROM "mem://dataset" WHERE train_split = 'train'"""
)

if future.is_completed():
    train_data = future.result()
else:
    print("Query still running...")

deeplake.prepare_query

prepare_query(
    query: str,
    token: str | None = None,
    creds: dict[str, str] | None = None,
) -> Executor

Prepares a TQL query for execution with optional authentication.

Parameters:

query (str, required)

    TQL query string to execute.

token (str | None, default None)

    Optional Activeloop authentication token.

creds (dict[str, str] | None, default None)

    Dictionary containing credentials used to access the dataset at the path.

Returns:

Executor

    An executor object to run the query.

Examples:

Running a parametrized batch query:

ex = deeplake.prepare_query('SELECT * FROM "mem://parametrized" WHERE category = ?')
results = ex.run_batch([["active"], ["inactive"]])
assert len(results) == 2

deeplake.explain_query

explain_query(
    query: str,
    token: str | None = None,
    creds: dict[str, str] | None = None,
) -> ExplainQueryResult

Explains TQL query with optional authentication.

Parameters:

query (str, required)

    TQL query string to explain.

token (str | None, default None)

    Optional Activeloop authentication token.

creds (dict[str, str] | None, default None)

    Dictionary containing credentials used to access the dataset at the path.

Returns:

ExplainQueryResult

    An explain result object to analyze the query.

Examples:

Explaining a query:

explain_result = deeplake.explain_query(
    """SELECT * FROM "mem://explain_query" WHERE category == 'active'"""
)
print(explain_result)

Query Classes

Executor

Prepared query execution object.

deeplake.Executor

get_query_string() -> str
run_single() -> DatasetView
run_single_async() -> Future[DatasetView]
run_batch(parameters: list = None) -> list
run_batch_async(parameters: list = None) -> Future[list[DatasetView]]
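
A minimal usage sketch (the "mem://dataset" path and label column are illustrative):

# Prepare once, inspect, then execute with several parameter rows
ex = deeplake.prepare_query('SELECT * FROM "mem://dataset" WHERE label = ?')
print(ex.get_query_string())               # the prepared TQL text
views = ex.run_batch([["cat"], ["dog"]])   # one DatasetView per parameter row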

ExplainQueryResult

Query explanation and analysis result.

deeplake.ExplainQueryResult

to_dict() -> Any

Vector Search

Search by vector similarity:

# Cosine similarity search; search_vector is a 1-D query embedding (list of floats)
text_vector = ','.join(str(x) for x in search_vector)
results = deeplake.query(f"""
    SELECT *
    FROM "s3://bucket/embeddings"
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[{text_vector}]) DESC
    LIMIT 100
""")

Text Search

Text search using BM25 or keyword matching:

# Semantic search using BM25
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/documents"
    ORDER BY BM25_SIMILARITY(text, 'search query') DESC
    LIMIT 10
""")

# Keyword search using CONTAINS
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/metadata"
    WHERE CONTAINS(keywords, 'specific term')
""")

Array Operations

Operate on multidimensional arrays:

# Select specific array dimensions
results = deeplake.query("""
    SELECT features[:, 0:10]
    FROM "s3://bucket/features"
""")

# Filter by array values
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/features"
    WHERE features[0] > 0.5
""")

# Aggregate array operations
results = deeplake.query("""
    SELECT AVG(features, axis=0)
    FROM "s3://bucket/features"
""")

Joining Datasets

Join data across different datasets and across different clouds:

# Join datasets from different storage
results = deeplake.query("""
    SELECT i.image, i.embedding, m.labels, m.metadata
    FROM "s3://bucket1/images" AS i
    JOIN "s3://bucket2/metadata" AS m 
    ON i.id = m.image_id
    WHERE m.verified = true
""")

# Complex join with filtering
results = deeplake.query("""
    SELECT 
        i.image,
        e.embedding,
        l.label
    FROM "s3://bucket1/images" AS i
    JOIN "gcs://bucket2/embeddings" AS e ON i.id = e.image_id
    JOIN "azure://container/labels" AS l ON i.id = l.image_id
    WHERE l.confidence > 0.9
    ORDER BY COSINE_SIMILARITY(e.embedding, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
""")

Filtering

Filter data using WHERE clauses:

# Simple filters
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/dataset"
    WHERE label = 'cat'
    AND confidence > 0.9
""")

# Combine with vector search
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/dataset"
    WHERE label IN ('cat', 'dog')
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
""")

Query Results

Process query results:

# Iterate through results
for item in results:
    image = item["images"]
    label = item["label"]

# Direct column access (recommended for performance)
images = results["images"][:]
labels = results["labels"][:]

Async Queries

Execute queries asynchronously:

# Run query asynchronously
future = deeplake.query_async("""
    SELECT *
    FROM "s3://bucket/dataset"
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, 0.3]) DESC
""")

# Get results when ready
results = future.result()

# Check completion
if future.is_completed():
    results = future.result()
else:
    print("Query still running")

Querying Views

Chain queries on views; a query against a view omits the FROM clause and runs on the view itself:

# Initial query
view = deeplake.query("SELECT * FROM \"s3://bucket/animals\"")

# Query on view
cats = view.query("SELECT * WHERE species = 'cat'")

# Further filter
large_cats = cats.query('SELECT * WHERE weight > 10')

Prepared Queries

Prepare queries for reuse with different parameters:

# Prepare a parameterized query
s_executor = deeplake.prepare_query("""
    SELECT *
    FROM "s3://bucket/dataset"
    WHERE label = 'cat'
    AND confidence > 0.5
""")

m_executor = deeplake.prepare_query("""
    SELECT * FROM "s3://bucket/dataset"
    WHERE label = ? AND confidence > ?
""")

# Execute the fixed query
cats_high = s_executor.run_single()

# Batch execution
results = m_executor.run_batch([
    ["cat", 0.9],
    ["dog", 0.8],
    ["bird", 0.7]
])

# Async batch execution with parameters
future = m_executor.run_batch_async([["cat", 0.95]])
result = future.result()[0]

# Get the query string
print(f"Query: {m_executor.get_query_string()}")

Query Explanation

Analyze query execution plans:

# Explain query performance
explanation = deeplake.explain_query("""
    SELECT * FROM "s3://bucket/large_dataset"
    WHERE category = 'cat'
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 1000
""")

# Print explanation
print(explanation)

# Get explanation as dictionary
explain_dict = explanation.to_dict()
print(f"Execution plan: {explain_dict}")

# Use explanation to optimize queries
if "index_used" in explain_dict:
    print("Query will use indexes for optimization")