Query

Deep Lake provides powerful query capabilities through its Tensor Query Language (TQL), with special focus on vector similarity search, text search, and operations on multidimensional arrays.

Query Functions

deeplake.query

query(
    query: str,
    token: str | None = None,
    creds: dict[str, str] | None = None,
) -> DatasetView

Executes TQL queries optimized for ML data filtering and search.

TQL is a SQL-like query language designed for ML datasets, supporting:

  • Vector similarity search
  • Text semantic search
  • Complex data filtering
  • Joining across datasets
  • Efficient sorting and pagination

Parameters:

query (str, required)

    TQL query string supporting:

      • Vector similarity: COSINE_SIMILARITY, L2_NORM
      • Text search: BM25_SIMILARITY, CONTAINS
      • Multi-vector similarity for ColPali-style embeddings: MAXSIM
      • Filtering: WHERE clauses
      • Sorting: ORDER BY
      • Joins: JOIN across datasets

token (str | None, default None)

    Optional Activeloop authentication token.

creds (dict[str, str] | None, default None)

    Dictionary containing credentials used to access the dataset at the path.

      • Supported keys: 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
      • If 'aws_access_key_id', 'aws_secret_access_key', or 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with S3 paths.
      • If nothing is given, credentials are fetched from environment variables. The same applies when creds is not passed for cloud datasets.

Returns:

DatasetView

    Query results that can be:

      • Used directly in ML training
      • Further filtered with additional queries
      • Converted to PyTorch/TensorFlow dataloaders
      • Materialized into a new dataset

Examples:

Vector similarity search:

# Find similar embeddings
similar = deeplake.query('''
    SELECT * FROM "mem://embeddings"
    ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
''')

# Use results in training
dataloader = similar.pytorch()

Text semantic search:

# Search documents using BM25
relevant = deeplake.query('''
    SELECT * FROM "mem://documents"
    ORDER BY BM25_SIMILARITY(text, 'machine learning') DESC
    LIMIT 10
''')

Complex filtering:

# Filter training data
train = deeplake.query('''
    SELECT * FROM "mem://dataset"
    WHERE "train_split" = 'train'
    AND confidence > 0.9
    AND label IN ('cat', 'dog')
''')

Joins for feature engineering:

# Combine image features with metadata
features = deeplake.query('''
    SELECT i.image, i.embedding, m.labels, m.metadata
    FROM "mem://images" AS i
    JOIN "mem://metadata" AS m ON i.id = m.image_id
    WHERE m.verified = true
''')
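
Multi-vector search with MAXSIM (a sketch for ColPali-style embeddings; the dataset path, column name, and query vectors below are illustrative):

# Hypothetical multi-vector query: MAXSIM scores each query vector against
# every document vector and ranks by the sum of per-query-vector maxima
pages = deeplake.query('''
    SELECT * FROM "mem://colpali_pages"
    ORDER BY MAXSIM(page_embeddings, ARRAY[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]) DESC
    LIMIT 10
''')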

deeplake.query_async

query_async(
    query: str,
    token: str | None = None,
    creds: dict[str, str] | None = None,
) -> Future[DatasetView]

Asynchronously executes TQL queries optimized for ML data filtering and search.

Non-blocking version of query() for better performance with large datasets. Supports the same TQL features including vector similarity search, text search, filtering, and joins.

Parameters:

query (str, required)

    TQL query string supporting:

      • Vector similarity: COSINE_SIMILARITY, L2_NORM
      • Text search: BM25_SIMILARITY, CONTAINS
      • Filtering: WHERE clauses
      • Sorting: ORDER BY
      • Joins: JOIN across datasets

token (str | None, default None)

    Optional Activeloop authentication token.

creds (dict[str, str] | None, default None)

    Dictionary containing credentials used to access the dataset at the path.

      • Supported keys: 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
      • If 'aws_access_key_id', 'aws_secret_access_key', or 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with S3 paths.
      • If nothing is given, credentials are fetched from environment variables. The same applies when creds is not passed for cloud datasets.

Returns:

Future[DatasetView]

    Resolves to a DatasetView that can be:

      • Used directly in ML training
      • Further filtered with additional queries
      • Converted to PyTorch/TensorFlow dataloaders
      • Materialized into a new dataset

Examples:

Basic async query:

# Run query asynchronously
future = deeplake.query_async('''
    SELECT * FROM "mem://embeddings"
    ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
''')

# Do other work while query runs
prepare_training()

# Get results when needed
results = future.result()

With async/await:

async def search_similar():
    results = await deeplake.query_async('''
        SELECT * FROM "mem://images"
        ORDER BY COSINE_SIMILARITY(embedding, ARRAY[0.1, 0.2, 0.3]) DESC
        LIMIT 100
    ''')
    return results

async def main():
    similar = await search_similar()

Non-blocking check:

future = deeplake.query_async(
    """SELECT * FROM "mem://dataset" WHERE train_split = 'train'"""
)

if future.is_completed():
    train_data = future.result()
else:
    print("Query still running...")

deeplake.prepare_query

prepare_query(
    query: str,
    token: str | None = None,
    creds: dict[str, str] | None = None,
) -> Executor

Prepares a TQL query for execution with optional authentication.

Parameters:

query (str, required)

    TQL query string to execute.

token (str | None, default None)

    Optional Activeloop authentication token.

creds (dict[str, str] | None, default None)

    Dictionary containing credentials used to access the dataset at the path.

Returns:

Executor

    An executor object to run the query.

Examples:

Running a parametrized batch query:

ex = deeplake.prepare_query('SELECT * FROM "mem://parametrized" WHERE category = ?')
results = ex.run_batch([["active"], ["inactive"]])
assert len(results) == 2

deeplake.explain_query

explain_query(
    query: str,
    token: str | None = None,
    creds: dict[str, str] | None = None,
) -> ExplainQueryResult

Explains TQL query with optional authentication.

Parameters:

query (str, required)

    TQL query string to explain.

token (str | None, default None)

    Optional Activeloop authentication token.

creds (dict[str, str] | None, default None)

    Dictionary containing credentials used to access the dataset at the path.

Returns:

ExplainQueryResult

    An explain result object to analyze the query.

Examples:

Explaining a query:

explain_result = deeplake.explain_query(
    """SELECT * FROM "mem://explain_query" WHERE category == 'active'"""
)
print(explain_result)

Query Classes

Executor

Prepared query execution object.

deeplake.Executor

get_query_string() -> str
run_single() -> DatasetView
run_single_async() -> Future[DatasetView]
run_batch(parameters: list = None) -> list
run_batch_async(parameters: list = None) -> Future[list[DatasetView]]
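
A minimal usage sketch (the "mem://dataset" path and label column are illustrative):

# Prepare once, inspect, then execute with several parameter rows
ex = deeplake.prepare_query('SELECT * FROM "mem://dataset" WHERE label = ?')
print(ex.get_query_string())               # the prepared TQL text
views = ex.run_batch([["cat"], ["dog"]])   # one DatasetView per parameter row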

ExplainQueryResult

Query explanation and analysis result.

deeplake.ExplainQueryResult

to_dict() -> Any

Vector Search

Search by vector similarity:

# Cosine similarity search; search_vector is a 1-D query embedding (list of floats)
text_vector = ','.join(str(x) for x in search_vector)
results = deeplake.query(f"""
    SELECT *
    FROM "s3://bucket/embeddings"
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[{text_vector}]) DESC
    LIMIT 100
""")

Text Search

Text search using BM25 or keyword matching:

# Semantic search using BM25
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/documents"
    ORDER BY BM25_SIMILARITY(text, 'search query') DESC
    LIMIT 10
""")

# Keyword search using CONTAINS
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/metadata"
    WHERE CONTAINS(keywords, 'specific term')
""")

Array Operations

Operate on multidimensional arrays:

# Select specific array dimensions
results = deeplake.query("""
    SELECT features[:, 0:10]
    FROM "s3://bucket/features"
""")

# Filter by array values
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/features"
    WHERE features[0] > 0.5
""")

# Aggregate array operations
results = deeplake.query("""
    SELECT AVG(features, axis=0)
    FROM "s3://bucket/features"
""")

Joining Datasets

Join data across different datasets and across different clouds:

# Join datasets from different storage
results = deeplake.query("""
    SELECT i.image, i.embedding, m.labels, m.metadata
    FROM "s3://bucket1/images" AS i
    JOIN "s3://bucket2/metadata" AS m 
    ON i.id = m.image_id
    WHERE m.verified = true
""")

# Complex join with filtering
results = deeplake.query("""
    SELECT 
        i.image,
        e.embedding,
        l.label
    FROM "s3://bucket1/images" AS i
    JOIN "gcs://bucket2/embeddings" AS e ON i.id = e.image_id
    JOIN "azure://container/labels" AS l ON i.id = l.image_id
    WHERE l.confidence > 0.9
    ORDER BY COSINE_SIMILARITY(e.embedding, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
""")

Filtering

Filter data using WHERE clauses:

# Simple filters
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/dataset"
    WHERE label = 'cat'
    AND confidence > 0.9
""")

# Combine with vector search
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/dataset"
    WHERE label IN ('cat', 'dog')
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
""")

Query Results

Process query results:

# Iterate through results
for item in results:
    image = item["images"]
    label = item["label"]

# Direct column access (recommended for performance)
images = results["images"][:]
labels = results["labels"][:]

Async Queries

Execute queries asynchronously:

# Run query asynchronously
future = deeplake.query_async("""
    SELECT *
    FROM "s3://bucket/dataset"
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, 0.3]) DESC
""")

# Get results when ready
results = future.result()

# Check completion
if future.is_completed():
    results = future.result()
else:
    print("Query still running")

Querying Views

Chain queries on views; a query against a view omits the FROM clause and runs on the view itself:

# Initial query
view = deeplake.query("SELECT * FROM \"s3://bucket/animals\"")

# Query on view
cats = view.query("SELECT * WHERE species = 'cat'")

# Further filter
large_cats = cats.query('SELECT * WHERE weight > 10')

Prepared Queries

Prepare queries for reuse with different parameters:

# Prepare a parameterized query
s_executor = deeplake.prepare_query("""
    SELECT *
    FROM "s3://bucket/dataset"
    WHERE label = 'cat'
    AND confidence > 0.5
""")

m_executor = deeplake.prepare_query("""
    SELECT * FROM "s3://bucket/dataset"
    WHERE label = ? AND confidence > ?
""")

# Execute the fixed query
cats_high = s_executor.run_single()

# Batch execution
results = m_executor.run_batch([
    ["cat", 0.9],
    ["dog", 0.8],
    ["bird", 0.7]
])

# Async batch execution with parameters
future = m_executor.run_batch_async([["cat", 0.95]])
result = future.result()[0]

# Get the query string
print(f"Query: {m_executor.get_query_string()}")

Query Explanation

Analyze query execution plans:

# Explain query performance
explanation = deeplake.explain_query("""
    SELECT * FROM "s3://bucket/large_dataset"
    WHERE category = 'cat'
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 1000
""")

# Print explanation
print(explanation)

# Get explanation as dictionary
explain_dict = explanation.to_dict()
print(f"Execution plan: {explain_dict}")

# Use explanation to optimize queries
if "index_used" in explain_dict:
    print("Query will use indexes for optimization")