Skip to content

Query

Deep Lake provides powerful query capabilities through its Tensor Query Language (TQL), with special focus on vector similarity search, text search, and operations on multidimensional arrays.

Query Functions

deeplake.query

query(query: str, token: str | None = None) -> DatasetView

Executes TQL queries optimized for ML data filtering and search.

TQL is a SQL-like query language designed for ML datasets, supporting:

  • Vector similarity search
  • Text semantic search
  • Complex data filtering
  • Joining across datasets
  • Efficient sorting and pagination

Parameters:

Name Type Description Default
query str

TQL query string supporting:

  • Vector similarity: COSINE_SIMILARITY, L2_NORM
  • Text search: BM25_SIMILARITY, CONTAINS
  • MAXSIM similarity for ColPali embeddings: MAXSIM
  • Filtering: WHERE clauses
  • Sorting: ORDER BY
  • Joins: JOIN across datasets
required
token str | None

Optional Activeloop authentication token

None

Returns:

Name Type Description
DatasetView DatasetView

Query results that can be:

  • Used directly in ML training
  • Further filtered with additional queries
  • Converted to PyTorch/TensorFlow dataloaders
  • Materialized into a new dataset

Examples:

Vector similarity search:

# Find similar embeddings
similar = deeplake.query('''
    SELECT * FROM "mem://embeddings" 
    ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
''')

# Use results in training
dataloader = similar.pytorch()

Text semantic search:

# Search documents using BM25
relevant = deeplake.query('''
    SELECT * FROM "mem://documents"
    ORDER BY BM25_SIMILARITY(text, 'machine learning') DESC
    LIMIT 10
''')

Complex filtering:

# Filter training data
train = deeplake.query('''
    SELECT * FROM "mem://dataset"
    WHERE "split" = 'train' 
    AND confidence > 0.9
    AND label IN ('cat', 'dog')
''')

Joins for feature engineering:

# Combine image features with metadata
features = deeplake.query('''
    SELECT i.image, i.embedding, m.labels, m.metadata
    FROM "mem://images" AS i
    JOIN "mem://metadata" AS m ON i.id = m.image_id
    WHERE m.verified = true
''')

deeplake.query_async

query_async(query: str, token: str | None = None) -> Future

Asynchronously executes TQL queries optimized for ML data filtering and search.

Non-blocking version of query() for better performance with large datasets. Supports the same TQL features including vector similarity search, text search, filtering, and joins.

Parameters:

Name Type Description Default
query str

TQL query string supporting: - Vector similarity: COSINE_SIMILARITY, EUCLIDEAN_DISTANCE - Text search: BM25_SIMILARITY, CONTAINS - Filtering: WHERE clauses - Sorting: ORDER BY - Joins: JOIN across datasets

required
token str | None

Optional Activeloop authentication token

None

Returns:

Name Type Description
Future Future

Resolves to DatasetView that can be: - Used directly in ML training - Further filtered with additional queries - Converted to PyTorch/TensorFlow dataloaders - Materialized into a new dataset

Examples:

Basic async query:

# Run query asynchronously
future = deeplake.query_async('''
    SELECT * FROM "mem://embeddings"
    ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
''')

# Do other work while query runs
prepare_training()

# Get results when needed
results = future.result()

With async/await:

async def search_similar():
    results = await deeplake.query_async('''
        SELECT * FROM "mem://images"
        ORDER BY COSINE_SIMILARITY(embedding, ARRAY[0.1, 0.2, 0.3]) DESC
        LIMIT 100
    ''')
    return results

async def main():
    similar = await search_similar()

Non-blocking check:

future = deeplake.query_async(
    "SELECT * FROM dataset WHERE \"split\" = 'train'"
)

if future.is_completed():
    train_data = future.result()
else:
    print("Query still running...")

Search by vector similarity:

# Cosine similarity search
text_vector = ','.join(str(x) for x in search_vector)
results = deeplake.query(f"""
    SELECT *
    FROM "s3://bucket/embeddings"
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[{text_vector}]) DESC
    LIMIT 100
""")

Text search using BM25 or keyword matching:

# Semantic search using BM25
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/documents"
    ORDER BY BM25_SIMILARITY(text, 'search query') DESC
    LIMIT 10
""")

# Keyword search using CONTAINS
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/metadata"
    WHERE CONTAINS(keywords, 'specific term')
""")

Array Operations

Operate on multidimensional arrays:

# Select specific array dimensions
results = deeplake.query("""
    SELECT features[:, 0:10]
    FROM "s3://bucket/features"
""")

# Filter by array values
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/features"
    WHERE features[0] > 0.5
""")

# Aggregate array operations
results = deeplake.query("""
    SELECT AVG(features, axis=0)
    FROM "s3://bucket/features"
""")

Joining Datasets

Join data across different datasets and across different clouds:

# Join datasets from different storage
results = deeplake.query("""
    SELECT i.image, i.embedding, m.labels, m.metadata
    FROM "s3://bucket1/images" AS i
    JOIN "s3://bucket2/metadata" AS m 
    ON i.id = m.image_id
    WHERE m.verified = true
""")

# Complex join with filtering
results = deeplake.query("""
    SELECT 
        i.image,
        e.embedding,
        l.label
    FROM "s3://bucket1/images" AS i
    JOIN "gcs://bucket2/embeddings" AS e ON i.id = e.image_id
    JOIN "azure://container/labels" AS l ON i.id = l.image_id
    WHERE l.confidence > 0.9
    ORDER BY COSINE_SIMILARITY(e.embedding, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
""")

Filtering

Filter data using WHERE clauses:

# Simple filters
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/dataset"
    WHERE label = 'cat'
    AND confidence > 0.9
""")

# Combine with vector search
results = deeplake.query("""
    SELECT *
    FROM "s3://bucket/dataset"
    WHERE label IN ('cat', 'dog')
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, 0.3]) DESC
    LIMIT 100
""")

Query Results

Process query results:

# Iterate through results
for item in results:
    image = item["images"]
    label = item["label"]

# Direct column access (recommended for performance)
images = results["images"][:]
labels = results["labels"][:]

Async Queries

Execute queries asynchronously:

# Run query asynchronously
future = deeplake.query_async("""
    SELECT *
    FROM "s3://bucket/dataset"
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1,0.2, 0.3]) DESC
""")

# Get results when ready
results = future.result()

# Check completion
if future.is_completed():
    results = future.result()
else:
    print("Query still running")

Querying Views

Chain queries on views:

# Initial query
view = deeplake.query("SELECT * FROM \"s3://bucket/animals\"")

# Query on view
cats = view.query("SELECT * WHERE species = 'cat'")

# Further filter
large_cats = cats.query('SELECT * WHERE weight > 10')