TQL Syntax
Overview
Deep Lake offers a performant SQL-based query engine called "TQL" (Tensor Query Language) optimized for machine learning and AI workloads. TQL combines familiar SQL syntax with powerful tensor operations, enabling efficient querying of embeddings, images, and other multi-modal data.
Basic Usage
Dataset Queries
TQL can be used directly on a dataset or across multiple datasets:
# Query on a single dataset (no FROM needed)
ds = deeplake.open("al://org_name/dataset_name")
result = ds.query("SELECT * WHERE id > 10")
# Query across datasets (requires FROM)
result = deeplake.query('SELECT * FROM "al://my_org/dataset_name" WHERE id > 10')
Query Syntax
String Values
String literals must use single quotes:
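For example, reusing the filter patterns shown elsewhere on this page (column names are illustrative):
SELECT * WHERE contains(text, 'machine learning')
SELECT * WHERE label = 'person'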
Special Characters
Column or dataset names with special characters need double quotes:
SELECT * WHERE contains("column-name", 'text_value')
SELECT * FROM "al://my_org/dataset" WHERE id > 10
Tip: When writing queries in Python, remember to properly escape quotes:
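For instance, the examples above can be embedded in Python strings by escaping the inner double quotes or switching the outer quote style:
result = ds.query("SELECT * WHERE contains(\"column-name\", 'text_value')")
result = deeplake.query('SELECT * FROM "al://my_org/dataset" WHERE id > 10')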
Vector Operations
Similarity Search
TQL provides multiple methods for vector similarity search:
-- Cosine similarity (higher is more similar)
SELECT *
ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, ...]) DESC
LIMIT 100
-- L2 norm/Euclidean distance (lower is more similar)
SELECT *
ORDER BY L2_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100
-- L1 norm/Manhattan distance
SELECT *
ORDER BY L1_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100
-- L∞ norm/Chebyshev distance
SELECT *
ORDER BY LINF_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100
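In Python, the query vector is typically interpolated into the TQL string before calling query. A minimal sketch, assuming an embeddings column and a placeholder query vector:
# Build the ARRAY[...] literal from a Python list of floats
query_vector = [0.1, 0.2, 0.3]  # placeholder; use your embedding model's output
vector_str = ", ".join(str(v) for v in query_vector)
result = ds.query(f"""
    SELECT *
    ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[{vector_str}]) DESC
    LIMIT 100
""")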
ColPali MAXSIM Search
TQL supports the MAXSIM operator for efficient similarity search with ColPali embeddings:
-- Using MAXSIM with ColPali embeddings
SELECT *, MAXSIM(
    document_embeddings,
    ARRAY[
        ARRAY[0.1, 0.2, 0.3],
        ARRAY[0.4, 0.5, 0.6],
        ARRAY[0.7, 0.8, 0.9]
    ]
) AS score
ORDER BY MAXSIM(
    document_embeddings,
    ARRAY[
        ARRAY[0.1, 0.2, 0.3],
        ARRAY[0.4, 0.5, 0.6],
        ARRAY[0.7, 0.8, 0.9]
    ]
) DESC
LIMIT 10
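As with the single-vector searches above, the nested ARRAY literal is usually built from Python lists. A rough sketch, assuming document_embeddings stores per-token ColPali vectors and query_vectors is a placeholder multi-vector query embedding:
# Placeholder ColPali-style query embedding (one vector per query token)
query_vectors = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
inner = ", ".join(
    "ARRAY[" + ", ".join(str(v) for v in vec) + "]" for vec in query_vectors
)
maxsim_expr = f"MAXSIM(document_embeddings, ARRAY[{inner}])"
result = ds.query(
    f"SELECT *, {maxsim_expr} AS score ORDER BY {maxsim_expr} DESC LIMIT 10"
)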
Text Search
Semantic Search with BM25
-- Find semantically similar text
SELECT *
ORDER BY BM25_SIMILARITY(text_column, 'search query text') DESC
LIMIT 10
Keyword Search
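For exact keyword matching, the contains function used throughout this page can serve as the filter; a minimal sketch with an illustrative column name:
-- Filter rows whose text contains a keyword
SELECT *
WHERE contains(text_column, 'neural network')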
Advanced Features
Cross-Cloud Dataset Joins
TQL enables joining datasets across different cloud storage providers:
-- Join datasets from different storage providers
SELECT
    i.image,
    i.embedding,
    m.labels,
    m.metadata
FROM "s3://bucket1/images" AS i
JOIN "gcs://bucket2/metadata" AS m
    ON i.id = m.image_id
WHERE m.verified = true
ORDER BY COSINE_SIMILARITY(i.embedding, ARRAY[...]) DESC
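Because this query spans more than one dataset, it goes through deeplake.query rather than ds.query; a sketch reusing the placeholder bucket paths above:
result = deeplake.query('''
    SELECT i.image, m.labels
    FROM "s3://bucket1/images" AS i
    JOIN "gcs://bucket2/metadata" AS m ON i.id = m.image_id
    WHERE m.verified = true
''')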
Virtual Columns
Create computed columns on the fly:
-- Compute similarity scores
SELECT *,
    COSINE_SIMILARITY(embedding, ARRAY[...]) as similarity_score
FROM dataset
ORDER BY similarity_score DESC
-- Complex computations
SELECT *,
    column_1 + column_3 as sum,
    any(boxes[:,0] < 0) as box_beyond_image
WHERE label = 'person'
Logical Operations
-- Combining conditions
SELECT *
WHERE (contains(text, 'machine learning')
       AND confidence > 0.9)
   OR label IN ('cat', 'dog')
-- Array operations
SELECT *
WHERE any(logical_and(
    bounding_boxes[:,3] > 0.5,
    confidence > 0.8
))
Data Sampling
-- Weighted random sampling
SELECT *
SAMPLE BY MAX_WEIGHT(
    high_confidence: 0.7,
    medium_confidence: 0.2,
    low_confidence: 0.1
) LIMIT 1000
-- Sampling with replacement
SELECT *
SAMPLE BY MAX_WEIGHT(
    positive_samples: 0.5,
    negative_samples: 0.5
) replace True LIMIT 2000
Grouping and Sequences
-- Group frames into videos
SELECT *
GROUP BY video_id, camera_id
-- Split videos into frames
SELECT *
UNGROUP BY split
Built-in Functions
Array Operations
- SHAPE(array): Returns array dimensions
- DATA(column, index): Access specific array elements
Row Information
- ROW_NUMBER(): Returns zero-based row offset
Array Logic
- ANY(), ALL(): Array-wise logical operations
- ALL_STRICT(): Stricter version of ALL
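A sketch of how these can appear in queries (column names are illustrative):
-- Keep only the first 100 rows using the zero-based offset
SELECT * WHERE ROW_NUMBER() < 100
-- Expose an array column's dimensions as a virtual column
SELECT *, SHAPE(boxes) as box_shape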
Custom Functions
TQL supports registering custom Python functions:
# Define and register custom function
def custom_square(a):
    return a * a

deeplake.tql.register_function(custom_square)

# Use in query
results = ds.query("SELECT * WHERE custom_square(column_name) > 10")
Custom functions must:
- Accept numpy arrays as input
- Return numpy arrays as output
- Be registered before use in queries
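A minimal sketch that respects these constraints (the function name, column, and threshold are illustrative):
import numpy as np
import deeplake

def score_range(values):
    # Receives the row's column value as a numpy array and must
    # return a numpy array as well
    return np.asarray(values.max() - values.min())

deeplake.tql.register_function(score_range)
results = ds.query("SELECT * WHERE score_range(scores) > 0.5")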