Skip to content

TQL Syntax

Overview

Deep Lake offers a performant SQL-based query engine called "TQL" (Tensor Query Language) optimized for machine learning and AI workloads. TQL combines familiar SQL syntax with powerful tensor operations, enabling efficient querying of embeddings, images, and other multi-modal data.

Basic Usage

Dataset Queries

TQL can be used directly on a dataset or across multiple datasets:

# Query on a single dataset (no FROM needed)
ds = deeplake.open("al://org_name/dataset_name")
result = ds.query("SELECT * WHERE id > 10")

# Query across datasets (requires FROM)
result = deeplake.query('SELECT * FROM "al://my_org/dataset_name" WHERE id > 10')

Query Syntax

String Values

String literals must use single quotes:

SELECT * WHERE contains(column_name, 'text_value')

Special Characters

Column or dataset names with special characters need double quotes:

SELECT * WHERE contains("column-name", 'text_value')
SELECT * FROM "al://my_org/dataset" WHERE id > 10

Tip

When writing queries in Python, remember to properly escape quotes:

# Using escape characters
query = "SELECT * WHERE contains(\"column-name\", 'text_value')"

# Using different quote types
query = 'SELECT * WHERE contains("column-name", \'text_value\')'

# Using triple quotes
query = """
    SELECT * WHERE contains("column-name", 'text_value')
"""

Vector Operations

TQL provides multiple methods for vector similarity search:

-- Cosine similarity (higher is more similar)
SELECT * 
ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, ...]) DESC
LIMIT 100

-- L2 norm/Euclidean distance (lower is more similar)
SELECT * 
ORDER BY L2_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100

-- L1 norm/Manhattan distance
SELECT * 
ORDER BY L1_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100

-- L∞ norm/Chebyshev distance
SELECT * 
ORDER BY LINF_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100

TQL supports the MAXSIM operator for efficient similarity search with ColPali embeddings:

-- Using MAXSIM with ColPali embeddings
SELECT *, MAXSIM(
    document_embeddings,
    ARRAY[
        ARRAY[0.1, 0.2, 0.3],
        ARRAY[0.4, 0.5, 0.6],
        ARRAY[0.7, 0.8, 0.9]
    ]
) AS score
ORDER BY MAXSIM(
    document_embeddings,
    ARRAY[
        ARRAY[0.1, 0.2, 0.3],
        ARRAY[0.4, 0.5, 0.6],
        ARRAY[0.7, 0.8, 0.9]
    ]
) DESC
LIMIT 10

Semantic Search with BM25

-- Find semantically similar text
SELECT * 
ORDER BY BM25_SIMILARITY(text_column, 'search query text') DESC
LIMIT 10
-- Find exact keyword matches
SELECT * WHERE CONTAINS(text_column, 'keyword')

Advanced Features

Cross-Cloud Dataset Joins

TQL enables joining datasets across different cloud storage providers:

-- Join datasets from different storage providers
SELECT 
    i.image,
    i.embedding,
    m.labels,
    m.metadata
FROM "s3://bucket1/images" AS i
JOIN "gcs://bucket2/metadata" AS m 
    ON i.id = m.image_id
WHERE m.verified = true
ORDER BY COSINE_SIMILARITY(i.embedding, ARRAY[...]) DESC

Virtual Columns

Create computed columns on the fly:

-- Compute similarity scores
SELECT *, 
    COSINE_SIMILARITY(embedding, ARRAY[...]) as similarity_score
FROM dataset
ORDER BY similarity_score DESC

-- Complex computations
SELECT *,
    column_1 + column_3 as sum,
    any(boxes[:,0]) < 0 as box_beyond_image
WHERE label = 'person'

Logical Operations

-- Combining conditions
SELECT * 
WHERE (contains(text, 'machine learning')
    AND confidence > 0.9)
    OR label IN ('cat', 'dog')

-- Array operations
SELECT * 
WHERE any(logical_and(
    bounding_boxes[:,3] > 0.5,
    confidence > 0.8
))

Data Sampling

-- Weighted random sampling
SELECT * 
SAMPLE BY MAX_WEIGHT(
    high_confidence: 0.7,
    medium_confidence: 0.2,
    low_confidence: 0.1
) LIMIT 1000

-- Sampling with replacement
SELECT * 
SAMPLE BY MAX_WEIGHT(
    positive_samples: 0.5,
    negative_samples: 0.5
) replace True LIMIT 2000

Grouping and Sequences

-- Group frames into videos
SELECT * 
GROUP BY video_id, camera_id

-- Split videos into frames
SELECT * 
UNGROUP BY split

Built-in Functions

Array Operations

  • SHAPE(array): Returns array dimensions

    SELECT * WHERE SHAPE(embedding)[0] = 768
    

  • DATA(column, index): Access specific array elements

    SELECT * ORDER BY L2_NORM(embedding - data(embedding, 10))
    

Row Information

  • ROW_NUMBER(): Returns zero-based row offset
    SELECT *, ROW_NUMBER() WHERE ROW_NUMBER() < 100
    

Array Logic

  • ANY(), ALL(): Array-wise logical operations

    SELECT * WHERE any(confidence > 0.9)
    SELECT * WHERE all(scores > 0.5)
    

  • ALL_STRICT(): Stricter version of ALL

    -- Returns false for empty arrays
    SELECT * WHERE all_strict(values > 0)
    

Custom Functions

TQL supports registering custom Python functions:

# Define and register custom function
def custom_square(a):
    return a * a

deeplake.tql.register_function(custom_square)

# Use in query
results = ds.query("SELECT * WHERE custom_square(column_name) > 10")

Custom functions must:

  • Accept numpy arrays as input
  • Return numpy arrays as output
  • Be registered before use in queries