Skip to content

TQL Syntax

Overview

Deep Lake offers a performant SQL-based query engine called "TQL" (Tensor Query Language) optimized for machine learning and AI workloads. TQL combines familiar SQL syntax with powerful tensor operations, enabling efficient querying of embeddings, images, and other multi-modal data.

Basic Usage

Dataset Queries

TQL can be used directly on a dataset or across multiple datasets:

# Query on a single dataset (no FROM needed)
ds = deeplake.open("al://org_name/dataset_name")
result = ds.query("SELECT * WHERE id > 10")

# Query across datasets (requires FROM)
result = deeplake.query('SELECT * FROM "al://my_org/dataset_name" WHERE id > 10')

Query Syntax

String Values

String literals must use single quotes:

SELECT * WHERE contains(column_name, 'text_value')

Special Characters

Column or dataset names with special characters need double quotes:

SELECT * WHERE contains("column-name", 'text_value')
SELECT * FROM "al://my_org/dataset" WHERE id > 10

Tip

When writing queries in Python, remember to properly escape quotes:

# Using escape characters
query = "SELECT * WHERE contains(\"column-name\", 'text_value')"

# Using different quote types
query = 'SELECT * WHERE contains("column-name", \'text_value\')'

# Using triple quotes
query = """
    SELECT * WHERE contains("column-name", 'text_value')
"""

Vector Operations

TQL provides multiple methods for vector similarity search:

-- Cosine similarity (higher is more similar)
SELECT * 
ORDER BY COSINE_SIMILARITY(embeddings, ARRAY[0.1, 0.2, ...]) DESC
LIMIT 100

-- L2 norm/Euclidean distance (lower is more similar)
SELECT * 
ORDER BY L2_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100

-- L1 norm/Manhattan distance
SELECT * 
ORDER BY L1_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100

-- L∞ norm/Chebyshev distance
SELECT * 
ORDER BY LINF_NORM(embeddings - ARRAY[0.1, 0.2, ...]) ASC
LIMIT 100

TQL supports the MAXSIM operator for efficient similarity search with ColPali embeddings:

-- Using MAXSIM with ColPali embeddings
SELECT *, MAXSIM(
    document_embeddings,
    ARRAY[
        ARRAY[0.1, 0.2, 0.3],
        ARRAY[0.4, 0.5, 0.6],
        ARRAY[0.7, 0.8, 0.9]
    ]
) AS score
ORDER BY MAXSIM(
    document_embeddings,
    ARRAY[
        ARRAY[0.1, 0.2, 0.3],
        ARRAY[0.4, 0.5, 0.6],
        ARRAY[0.7, 0.8, 0.9]
    ]
) DESC
LIMIT 10

JSON and Dictionary Operations

TQL provides powerful operations for querying JSON and dictionary-like data:

-- Access JSON fields
SELECT metadata['timestamp'] as time, metadata['author'] as author
WHERE metadata['status'] = 'published'

-- Query nested JSON structures
SELECT * WHERE config['model']['type'] = 'transformer'

-- Array indexing in JSON
SELECT data[0] as first_item, data[-1] as last_item

-- Get all keys from JSON object
SELECT KEYS(metadata) as available_fields

-- Check if key exists
SELECT * WHERE ANY(KEYS(metadata) == 'timestamp')

-- Filter by JSON field values
SELECT * WHERE metadata['score'] > 0.8 AND metadata['verified'] = true

JSON Access Syntax

  • Use square brackets ['key'] for string keys
  • Use square brackets [0] for array indices
  • Negative indices work for arrays: [-1] gets the last element

Semantic Search with BM25

-- Find semantically similar text
SELECT *
ORDER BY BM25_SIMILARITY(text_column, 'search query text') DESC
LIMIT 10
-- Find exact keyword matches (requires Inverted index)
SELECT * WHERE CONTAINS(text_column, 'keyword')

-- Find whole text matches (requires Exact index)  
SELECT * WHERE text_column = 'exact full text match'
-- Find exact text matches (requires Exact index)
SELECT * WHERE EQUALS(text_column, 'exact text match')

Numeric Queries

Numeric columns with indexes support efficient comparison operations:

-- Range queries (requires Numeric Inverted index)
SELECT * WHERE numeric_column > 100
SELECT * WHERE numeric_column >= 50 AND numeric_column <= 200
SELECT * WHERE numeric_column BETWEEN 10 AND 50

-- Value list matching
SELECT * WHERE numeric_column IN (10, 20, 30, 40)

Pattern Matching with LIKE

TQL supports SQL-style pattern matching with wildcards:

-- Using wildcards (* for any characters, ? for single character)
SELECT * WHERE text_column LIKE '*apple*'  -- Contains 'apple'
SELECT * WHERE text_column LIKE 'data?'    -- Matches 'data' followed by one char
SELECT * WHERE filename LIKE '*.jpg'       -- Ends with .jpg

-- Exact match with LIKE
SELECT * WHERE category LIKE 'product'

-- Case-insensitive pattern matching with ILIKE
SELECT * WHERE text_column ILIKE '*APPLE*'  -- Matches 'apple', 'Apple', 'APPLE', etc.
SELECT * WHERE email ILIKE '*@gmail.com'    -- Case-insensitive email domain match
SELECT * WHERE title ILIKE 'chapter?'       -- Matches 'Chapter1', 'chapter2', 'CHAPTER3', etc.

Note

  • TQL uses * for multiple characters and ? for single character wildcards, similar to shell globbing patterns
  • LIKE is case-sensitive, while ILIKE is case-insensitive

Advanced Features

Cross-Cloud Dataset Joins

TQL enables joining datasets across different cloud storage providers:

-- Join datasets from different storage providers
SELECT 
    i.image,
    i.embedding,
    m.labels,
    m.metadata
FROM "s3://bucket1/images" AS i
JOIN "gcs://bucket2/metadata" AS m 
    ON i.id = m.image_id
WHERE m.verified = true
ORDER BY COSINE_SIMILARITY(i.embedding, ARRAY[...]) DESC

Virtual Columns

Create computed columns on the fly:

-- Compute similarity scores
SELECT *,
    COSINE_SIMILARITY(embedding, ARRAY[...]) as similarity_score
FROM dataset
ORDER BY similarity_score DESC

-- Complex computations
SELECT *,
    column_1 + column_3 as sum,
    any(boxes[:,0]) < 0 as box_beyond_image
WHERE label = 'person'

Subqueries

TQL supports nested queries for complex data processing:

-- Basic subquery
SELECT * FROM (
    SELECT * WHERE labels < 4
) WHERE labels > 2

-- Subquery with computed columns
SELECT * FROM (
    SELECT *, COSINE_SIMILARITY(embedding, data(embedding, 1)) as score
) ORDER BY score DESC LIMIT 10

-- Multiple subqueries with joins
SELECT *
FROM (SELECT * FROM "s3://bucket/dataset1" WHERE verified = true) AS t1
JOIN (SELECT * FROM "gcs://bucket/dataset2" WHERE active = true) AS t2
ON t1.id = t2.user_id

Parameterized Queries

Use placeholders for dynamic query execution:

# Define parameterized query with ? placeholders
query = "SELECT * WHERE age > ? AND category = ? AND score < ?"

# Execute with parameters
# result = ds.query(query, parameters=[25, 'premium', 0.9])

# Parameterized sampling
# query = "SELECT * SAMPLE BY sum_weight(labels == ?: 10, True: 1) LIMIT ?"
# result = ds.query(query, parameters=[5, 1000])

Query Parameters

Use ? as placeholders in your query and pass actual values via the parameters argument. This improves performance by allowing query plan reuse.

Logical Operations

-- Combining conditions
SELECT * 
WHERE (contains(text, 'machine learning')
    AND confidence > 0.9)
    OR label IN ('cat', 'dog')

-- Array operations
SELECT * 
WHERE any(logical_and(
    bounding_boxes[:,3] > 0.5,
    confidence > 0.8
))

Data Sampling

-- Weighted random sampling
SELECT * 
SAMPLE BY MAX_WEIGHT(
    high_confidence: 0.7,
    medium_confidence: 0.2,
    low_confidence: 0.1
) LIMIT 1000

-- Sampling with replacement
SELECT * 
SAMPLE BY MAX_WEIGHT(
    positive_samples: 0.5,
    negative_samples: 0.5
) replace True LIMIT 2000

Set Operations

TQL supports combining results from multiple queries:

-- Union: Combine results from multiple queries
(SELECT * WHERE labels == 1 LIMIT 10)
UNION
(SELECT * WHERE labels == 2 LIMIT 10)

-- Union with different conditions
(SELECT * WHERE category = 'training' LIMIT 1000)
UNION
(SELECT * WHERE category = 'validation' LIMIT 200)

Data Expansion

Expand image or tensor data with optional overlap control:

-- Expand data by specified dimensions
SELECT * WHERE labels == 0
EXPAND BY 2 2 as expanded_data

-- Expand with overlap enabled
SELECT * WHERE category = 'images'
EXPAND BY 4 4 OVERLAP true as overlapped_tiles

-- Example: Create image patches for training
SELECT image, labels
EXPAND BY 224 224 as patches
WHERE dataset_split = 'train'

Image Tiling

EXPAND BY is particularly useful for creating sliding window patches from large images for computer vision tasks.

Grouping and Sequences

-- Group frames into videos
SELECT *
GROUP BY video_id, camera_id

-- Group by multiple columns
SELECT *
GROUP BY category, subcategory

-- Split videos into frames
SELECT *
UNGROUP BY split

Built-in Functions

Array Operations

  • SHAPE(array): Returns array dimensions

    SELECT * WHERE SHAPE(embedding)[0] = 768
    SELECT * WHERE SHAPE(boxes)[0] > 10  -- More than 10 bounding boxes
    

  • DATA(column, index): Access specific array elements for comparison

    SELECT * ORDER BY L2_NORM(embedding - data(embedding, 10))
    

  • NONZERO(array): Returns indices of non-zero elements

    -- Returns rows where the first non-zero element is less than 10
    SELECT * WHERE NONZERO(scores)[0] < 10
    

Row Information

  • ROW_NUMBER(): Returns zero-based row offset
    SELECT *, ROW_NUMBER() WHERE ROW_NUMBER() < 100
    SELECT *, ROW_NUMBER() as id  -- Create virtual ID column
    

Array Logic

  • ANY(condition[, axis]): True if any element satisfies condition

    SELECT * WHERE ANY(confidence > 0.9)
    SELECT * WHERE ANY(boxes[:,0] < 0)  -- Any box beyond left edge
    

  • ALL(condition[, axis]): True if all elements satisfy condition

    SELECT * WHERE ALL(scores > 0.5)
    SELECT * WHERE ALL(pixels < 255)
    

  • ALL_STRICT(condition): Stricter version of ALL (returns false for empty arrays)

    SELECT * WHERE ALL_STRICT(values > 0)
    SELECT * WHERE ALL_STRICT(boxes[:,3] > 200)
    

  • LOGICAL_AND(array1, array2): Element-wise logical AND

    SELECT * WHERE ANY(LOGICAL_AND(confidence > 0.8, area > 100))
    

  • LOGICAL_OR(array1, array2): Element-wise logical OR

    SELECT * WHERE ANY(LOGICAL_OR(type == 'car', type == 'truck'))
    

Aggregations

  • SUM(array): Sum of all elements

    SELECT SUM(prices) as total WHERE category = 'electronics'
    

  • AVG(array): Average of all elements

    SELECT category, AVG(scores) as avg_score GROUP BY category
    

  • PROD(array): Product of all elements

    SELECT PROD(dimensions) as volume
    

  • AMIN(array): Minimum value

    SELECT AMIN(confidence) WHERE label = 'person'
    

  • AMAX(array): Maximum value

    SELECT AMAX(confidence) WHERE label = 'person'
    

Mathematical Functions

  • SQRT(value): Square root

    SELECT * WHERE SQRT(area) > 10
    

  • ABS(value): Absolute value

    SELECT * WHERE ABS(temperature - 20) < 5
    

Vector Distance Functions

  • DOT(vector1, vector2): Dot product of two vectors

    SELECT *, DOT(embedding, ARRAY[0.1, 0.2, ...]) as dot_score
    ORDER BY dot_score DESC
    

  • HAMMING_DISTANCE(vector1, vector2): Hamming distance between vectors

    SELECT * ORDER BY HAMMING_DISTANCE(binary_vector, ARRAY[0, 1, 1, 0]) ASC
    

JSON/Dictionary Operations

  • KEYS(json_object): Returns keys from JSON object
    SELECT KEYS(metadata) as available_fields
    SELECT * WHERE ANY(KEYS(metadata) == 'timestamp')
    

Random Numbers

  • RANDOM(): Returns a random 32-bit integer number
    -- Shuffle the dataset
    SELECT * ORDER BY RANDOM()
    
    -- Random sampling with ordering
    SELECT * WHERE RANDOM() % 10 == 0  -- ~10% random sample
    

Custom Functions

TQL supports registering custom Python functions:

# Define and register custom function
def custom_square(a):
    return a * a

deeplake.tql.register_function(custom_square)

# Use in query
results = ds.query("SELECT * WHERE custom_square(column_name) > 10")

Custom functions must:

  • Accept numpy arrays as input
  • Return numpy arrays as output
  • Be registered before use in queries

Query Syntax Reference

Operators

Arithmetic Operators

SELECT column_1 + column_2 as sum
SELECT column_1 - column_2 as diff
SELECT column_1 * column_2 as product
SELECT column_1 / column_2 as ratio
SELECT value % 10 as modulo

Comparison Operators

SELECT * WHERE value == 5    -- or value = 5
SELECT * WHERE value != 5
SELECT * WHERE value > 5
SELECT * WHERE value >= 5
SELECT * WHERE value < 5
SELECT * WHERE value <= 5
SELECT * WHERE value BETWEEN 1 AND 10
SELECT * WHERE value IN (1, 2, 3, 4, 5)

Logical Operators

SELECT * WHERE condition1 AND condition2
SELECT * WHERE condition1 OR condition2
SELECT * WHERE NOT condition

Array Indexing and Slicing

-- Single element access
SELECT column[0] as first_element
SELECT column[-1] as last_element

-- Slicing (Python-style)
SELECT column[0:10] as first_ten
SELECT column[10:] as from_tenth
SELECT column[:10] as up_to_tenth
SELECT column[::2] as every_other

-- Multi-dimensional arrays
SELECT boxes[:,0] as x_coords
SELECT matrix[0:5, 0:3] as submatrix
SELECT * WHERE any(boxes[:,3] > 200)  -- Height > 200

Join Types

-- Inner join (only matching rows)
SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.id

-- Left join (all from left, matching from right)
SELECT * FROM table1 LEFT JOIN table2 ON table1.id = table2.id

-- Right join (all from right, matching from left)
SELECT * FROM table1 RIGHT JOIN table2 ON table1.id = table2.id

-- Full join (all from both)
SELECT * FROM table1 FULL JOIN table2 ON table1.id = table2.id

-- Cross join (cartesian product)
SELECT * FROM table1 CROSS JOIN table2

-- Using clause (when column names match)
SELECT * FROM table1 JOIN table2 USING(id)

Query Modifiers

-- Limit results
SELECT * LIMIT 100
SELECT * LIMIT 0.1 PERCENT  -- Get 0.1% of results

-- Order results
SELECT * ORDER BY column ASC
SELECT * ORDER BY column DESC
SELECT * ORDER BY column1 ASC, column2 DESC

-- Sampling
SELECT * SAMPLE BY 0.1  -- 10% random sample
SELECT * SAMPLE BY labels  -- Sample by label distribution
SELECT * SAMPLE BY labels REPLACE true  -- With replacement

Index Creation for Optimal Performance

Text Indexes

Create indexes on text columns to enable efficient text search:

# Create column with Inverted index for keyword search
ds.add_column("description", deeplake.types.Text(deeplake.types.Inverted))

# Create column with BM25 index for semantic search
ds.add_column("content", deeplake.types.Text(deeplake.types.BM25))

# Add both Inverted and BM25 indexes to different columns
ds.add_column("document", deeplake.types.Text(deeplake.types.Inverted))
ds.add_column("article", deeplake.types.Text())
ds["article"].create_index(deeplake.types.TextIndex(deeplake.types.BM25))

# Create index on new Text column
ds.add_column("existing_text", deeplake.types.Text())
ds["existing_text"].create_index(deeplake.types.TextIndex(deeplake.types.Inverted))

Usage with queries:

-- Keyword search (requires Inverted index)
SELECT * WHERE CONTAINS(description, 'machine learning')

-- Semantic search (requires BM25 index)
SELECT * ORDER BY BM25_SIMILARITY(content, 'deep learning tutorial') DESC LIMIT 10

-- Pattern matching (requires Inverted index)
SELECT * WHERE description LIKE '*neural network*'

Numeric Indexes

Create inverted indexes on numeric columns for efficient range queries:

# Create column with Numeric Inverted index
ds.add_column("age", deeplake.types.UInt64())
ds["age"].create_index(
    deeplake.types.NumericIndex(deeplake.types.Inverted)
)

# Or on existing column - first create the column
ds.add_column("price", deeplake.types.Float64())
ds["price"].create_index(
    deeplake.types.NumericIndex(deeplake.types.Inverted)
)

Usage with queries:

-- Range queries (requires Numeric Inverted index)
SELECT * WHERE age BETWEEN 18 AND 65
SELECT * WHERE price > 100 AND price <= 1000

-- Value list matching
SELECT * WHERE category_id IN (1, 3, 5, 7, 9)

Embedding Indexes

For vector similarity search on embeddings:

# Create Embedding column with Clustered index (default)
ds.add_column("embedding", deeplake.types.Embedding(size=768))

# Create with Quantized index for faster search with larger datasets
ds.add_column(
    "embedding_quantized",
    deeplake.types.Embedding(
        size=768,
        index_type=deeplake.types.EmbeddingIndex(deeplake.types.ClusteredQuantized)
    )
)

# Add Clustered index to existing Array column - first create the column
ds.add_column("vectors", deeplake.types.Array(dtype=deeplake.types.Float32(), shape=(128,)))
ds["vectors"].create_index(
    deeplake.types.EmbeddingIndex(deeplake.types.Clustered)
)

Usage with queries:

-- Cosine similarity search
SELECT * ORDER BY COSINE_SIMILARITY(embedding, ARRAY[...]) DESC LIMIT 10

-- Combined filtering and similarity search
SELECT * WHERE category = 'products'
ORDER BY COSINE_SIMILARITY(embedding, ARRAY[...]) DESC
LIMIT 20

Managing Indexes

# Check existing indexes
print(ds["column_name"].indexes)

# Drop an index - use columns we created earlier
ds["article"].drop_index(deeplake.types.TextIndex(deeplake.types.BM25))
ds["age"].drop_index(
    deeplake.types.NumericIndex(deeplake.types.Inverted)
)

# Commit changes
ds.commit()

Index Best Practices

  • Inverted indexes are essential for CONTAINS, IN LIKE, and range queries
  • BM25 indexes enable semantic text search via BM25_SIMILARITY
  • Create indexes before ingesting large amounts of data when possible
  • Use ClusteredQuantized for embeddings when dataset size > 10M rows
  • Indexes are automatically maintained as you add/update data