TQL Syntax

Overview

Deep Lake offers a performant SQL-based query engine for data analysis called "TQL" (Tensor Query Language).

TQL can be used directly against a dataset with ds.query(), or using the deeplake.query() method to span datasets.

When calling query() directly a dataset, there is no need to specify the "FROM" clause.

ds = deeplake.open("al://org_name/dataset_name")
result = ds.query("SELECT * WHERE id > 10")

However, when calling the static deeplake.query() on a dataset, the "FROM" clause is required, specifying the URL of the dataset(s) to query.

result = deeplake.query('SELECT * FROM "al://my_org/dataset_name" WHERE id > 10')

Quoting

String values should be wrapped with single quotes:

SELECT * WHERE contains(column_name, 'text_value')

Column or dataset names with special characters should be wrapped with double-quotes:

SELECT * WHERE contains("column-name", 'text_value')

SELECT * FROM "al://my_org/dataset" WHERE id > 10

Tip

Make sure to wrap double-quotes with escape characters in Python, or use different string quoting:

ds = deeplake.query("SELECT * WHERE contains(\"column-name\", 'text_value')")

ds = deeplake.query('SELECT * WHERE contains("column-name", \'text_value\')')

ds = deeplake.query("""
    SELECT * WHERE contains("column-name", 'text_value')
""")

Virtual Columns

Virtual columns are the result of a computation and are not columns in the Deep Lake dataset. However, they can be treated as columns in the API.

-- "score" is a virtual column
SELECT * FROM (SELECT column_1, column_2, <VECTOR_OPERATION> as score)
    ORDER BY score DESC LIMIT 10

-- "box_beyond_image" is a virtual column
SELECT *, any(boxes[:,0])<0 as box_beyond_image WHERE ....

-- "column_sum" is a virtual column
-- When combining embedding search with filtering (where conditions),
-- the filter condition is evaluated prior to the embedding search.
SELECT *, column_1 + column_3 as column_sum WHERE ......

Keywords

Most common SQL keywords are supported in TQL, including:

AND, OR, NOT

Standard logical operators

SELECT * WHERE contains(column_name, 'text_value')
    and NOT contains(column_name_2, numeric_value)

SELECT * WHERE contains(column_name, 'text_value')
    or column_name_2 == numeric_value

SELECT * WHERE (contains(column_name, 'text_value')
    and shape(column_name_2)[dimension_index]>numeric_value)
    or contains(column_name, 'text_value_2')

JOIN

With Deep Lake, you can use JOIN to combine data from two datasets regardless of where they live.

SELECT * FROM "s3://bucket/users" AS users JOIN "az://other_bucket/likes" AS likes on users.id == likes.user_id

UNION and INTERSECT

UNION and INTERSECT are used to combine or intersect the results of two queries

(SELECT * WHERE contains(column_name, 'value'))
    intersect (SELECT * WHERE contains(column_name, 'value_2'))

(SELECT * WHERE contains(column_name, 'value') LIMIT 100)
    union (SELECT * WHERE shape(column_name)[0] > numeric_value LIMIT 100)

SAMPLE BY

SELECT * SAMPLE BY weight_choice(
    expression_1: weight_1,
    expression_2: weight_2, ...)
    replace True LIMIT N

weight_choice resolves the weight that is used when multiple expressions evaluate to True for a given sample. Options are max_weight, sum_weight. For example, if weight_choice is max_weight, then the maximum weight will be chosen for that sample.
replace determines whether samples should be drawn with replacement. It defaults to True.
limit specifies the number of samples that should be returned. If unspecified, the sampler will return the number of samples corresponding to the length of the dataset

ORDER BY

ORDER BY requires that sample is numeric and has 1 value, i.e. no lists or multi-dimensional arrays

The default order is ascending (ASC)

SELECT * WHERE contains(column_name, 'text_value') ORDER BY column_name ASC

LIMIT

Limits the rows returned by the query

SELECT * WHERE contains(column_name, 'text_value') LIMIT 50

IN and BETWEEN

Only works for scalar numeric values and text references to class_names

SELECT * WHERE column_name IN (1, 2, 6, 10)

SELECT * WHERE column_name BETWEEN 5 AND 20

GROUP BY and UNGROUP BY

GROUP BY creates a sequence of data based on the common properties that are being grouped (i.e. frames into videos). Ungroup by splits sequences into their individual elements (i.e. videos into images).

-- Groups all data with the same label and video_id in to the same sequence
SELECT * GROUP BY label, video_id

-- Splits sequences into their original pieces
SELECT * UNGROUP BY split

EXPAND BY

EXPAND BY includes samples before and after a query condition is satisfied.

SELECT * WHERE <condition> EXPAND BY rows_before, rows_after

Functions

== and !=

Exact match, which generally requires that the sample has 1 value, i.e. no arrays

SELECT * WHERE column_name == 'text_value' -- If value is text
SELECT * WHERE column_name == numeric_value -- If values is numeric

ANY, ALL, and ALL_STRICT

all adheres to NumPy and list logic where all(empty_sample) returns True

all_strict is more intuitive for queries so all_strict(empty_sample) returns False

-- Returns True for empty samples
SELECT * WHERE all(column_name==0) 

-- Returns False for empty samples
SELECT * WHERE all_strict(column_name[:,2]>numeric_value)

SELECT * WHERE any(column_name[0:6]>numeric_value)

CONTAINS

Returns rows where the column contains the specified value.

For faster queries using CONTAINS, define the type with Inverted index_type.

SELECT * WHERE CONTAINS(column_name, 'text_value')

BM25_SIMILARITY

Performs a BM25 similarity search on a text column. Can only be used in an ORDER BY clause.

select * WHERE id > 30 ORDER BY BM25_SIMILARITY(column, 'search text string')

L1_NORM, L2_NORM, LINF_NORM, COSINE_SIMILARITY

Deep Lake supports several vector operations for embedding search.

Typically, vector operations are called by returning data ordered by the score based on the vector search method.

SELECT * FROM (SELECT column_1, column_2, <VECTOR_OPERATION> as score) ORDER BY score DESC LIMIT 10

-- Order should be ASC
l1_norm(<embedding_column> - ARRAY[<search_embedding>])

-- Order should be ASC
l2_norm(<embedding_column> - ARRAY[<search_embedding>])

-- Order should be ASC
linf_norm(<embedding_column> - ARRAY[<search_embedding>])

-- Order should be DESC
cosine_similarity(<embedding_column>, ARRAY[<search_embedding>])

LOGICAL_AND and LOGICAL_OR

Perform a logical AND or OR operation on two arrays element-wise.

SELECT * WHERE any(logical_and(
    column_name_1[:,3] > numeric_value,
    column_name_2 == 'text_value'
))

SHAPE

Returns the shape of an array-like value.

SELECT * WHERE shape(column_name)[dimension_index] > numeric_value

-- Second array dimension > value
SELECT * WHERE shape(column_name)[1] > numeric_value

DATA

Returns the value of a column at a specific index.

-- Referencing values of of a column at index (row_number)
SELECT * ORDER BY l2_norm(<column_name> - data(<column_name>, index))

-- Finds rows of data with embeddings most similar to index 10
SELECT * ORDER BY l2_norm(embedding - data(embedding, 10))

ROW_NUMBER()

Returns the offset of the row in the dataset.

The offset will be consistent within a particular version of the dataset, but may change between versions.

SELECT *, row_number() WHERE row_number() == 10

ROW_ID

Returns the stable row identifier for a row in the dataset.

Unlike row_number, row_id will be consistent across versions of the dataset.

SELECT *, row_id() WHERE row_id() == 10

Custom TQL Functions

Deep Lake supports custom TQL functions that can be defined in Python and used in TQL queries.

TQL interacts with Python functions through numpy.ndarray. The Python function to be used in TQL should accept input arguments as numpy arrays and return numpy array.

To use a custom function in TQL, you must first define the function in Python and then register it with the TQL engine.

def next_number(a):
    return a + 1

deeplake.tql.register_function(next_number)

Then you can use it like any other TQL function:

SELECT * WHERE next_number(column_name) > 10