TQL Syntax
Overview
Deep Lake offers a performant SQL-based query engine for data analysis called "TQL" (Tensor Query Language).
TQL can be used directly against a dataset with ds.query(), or using the deeplake.query() method to span datasets.
When calling query()
directly a dataset, there is no need to specify the "FROM" clause.
However, when calling the static deeplake.query()
on a dataset, the "FROM" clause is required, specifying the URL of the dataset(s) to query.
Quoting
String values should be wrapped with single quotes:
Column or dataset names with special characters should be wrapped with double-quotes:
SELECT * WHERE contains("column-name", 'text_value')
SELECT * FROM "al://my_org/dataset" WHERE id > 10
Tip
Make sure to wrap double-quotes with escape characters in Python, or use different string quoting:
Virtual Columns
Virtual columns are the result of a computation and are not columns in the Deep Lake dataset. However, they can be treated as columns in the API.
-- "score" is a virtual column
SELECT * FROM (SELECT column_1, column_2, <VECTOR_OPERATION> as score)
ORDER BY score DESC LIMIT 10
-- "box_beyond_image" is a virtual column
SELECT *, any(boxes[:,0])<0 as box_beyond_image WHERE ....
-- "column_sum" is a virtual column
-- When combining embedding search with filtering (where conditions),
-- the filter condition is evaluated prior to the embedding search.
SELECT *, column_1 + column_3 as column_sum WHERE ......
Keywords
Most common SQL keywords are supported in TQL, including:
AND, OR, NOT
Standard logical operators
SELECT * WHERE contains(column_name, 'text_value')
and NOT contains(column_name_2, numeric_value)
SELECT * WHERE contains(column_name, 'text_value')
or column_name_2 == numeric_value
SELECT * WHERE (contains(column_name, 'text_value')
and shape(column_name_2)[dimension_index]>numeric_value)
or contains(column_name, 'text_value_2')
JOIN
With Deep Lake, you can use JOIN to combine data from two datasets regardless of where they live.
SELECT * FROM "s3://bucket/users" AS users JOIN "az://other_bucket/likes" AS likes on users.id == likes.user_id
UNION and INTERSECT
UNION and INTERSECT are used to combine or intersect the results of two queries
(SELECT * WHERE contains(column_name, 'value'))
intersect (SELECT * WHERE contains(column_name, 'value_2'))
(SELECT * WHERE contains(column_name, 'value') LIMIT 100)
union (SELECT * WHERE shape(column_name)[0] > numeric_value LIMIT 100)
SAMPLE BY
SELECT * SAMPLE BY weight_choice(
expression_1: weight_1,
expression_2: weight_2, ...)
replace True LIMIT N
weight_choice
resolves the weight that is used when multiple expressions evaluate to True for a given sample. Options are max_weight, sum_weight. For example, if weight_choice is max_weight, then the maximum weight will be chosen for that sample.replace
determines whether samples should be drawn with replacement. It defaults to True.limit
specifies the number of samples that should be returned. If unspecified, the sampler will return the number of samples corresponding to the length of the dataset
ORDER BY
ORDER BY requires that sample is numeric and has 1 value, i.e. no lists or multi-dimensional arrays
The default order is ascending (ASC)
LIMIT
Limits the rows returned by the query
IN and BETWEEN
Only works for scalar numeric values and text references to class_names
GROUP BY and UNGROUP BY
GROUP BY creates a sequence of data based on the common properties that are being grouped (i.e. frames into videos). Ungroup by splits sequences into their individual elements (i.e. videos into images).
-- Groups all data with the same label and video_id in to the same sequence
SELECT * GROUP BY label, video_id
-- Splits sequences into their original pieces
SELECT * UNGROUP BY split
EXPAND BY
EXPAND BY includes samples before and after a query condition is satisfied.
Functions
== and !=
Exact match, which generally requires that the sample has 1 value, i.e. no arrays
SELECT * WHERE column_name == 'text_value' -- If value is text
SELECT * WHERE column_name == numeric_value -- If values is numeric
ANY, ALL, and ALL_STRICT
all
adheres to NumPy and list logic where all(empty_sample)
returns True
all_strict
is more intuitive for queries so all_strict(empty_sample)
returns False
-- Returns True for empty samples
SELECT * WHERE all(column_name==0)
-- Returns False for empty samples
SELECT * WHERE all_strict(column_name[:,2]>numeric_value)
SELECT * WHERE any(column_name[0:6]>numeric_value)
CONTAINS
Returns rows where the column contains the specified value.
For faster queries using CONTAINS, define the type with Inverted index_type.
BM25_SIMILARITY
Performs a BM25 similarity search on a text column. Can only be used in an ORDER BY clause.
L1_NORM, L2_NORM, LINF_NORM, COSINE_SIMILARITY
Deep Lake supports several vector operations for embedding search.
Typically, vector operations are called by returning data ordered by the score based on the vector search method.
SELECT * FROM (SELECT column_1, column_2, <VECTOR_OPERATION> as score) ORDER BY score DESC LIMIT 10
-- Order should be ASC
l1_norm(<embedding_column> - ARRAY[<search_embedding>])
-- Order should be ASC
l2_norm(<embedding_column> - ARRAY[<search_embedding>])
-- Order should be ASC
linf_norm(<embedding_column> - ARRAY[<search_embedding>])
-- Order should be DESC
cosine_similarity(<embedding_column>, ARRAY[<search_embedding>])
LOGICAL_AND and LOGICAL_OR
Perform a logical AND or OR operation on two arrays element-wise.
SELECT * WHERE any(logical_and(
column_name_1[:,3] > numeric_value,
column_name_2 == 'text_value'
))
SHAPE
Returns the shape of an array-like value.
SELECT * WHERE shape(column_name)[dimension_index] > numeric_value
-- Second array dimension > value
SELECT * WHERE shape(column_name)[1] > numeric_value
DATA
Returns the value of a column at a specific index.
-- Referencing values of of a column at index (row_number)
SELECT * ORDER BY l2_norm(<column_name> - data(<column_name>, index))
-- Finds rows of data with embeddings most similar to index 10
SELECT * ORDER BY l2_norm(embedding - data(embedding, 10))
ROW_NUMBER()
Returns the offset of the row in the dataset.
The offset will be consistent within a particular version of the dataset, but may change between versions.
ROW_ID
Returns the stable row identifier for a row in the dataset.
Unlike row_number, row_id will be consistent across versions of the dataset.
Custom TQL Functions
Deep Lake supports custom TQL functions that can be defined in Python and used in TQL queries.
TQL interacts with Python functions through numpy.ndarray. The Python function to be used in TQL should accept input arguments as numpy arrays and return numpy array.
To use a custom function in TQL, you must first define the function in Python and then register it with the TQL engine.
Then you can use it like any other TQL function: