Index Types
Deep Lake supports several index types to improve the performance of queries.
Without an index, Deep Lake must scan the entire dataset to find the rows that match a query, which can be slow for large datasets. Defining an index on a column speeds up queries whose filter or sort operations match that index definition.
Text Indexes
Deep Lake supports two types of indexes on text columns: Inverted and BM25.
Inverted Indexes
Inverted indexes are used to speed up queries that use the CONTAINS function.
This allows you to search for specific words or phrases in a text column.
Enable an inverted index by specifying index_type=types.TextIndexType.Inverted when adding the column.
from deeplake import types
# `ds` is an existing Deep Lake dataset.
ds.add_column("text_column", types.Text(index_type=types.TextIndexType.Inverted))
ds.query("SELECT * WHERE CONTAINS(text_column, 'term')")
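To illustrate why an inverted index avoids a full scan, here is a toy sketch in plain Python. It is not Deep Lake's implementation; the function names `build_inverted_index` and `contains` are hypothetical, and tokenization is assumed to be a simple lowercase whitespace split.

```python
from collections import defaultdict

def build_inverted_index(rows):
    """Map each lowercase token to the set of row ids whose text contains it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in text.lower().split():
            index[token].add(row_id)
    return index

def contains(index, term):
    """Return the sorted row ids containing `term` via a lookup, not a scan."""
    return sorted(index.get(term.lower(), set()))

rows = [
    "deep lake stores tensors",
    "an index speeds up search",
    "search the lake by term",
]
index = build_inverted_index(rows)
matches = contains(index, "lake")  # rows 0 and 2 mention "lake"
```

The index is built once; each later lookup touches only the entry for the requested term instead of every row.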
BM25 Indexes
BM25 indexes are used to speed up queries that use the BM25_SIMILARITY function.
This allows you to sort rows by their similarity to a search query.
Compared to the CONTAINS function, BM25_SIMILARITY provides a ranked "search results" experience rather than simple filtering.
Enable a BM25 index by specifying index_type=types.TextIndexType.BM25 when adding the column.
from deeplake import types
# `ds` is an existing Deep Lake dataset.
ds.add_column("text_column", types.Text(index_type=types.TextIndexType.BM25))
ds.query("SELECT * WHERE id > 30 ORDER BY BM25_SIMILARITY(text_column, 'search text string') LIMIT 100")
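For intuition about what a BM25 score measures, the following is a minimal sketch of the standard Okapi BM25 formula in plain Python. The parameters `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the whitespace tokenizer are assumptions chosen for illustration; this page does not state which variant or parameters Deep Lake uses.

```python
import math
from collections import Counter

def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "vector search with deep lake",
    "text search text ranking",
    "unrelated sentence about cats",
]
scores = bm25_scores(docs, "text search")
# Best match first: document indices ordered by descending score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Unlike a CONTAINS-style filter, every document receives a graded score, so results can be ordered rather than only included or excluded.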
Vector Indexes
COSINE_SIMILARITY Indexes
Cosine similarity indexes are used to speed up queries that use the COSINE_SIMILARITY function.
This allows you to sort rows by the similarity of their embeddings to the embedding of a search query.
Cosine similarity indexes are automatically created on embedding-typed columns.
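As a reminder of the quantity being ranked, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it in plain Python; the function name `cosine_similarity` is illustrative and is not a Deep Lake API.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

Values range from -1 (opposite direction) to 1 (same direction), which is why similarity queries sort in descending order to put the closest rows first.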
Quantization
Deep Lake supports quantization of embeddings to reduce the size of the index.
This can slightly decrease the accuracy of results, but can significantly improve query performance and reduce the on-disk size of the index.
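The trade-off can be illustrated with one common form of binary quantization: keep only the sign of each component, then compare vectors by counting differing bits. This is a conceptual sketch under that assumption; this page does not describe how Deep Lake implements binary quantization internally, and the names `binarize` and `hamming_distance` are hypothetical.

```python
def binarize(vector):
    """Keep one bit per dimension: 1 if the component is positive, else 0."""
    return [1 if x > 0 else 0 for x in vector]

def hamming_distance(bits_a, bits_b):
    """Count the positions where two bit vectors differ."""
    return sum(a != b for a, b in zip(bits_a, bits_b))

query = [0.9, -0.2, 0.4, -0.7]
near = [0.8, -0.1, 0.5, -0.6]   # points roughly the same way as `query`
far = [-0.9, 0.3, -0.4, 0.6]    # points roughly the opposite way

near_distance = hamming_distance(binarize(query), binarize(near))
far_distance = hamming_distance(binarize(query), binarize(far))
```

Magnitudes are discarded, which is where the small accuracy loss comes from. In exchange, a 768-dimensional embedding that would take 3072 bytes as 32-bit floats (an assumed storage format) fits in 768 bits, or 96 bytes.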
Quantization can be enabled by setting quantization=types.QuantizationType.Binary when adding the embedding column.
Example
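from deeplake import types
# `ds` is an existing Deep Lake dataset.
ds.add_column("embedding_column", types.Embedding(768, quantization=types.QuantizationType.Binary))
# `model` stands for whatever embedding model you use; it must return a 768-dimensional vector.
embed_query = model.encode("search text string")
# Serialize the vector as a comma-separated list for the TQL array literal.
str_query = ",".join(str(c) for c in embed_query)
ds.query(f"SELECT * WHERE id > 30 ORDER BY COSINE_SIMILARITY(embedding_column, array[{str_query}]) DESC LIMIT 100")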
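If you build such queries in several places, the string formatting can be factored into a small helper. The function below is a hypothetical convenience wrapper, not part of the Deep Lake API; it only assembles the query text in the same shape as the example above and leaves execution to ds.query.

```python
def cosine_query(column, embedding, limit=100):
    """Build a TQL string that ranks rows by cosine similarity to `embedding`."""
    # Convert each component to a plain Python float before formatting.
    values = ",".join(str(float(c)) for c in embedding)
    return (
        f"SELECT * ORDER BY COSINE_SIMILARITY({column}, array[{values}]) "
        f"DESC LIMIT {limit}"
    )

query_text = cosine_query("embedding_column", [0.1, 0.2], limit=5)
```

The resulting string can then be passed to ds.query as in the example above.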
Next Steps
- Learn more about TQL
- Visit the Types API Reference for more information on types and indexes