Index Types

Deep Lake supports several index types to improve the performance of queries.

Without an index, Deep Lake must scan the entire dataset to find the rows that match a query, which can be slow for large datasets. By defining an index on a column, you can speed up filter and sort operations that match that index definition.

Text Indexes

Deep Lake supports two types of indexes on text columns: Inverted and BM25.

Inverted Indexes

Inverted indexes are used to speed up queries that use the CONTAINS function.

This allows you to search for specific words or phrases in a text column.

Enable an inverted index by specifying index_type=types.TextIndexType.Inverted when adding the column.

from deeplake import types

# ds is assumed to be an already-open, writable Deep Lake dataset.
ds.add_column("text_column", types.Text(index_type=types.TextIndexType.Inverted))

ds.query("SELECT * WHERE CONTAINS(text_column, 'term')")
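To see why an inverted index avoids a full scan, here is a minimal, self-contained sketch of the general technique in plain Python. It is not Deep Lake's implementation: the whitespace tokenizer and the helper names build_inverted_index and rows_containing are assumptions made for illustration.

```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map each lowercase token to the set of row numbers that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        # Assumption: tokens are split on whitespace; real tokenizers differ.
        for token in text.lower().split():
            index[token].add(row_id)
    return index


def rows_containing(index, term):
    """Return matching row numbers via a single lookup, without scanning rows."""
    return sorted(index.get(term.lower(), set()))


rows = ["the quick brown fox", "a lazy dog", "the dog barks"]
index = build_inverted_index(rows)
matches = rows_containing(index, "dog")
```

Here matches is [1, 2]: the lookup reads only the entry for "dog" instead of inspecting all three rows.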

BM25 Indexes

BM25 indexes are used to speed up queries that use the BM25_SIMILARITY function.

This allows you to sort rows by their similarity to a search query.

Whereas CONTAINS only filters rows in or out, BM25_SIMILARITY ranks them by relevance, which gives more of a "search results" experience.

Enable a BM25 index by specifying index_type=types.TextIndexType.BM25 when adding the column.

from deeplake import types

ds.add_column("text_column", types.Text(index_type=types.TextIndexType.BM25))

ds.query("SELECT * WHERE id > 30 ORDER BY BM25_SIMILARITY(text_column, 'search text string') LIMIT 100")
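For intuition about what a BM25 score rewards, the sketch below computes the widely used Okapi BM25 formula in plain Python. It is an illustration under stated assumptions, not Deep Lake's code: the whitespace tokenizer, the function name bm25_scores, and the parameter values k1=1.5 and b=0.75 (common defaults) are not taken from the Deep Lake documentation.

```python
import math
from collections import Counter


def bm25_scores(docs, query, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Number of documents in which each term appears at least once.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a larger weight; the +1 keeps the weight positive.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Longer-than-average documents are penalized through b.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "deep lake stores tensors",
    "a lake in the mountains",
    "vector search with deep lake",
]
scores = bm25_scores(docs, "deep lake")
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

The resulting ranking is [0, 2, 1]: both documents that contain "deep" and "lake" outrank the one that contains only "lake", and the shorter of the two ranks first.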

Vector Indexes

COSINE_SIMILARITY Indexes

Cosine similarity indexes are used to speed up queries that use the COSINE_SIMILARITY function.

This allows you to sort rows by the similarity of their embeddings to a query embedding.

Cosine similarity indexes are automatically created on embedding-typed columns.
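As a reminder of the quantity being ranked, the standard cosine similarity of two vectors is their dot product divided by the product of their lengths. The plain-Python helper below is a sketch of that textbook definition; it assumes COSINE_SIMILARITY follows the same definition, and the function name is chosen for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 0.0], [1.0, 0.0])   # 1.0
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])       # 0.0
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])        # -1.0
```

Higher values mean more similar vectors, which is why similarity queries sort in descending order.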

Quantization

Deep Lake supports quantization of embeddings to reduce the size of the index.

This can slightly decrease the accuracy of results, but can significantly improve query performance and reduce the on-disk size of the index.

Enable quantization by setting quantization=types.QuantizationType.Binary when adding the embedding column.

Example

from deeplake import types

ds.add_column(
    "embedding_column",
    types.Embedding(768, quantization=types.QuantizationType.Binary),
)

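# model is assumed to be any embedding model whose encode() returns a sequence of 768 floats.
embed_query = model.encode("search text string")
# Serialize the vector as comma-separated values for the array[...] literal.
str_query = ",".join(str(c) for c in embed_query)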

ds.query(f"SELECT * WHERE id > 30 ORDER BY COSINE_SIMILARITY(embedding_column, array[{str_query}]) DESC LIMIT 100")
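To illustrate the trade-off that binary quantization makes, the sketch below shows one common scheme: keep only the sign of each component (one bit per dimension) and compare vectors by Hamming distance. This is an assumption about the general technique, not a description of Deep Lake's internal format, and the helper names binarize and hamming_distance are hypothetical.

```python
def binarize(vector):
    """Pack the sign of each component into an int, one bit per dimension."""
    bits = 0
    for value in vector:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions in which two packed vectors differ."""
    return bin(a ^ b).count("1")


query = [0.9, -0.2, 0.4, -0.7]
close = [0.8, -0.1, 0.5, -0.6]   # same signs as the query
far = [-0.9, 0.3, -0.4, 0.6]     # every sign flipped

distance_close = hamming_distance(binarize(query), binarize(close))
distance_far = hamming_distance(binarize(query), binarize(far))
```

Here distance_close is 0 and distance_far is 4. The magnitudes are discarded, which is where the accuracy loss comes from; in exchange, under this scheme and assuming 32-bit floats, a 768-dimensional vector would shrink from 3,072 bytes to 96 bytes.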

Next Steps