deeplake.core.vectorstore
DeepLakeVectorStore
- class deeplake.core.vectorstore.DeepLakeVectorStore
Base class for DeepLakeVectorStore
- __init__(path: Union[str, pathlib.Path], tensor_params: List[Dict[str, object]] = [{'name': 'text', 'htype': 'text', 'create_id_tensor': False, 'create_sample_info_tensor': False, 'create_shape_tensor': False}, {'name': 'metadata', 'htype': 'json', 'create_id_tensor': False, 'create_sample_info_tensor': False, 'create_shape_tensor': False}, {'name': 'embedding', 'htype': 'embedding', 'dtype': numpy.float32, 'create_id_tensor': False, 'create_sample_info_tensor': False, 'create_shape_tensor': True, 'max_chunk_size': 64000000}, {'name': 'id', 'htype': 'text', 'create_id_tensor': False, 'create_sample_info_tensor': False, 'create_shape_tensor': False}], embedding_function: Optional[Callable] = None, read_only: Optional[bool] = False, ingestion_batch_size: int = 1000, num_workers: int = 0, exec_option: str = 'python', token: Optional[str] = None, overwrite: bool = False, verbose=True, **kwargs: Any) → None
Creates an empty DeepLakeVectorStore or loads an existing one if it exists at the specified path.
Examples
>>> # Create a vector store with default tensors
>>> data = DeepLakeVectorStore(
...     path = <path_for_storing_data>,
... )
>>>
>>> # Create a vector store in the Deep Lake Managed Tensor Database
>>> data = DeepLakeVectorStore(
...     path = "hub://org_id/dataset_name",
...     runtime = {"tensor_db": True},
... )
>>>
>>> # Create a vector store with custom tensors
>>> data = DeepLakeVectorStore(
...     path = <path_for_storing_data>,
...     tensor_params = [{"name": "text", "htype": "text"},
...                      {"name": "embedding_1", "htype": "embedding"},
...                      {"name": "embedding_2", "htype": "embedding"},
...                      {"name": "source", "htype": "text"},
...                      {"name": "metadata", "htype": "json"}],
... )
- Parameters
path (str, pathlib.Path) – The full path to the Deep Lake Vector Store. It can be:
- a Deep Lake cloud path of the form hub://org_id/dataset_name. Requires registration with Deep Lake.
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.
- a memory path of the form mem://path/to/dataset which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
tensor_params (List[Dict[str, dict]], optional) – List of dictionaries containing information about the tensors to create. See create_tensor in the Deep Lake API docs for more information. Defaults to DEFAULT_VECTORSTORE_TENSORS.
embedding_function (Optional[Callable], optional) – Function that converts the embeddable data into embeddings. Defaults to None.
read_only (bool, optional) – Opens dataset in read-only mode if True. Defaults to False.
ingestion_batch_size (int) – Batch size used during ingestion. Defaults to 1000.
num_workers (int) – The number of workers to use for ingesting data in parallel. Defaults to 0.
exec_option (str) – Default method for search execution. It can be either "python", "compute_engine", or "tensor_db". Defaults to "python".
- python - Pure-Python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.
- compute_engine - Performant C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local datasets.
- tensor_db - Performant and fully-hosted Managed Tensor Database that is responsible for storage and query execution. Only available for data stored in the Deep Lake Managed Database. Store datasets in this database by specifying runtime = {"tensor_db": True} during dataset creation.
token (str, optional) – Activeloop token, used for fetching user credentials. This is optional, as tokens are normally autogenerated. Defaults to None.
overwrite (bool) – If set to True, overwrites the Vector Store if it already exists. Defaults to False.
verbose (bool) – Whether to print a summary of the dataset created. Defaults to True.
**kwargs (Any) – Additional keyword arguments.
Danger
Setting overwrite to True will delete all of your data if the Vector Store exists! Be very careful when setting this parameter.
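The sketch below is illustrative only: it opens an existing Vector Store in read-only mode and creates one backed by S3 with credentials passed via the creds argument. The path, bucket name, and credential values are placeholders; credentials may also be supplied through the environment instead.
>>> # Open an existing Vector Store without allowing writes (placeholder path)
>>> data = DeepLakeVectorStore(
...     path = "hub://org_id/dataset_name",
...     read_only = True,
... )
>>>
>>> # Create a Vector Store on S3, passing credentials explicitly (placeholder values)
>>> data = DeepLakeVectorStore(
...     path = "s3://bucketname/path/to/dataset",
...     creds = {"aws_access_key_id": "...", "aws_secret_access_key": "..."},
... )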
- __len__()
Length of the dataset
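For example, assuming vector_store is an initialized DeepLakeVectorStore:
>>> # Number of rows (samples) currently stored
>>> len(vector_store)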
- add(embedding_function: Optional[Callable] = None, embedding_data: Optional[List] = None, embedding_tensor: Optional[str] = None, total_samples_processed: int = 0, return_ids: bool = False, **tensors) → Optional[List[str]]
Adds elements to the Deep Lake Vector Store.
Tensor names are specified as parameters, and data for each tensor is specified as parameter values. All data must be of equal length.
Examples
>>> # Directly upload embeddings
>>> deeplake_vector_store.add(
...     text = <list_of_texts>,
...     embedding = <list_of_embeddings>,
...     metadata = <list_of_metadata_jsons>,
... )
>>>
>>> # Upload embeddings via an embedding function
>>> deeplake_vector_store.add(
...     text = <list_of_texts>,
...     metadata = <list_of_metadata_jsons>,
...     embedding_function = <embedding_function>,
...     embedding_data = <list_of_data_for_embedding>,
... )
>>>
>>> # Upload embeddings via an embedding function to a user-defined embedding tensor
>>> deeplake_vector_store.add(
...     text = <list_of_texts>,
...     metadata = <list_of_metadata_jsons>,
...     embedding_function = <embedding_function>,
...     embedding_data = <list_of_data_for_embedding>,
...     embedding_tensor = <user_defined_embedding_tensor_name>,
... )
>>>
>>> # Add data to fully custom tensors
>>> deeplake_vector_store.add(
...     tensor_A = <list_of_data_for_tensor_A>,
...     tensor_B = <list_of_data_for_tensor_B>,
...     tensor_C = <list_of_data_for_tensor_C>,
...     embedding_function = <embedding_function>,
...     embedding_data = <list_of_data_for_embedding>,
...     embedding_tensor = <user_defined_embedding_tensor_name>,
... )
- Parameters
embedding_function (Optional[Callable]) – Embedding function used to convert embedding_data into embeddings. Overrides the embedding_function specified when initializing the Vector Store.
embedding_data (Optional[List]) – Data to be converted into embeddings using the provided embedding_function. Defaults to None.
embedding_tensor (Optional[str]) – Tensor where results from the embedding function will be stored. If None, the embedding tensor is automatically inferred (when possible). Defaults to None.
total_samples_processed (int) – Total number of samples processed before ingestion stopped; used when resuming an interrupted ingestion. Defaults to 0.
return_ids (bool) – Whether to return the added ids as an output of the method. Defaults to False.
**tensors – Keyword arguments where the key is the tensor name, and the value is a list of samples that should be uploaded to that tensor.
- Returns
List of ids if return_ids is set to True. Otherwise, None.
- Return type
Optional[List[str]]
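As an illustrative sketch of return_ids, the call below adds two rows and keeps the generated ids so they can later be passed to delete(); all values shown are placeholders.
>>> ids = deeplake_vector_store.add(
...     text = ["first document", "second document"],
...     embedding = [<embedding_1>, <embedding_2>],
...     metadata = [{"source": "a"}, {"source": "b"}],
...     return_ids = True,
... )
>>> # ids can later be used for targeted deletion, e.g. deeplake_vector_store.delete(ids = ids)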
- delete(row_ids: Optional[List[str]] = None, ids: Optional[List[str]] = None, filter: Optional[Union[Dict, Callable]] = None, query: Optional[str] = None, exec_option: Optional[str] = 'python', delete_all: Optional[bool] = None) → bool
Deletes data from the Vector Store. Does not delete the tensor definitions. To delete the Vector Store completely, use DeepLakeVectorStore.delete_by_path().
Examples
>>> # Delete using ids
>>> data = vector_store.delete(ids)
>>>
>>> # Delete data using a filter
>>> data = vector_store.delete(
...     filter = {"json_tensor_name": {"key": value}, "json_tensor_name_2": {"key_2": value_2}},
... )
>>>
>>> # Delete data using TQL
>>> data = vector_store.delete(
...     query = "select * where ..... <add TQL syntax>",
...     exec_option = <preferred_exec_option>,
... )
- Parameters
ids (Optional[List[str]]) – List of unique ids. Defaults to None.
row_ids (Optional[List[str]]) – List of absolute row indices from the dataset. Defaults to None.
filter (Union[Dict, Callable], optional) – Filter for finding samples for deletion.
- Dict - Key-value search on tensors of htype json, evaluated on an AND basis (a sample must satisfy all key-value filters to be True). Dict = {"tensor_name_1": {"key": value}, "tensor_name_2": {"key": value}}
- Function - Any function that is compatible with deeplake.filter.
query (Optional[str]) – TQL query string for direct evaluation for finding samples for deletion, without application of additional filters.
exec_option (str, optional) – Method for search execution for finding samples for deletion. It can be either "python" or "compute_engine". Defaults to "python".
- python - Pure-Python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.
- compute_engine - Performant C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local datasets.
delete_all (Optional[bool]) – Whether to delete all the samples and version history of the dataset. Defaults to None.
- Returns
Returns True if deletion was successful, otherwise it raises a ValueError.
- Return type
bool
- Raises
ValueError – If neither ids, filter, query, nor delete_all are specified, or if an invalid exec_option is provided.
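As a further illustrative sketch (assuming an existing vector_store), the calls below delete specific rows by absolute index and then wipe the entire contents with delete_all; both operations are destructive.
>>> # Delete specific rows by absolute index
>>> vector_store.delete(row_ids = [0, 2, 5])
>>>
>>> # Delete every sample and the version history (irreversible)
>>> vector_store.delete(delete_all = True)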
- static delete_by_path(path: Union[str, Path], token: Optional[str] = None) → None
Deletes the Vector Store at the specified path.
- Parameters
path (str, pathlib.Path) – The full path to the Deep Lake Vector Store.
token (str, optional) – Activeloop token, used for fetching user credentials. This is optional, as tokens are normally autogenerated. Defaults to None.
Danger
This method permanently deletes all of your data in the Vector Store! Be very careful when using this method.
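A minimal usage sketch (the path is a placeholder):
>>> DeepLakeVectorStore.delete_by_path("hub://org_id/dataset_name")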
- search(embedding_data=None, embedding_function: Optional[Callable] = None, embedding: Optional[Union[List[float], ndarray]] = None, k: int = 4, distance_metric: str = 'COS', query: Optional[str] = None, filter: Optional[Union[Dict, Callable]] = None, exec_option: Optional[str] = 'python', embedding_tensor: str = 'embedding', return_tensors: Optional[List[str]] = None, return_view: bool = False) → Union[Dict, Dataset]
DeepLakeVectorStore search method that combines embedding search, metadata search, and custom TQL search.
Examples
>>> # Search using an embedding
>>> data = vector_store.search(
...     embedding = <your_embedding>,
...     exec_option = <preferred_exec_option>,
... )
>>>
>>> # Search using an embedding function and data for embedding
>>> data = vector_store.search(
...     embedding_data = "What does this chatbot do?",
...     embedding_function = <your_embedding_function>,
...     exec_option = <preferred_exec_option>,
... )
>>>
>>> # Add a filter to your search
>>> data = vector_store.search(
...     embedding = <your_embedding>,
...     exec_option = <preferred_exec_option>,
...     filter = {"json_tensor_name": {"key": value}, "json_tensor_name_2": {"key_2": value_2}, ...},  # Only valid for exec_option = "python"
... )
>>>
>>> # Search using TQL
>>> data = vector_store.search(
...     query = "select * where ..... <add TQL syntax>",
...     exec_option = <preferred_exec_option>,  # Only valid for exec_option = "compute_engine" or "tensor_db"
... )
- Parameters
embedding (Union[np.ndarray, List[float]], optional) – Embedding representation for performing the search. Defaults to None. embedding_data and embedding cannot both be specified.
embedding_data – Data against which the search will be performed by embedding it using the embedding_function. Defaults to None. embedding_data and embedding cannot both be specified.
embedding_function (Callable, optional) – Function for converting embedding_data into an embedding. Only valid if embedding_data is specified.
k (int) – Number of elements to return after running query. Defaults to 4.
distance_metric (str) – Type of distance metric to use for sorting the data. Available options are: "L1", "L2", "COS", "MAX". Defaults to "COS".
query (Optional[str]) – TQL Query string for direct evaluation, without application of additional filters or vector search.
filter (Union[Dict, Callable], optional) – Additional filter evaluated prior to the embedding search.
- Dict - Key-value search on tensors of htype json, evaluated on an AND basis (a sample must satisfy all key-value filters to be True). Dict = {"tensor_name_1": {"key": value}, "tensor_name_2": {"key": value}}
- Function - Any function that is compatible with deeplake.filter.
exec_option (Optional[str]) – Method for search execution. It can be either "python", "compute_engine", or "tensor_db". Defaults to "python".
- python - Pure-Python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.
- compute_engine - Performant C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local datasets.
- tensor_db - Performant and fully-hosted Managed Tensor Database that is responsible for storage and query execution. Only available for data stored in the Deep Lake Managed Database. Store datasets in this database by specifying runtime = {"tensor_db": True} during dataset creation.
embedding_tensor (str) – Name of the tensor with embeddings. Defaults to "embedding".
return_tensors (Optional[List[str]]) – List of tensors to return data for. Defaults to None. If None, all tensors are returned.
return_view (bool) – Return a Deep Lake dataset view that satisfies the search parameters, instead of a dictionary with data. Defaults to False.
- Raises
ValueError – When invalid parameters are specified.
- Returns
Dictionary where keys are tensor names and values are the results of the search, or a Deep Lake dataset view if return_view is True.
- Return type
Union[Dict, Dataset]
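To illustrate return_tensors and return_view, the sketch below first limits the returned data to selected tensors, then requests a dataset view instead of a dictionary; the embedding value is a placeholder.
>>> # Return only the text and metadata tensors for the top 10 matches
>>> results = vector_store.search(
...     embedding = <your_embedding>,
...     k = 10,
...     return_tensors = ["text", "metadata"],
... )
>>>
>>> # Return a Deep Lake dataset view instead of a dictionary
>>> view = vector_store.search(
...     embedding = <your_embedding>,
...     return_view = True,
... )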
- summary()
Prints a summary of the dataset
- tensors()
Returns the list of tensors present in the dataset
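For example, assuming vector_store was created with the default tensors:
>>> vector_store.summary()   # prints a summary of the underlying dataset
>>> vector_store.tensors()   # lists the tensors, e.g. the default 'text', 'metadata', 'embedding', 'id'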