deeplake.core.vectorstore

DeepLakeVectorStore

class deeplake.core.vectorstore.DeepLakeVectorStore

Base class for DeepLakeVectorStore

__init__(path: ~typing.Union[str, ~pathlib.Path], tensor_params: ~typing.List[~typing.Dict[str, object]] = [{'name': 'text', 'htype': 'text', 'create_id_tensor': False, 'create_sample_info_tensor': False, 'create_shape_tensor': False}, {'name': 'metadata', 'htype': 'json', 'create_id_tensor': False, 'create_sample_info_tensor': False, 'create_shape_tensor': False}, {'name': 'embedding', 'htype': 'embedding', 'dtype': <class 'numpy.float32'>, 'create_id_tensor': False, 'create_sample_info_tensor': False, 'create_shape_tensor': True, 'max_chunk_size': 64000000}, {'name': 'id', 'htype': 'text', 'create_id_tensor': False, 'create_sample_info_tensor': False, 'create_shape_tensor': False}], embedding_function: ~typing.Optional[~typing.Callable] = None, read_only: ~typing.Optional[bool] = False, ingestion_batch_size: int = 1000, num_workers: int = 0, exec_option: str = 'python', token: ~typing.Optional[str] = None, overwrite: bool = False, verbose=True, **kwargs: ~typing.Any) None

Creates an empty DeepLakeVectorStore or loads an existing one if it exists at the specified path.

Examples

>>> # Create a vector store with default tensors
>>> data = DeepLakeVectorStore(
...        path = <path_for_storing_data>,
... )
>>>
>>> # Create a vector store in the Deep Lake Managed Tensor Database
>>> data = DeepLakeVectorStore(
...        path = "hub://org_id/dataset_name",
...        runtime = {"tensor_db": True},
... )
>>>
>>> # Create a vector store with custom tensors
>>> data = DeepLakeVectorStore(
...        path = <path_for_storing_data>,
...        tensor_params = [{"name": "text", "htype": "text"},
...                         {"name": "embedding_1", "htype": "embedding"},
...                         {"name": "embedding_2", "htype": "embedding"},
...                         {"name": "source", "htype": "text"},
...                         {"name": "metadata", "htype": "json"}
...                        ]
... )
Parameters
  • path (str, pathlib.Path) –

    • The full path for storing the Deep Lake Vector Store. It can be:

    • a Deep Lake cloud path of the form hub://org_id/dataset_name. Requires registration with Deep Lake.

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required either in the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • tensor_params (List[Dict[str, dict]], optional) – List of dictionaries containing information about the tensors to be created. See create_tensor in the Deep Lake API docs for more information. Defaults to DEFAULT_VECTORSTORE_TENSORS.

  • embedding_function (Optional[callable], optional) – Function that converts the embeddable data into embeddings. Defaults to None.

  • read_only (bool, optional) – Opens dataset in read-only mode if True. Defaults to False.

  • ingestion_batch_size (int) – Batch size used during ingestion. Defaults to 1000.

  • num_workers (int) – The number of workers to use for ingesting data in parallel. Defaults to 0.

  • exec_option (str) – Default method for search execution. It can be either “python”, “compute_engine”, or “tensor_db”. Defaults to “python”.

    • python - Pure-python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.

    • compute_engine - Performant C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local datasets.

    • tensor_db - Performant, fully-hosted Managed Tensor Database that is responsible for storage and query execution. Only available for data stored in the Deep Lake Managed Database. Store datasets in this database by specifying runtime = {“tensor_db”: True} during dataset creation.

  • token (str, optional) – Activeloop token, used for fetching user credentials. This parameter is optional; tokens are normally autogenerated. Defaults to None.

  • overwrite (bool) – If set to True, this overwrites the Vector Store if it already exists. Defaults to False.

  • verbose (bool) – Whether to print summary of the dataset created. Defaults to True.

  • **kwargs (Any) – Additional keyword arguments.

Danger

Setting overwrite to True will delete all of your data if the Vector Store exists! Be very careful when setting this parameter.
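For orientation, here is a minimal, self-contained sketch of initializing a local Vector Store with a custom embedding function. The local path ./my_vector_store and the toy random embedding function are assumptions made for illustration only; any callable that maps a list of items to a list of float vectors can be used.

>>> import numpy as np
>>> from deeplake.core.vectorstore import DeepLakeVectorStore
>>>
>>> def embedding_function(texts):
...     # Hypothetical stand-in: returns one random 1536-dimensional vector per input text
...     return [np.random.rand(1536).astype(np.float32).tolist() for _ in texts]
...
>>> vector_store = DeepLakeVectorStore(
...     path = "./my_vector_store",             # assumed local path for this example
...     embedding_function = embedding_function,
...     exec_option = "python",                 # local datasets require the pure-python option
... )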

__len__()

Length of the dataset

add(embedding_function: Optional[Callable] = None, embedding_data: Optional[List] = None, embedding_tensor: Optional[str] = None, total_samples_processed: int = 0, return_ids: bool = False, **tensors) Optional[List[str]]

Adds elements to the Deep Lake Vector Store.

Tensor names are specified as parameters, and data for each tensor is specified as parameter values. All data must be of equal length.

Examples

>>> # Directly upload embeddings
>>> deeplake_vector_store.add(
...     text = <list_of_texts>,
...     embedding = <list_of_embeddings>,
...     metadata = <list_of_metadata_jsons>,
... )
>>>
>>> # Upload embedding via embedding function
>>> deeplake_vector_store.add(
...     text = <list_of_texts>,
...     metadata = <list_of_metadata_jsons>,
...     embedding_function = <embedding_function>,
...     embedding_data = <list_of_data_for_embedding>,
... )
>>>
>>> # Upload embedding via embedding function to a user-defined embedding tensor
>>> deeplake_vector_store.add(
...     text = <list_of_texts>,
...     metadata = <list_of_metadata_jsons>,
...     embedding_function = <embedding_function>,
...     embedding_data = <list_of_data_for_embedding>,
...     embedding_tensor = <user_defined_embedding_tensor_name>,
... )
>>>
>>> # Add data to fully custom tensors
>>> deeplake_vector_store.add(
...     tensor_A = <list_of_data_for_tensor_A>,
...     tensor_B = <list_of_data_for_tensor_B>,
...     tensor_C = <list_of_data_for_tensor_C>,
...     embedding_function = <embedding_function>,
...     embedding_data = <list_of_data_for_embedding>,
...     embedding_tensor = <user_defined_embedding_tensor_name>,
... )
Parameters
  • embedding_function (Optional[Callable]) – Embedding function used to convert embedding_data into embeddings. Overrides the embedding_function specified when initializing the Vector Store.

  • embedding_data (Optional[List]) – Data to be converted into embeddings using the provided embedding_function. Defaults to None.

  • embedding_tensor (Optional[str]) – Tensor where results from the embedding function will be stored. If None, the embedding tensor is automatically inferred (when possible). Defaults to None.

  • total_samples_processed (int) – Total number of samples processed before ingestion stopped. Defaults to 0.

  • return_ids (bool) – Whether to return the added ids as an output of the method. Defaults to False.

  • **tensors – Keyword arguments where the key is the tensor name, and the value is a list of samples that should be uploaded to that tensor.

Returns

List of ids if return_ids is set to True. Otherwise, None.

Return type

Optional[List[str]]
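As a concrete illustration of the keyword-argument pattern described above, the sketch below adds two samples with pre-computed embeddings and requests the generated ids. The literal texts, metadata, and three-dimensional embeddings are made up for the example; in practice the embedding width should match your embedding tensor.

>>> ids = deeplake_vector_store.add(
...     text = ["First document", "Second document"],
...     metadata = [{"source": "a.txt"}, {"source": "b.txt"}],
...     embedding = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
...     return_ids = True,
... )
>>> # ids is a list with one id string per added sample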

delete(row_ids: Optional[List[str]] = None, ids: Optional[List[str]] = None, filter: Optional[Union[Dict, Callable]] = None, query: Optional[str] = None, exec_option: Optional[str] = 'python', delete_all: Optional[bool] = None) bool

Deletes data from the Vector Store without deleting the tensor definitions. To delete the Vector Store completely, use DeepLakeVectorStore.delete_by_path() instead.

Examples

>>> # Delete using ids:
>>> data = vector_store.delete(ids = <list_of_ids>)
>>>
>>> # Delete data using filter
>>> data = vector_store.delete(
...        filter = {"json_tensor_name": {"key: value"}, "json_tensor_name_2": {"key_2: value_2"}},
... )
>>>
>>> # Delete data using TQL
>>> data = vector_store.delete(
...        query = "select * where ..... <add TQL syntax>",
...        exec_option = <preferred_exec_option>,
... )
Parameters
  • ids (Optional[List[str]]) – List of unique ids. Defaults to None.

  • row_ids (Optional[List[str]]) – List of absolute row indices from the dataset. Defaults to None.

  • filter (Union[Dict, Callable], optional) – Filter for finding samples for deletion.

    • Dict - Key-value search on tensors of htype json, evaluated on an AND basis (a sample must satisfy all key-value filters to be True). Dict = {“tensor_name_1”: {“key”: value}, “tensor_name_2”: {“key”: value}}

    • Function - Any function that is compatible with deeplake.filter.

  • query (Optional[str]) – TQL Query string for direct evaluation for finding samples for deletion, without application of additional filters.

  • exec_option (str, optional) – Method for search execution for finding samples for deletion. It can be either “python” or “compute_engine”. Defaults to “python”.

    • python - Pure-python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.

    • compute_engine - Performant C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local datasets.

  • delete_all (Optional[bool]) – Whether to delete all the samples and version history of the dataset. Defaults to None.

Returns

Returns True if deletion was successful, otherwise it raises a ValueError.

Return type

bool

Raises

ValueError – If neither ids, filter, query, nor delete_all are specified, or if an invalid exec_option is provided.
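The example below sketches the dictionary-filter form of deletion described above; the tensor name “metadata” and the key/value pair are assumptions for illustration.

>>> # Delete every sample whose metadata json contains source == "a.txt"
>>> success = vector_store.delete(
...     filter = {"metadata": {"source": "a.txt"}},
...     exec_option = "python",
... )
>>>
>>> # Remove all samples and version history, keeping the tensor definitions
>>> vector_store.delete(delete_all = True)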

static delete_by_path(path: Union[str, Path], token: Optional[str] = None) None

Deletes the Vector Store at the specified path.

Parameters
  • path (str, pathlib.Path) – The full path to the Deep Lake Vector Store.

  • token (str, optional) – Activeloop token, used for fetching user credentials. This parameter is optional; tokens are normally autogenerated. Defaults to None.

Danger

This method permanently deletes all of the data in the Vector Store! Be very careful when using it.
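For example, assuming the local Vector Store created earlier lives at ./my_vector_store:

>>> DeepLakeVectorStore.delete_by_path("./my_vector_store")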

search(embedding_data=None, embedding_function: Optional[Callable] = None, embedding: Optional[Union[List[float], ndarray]] = None, k: int = 4, distance_metric: str = 'COS', query: Optional[str] = None, filter: Optional[Union[Dict, Callable]] = None, exec_option: Optional[str] = 'python', embedding_tensor: str = 'embedding', return_tensors: Optional[List[str]] = None, return_view: bool = False) Union[Dict, Dataset]

DeepLakeVectorStore search method that combines embedding search, metadata search, and custom TQL search.

Examples

>>> # Search using an embedding
>>> data = vector_store.search(
...        embedding = <your_embedding>,
...        exec_option = <preferred_exec_option>,
... )
>>>
>>> # Search using an embedding function and data for embedding
>>> data = vector_store.search(
...        embedding_data = "What does this chatbot do?",
...        embedding_function = <your_embedding_function>,
...        exec_option = <preferred_exec_option>,
... )
>>>
>>> # Add a filter to your search
>>> data = vector_store.search(
...        embedding = <your_embedding>,
...        exec_option = <preferred_exec_option>,
...        filter = {"json_tensor_name": {"key: value"}, "json_tensor_name_2": {"key_2: value_2"},...}, # Only valid for exec_option = "python"
... )
>>>
>>> # Search using TQL
>>> data = vector_store.search(
...        query = "select * where ..... <add TQL syntax>",
...        exec_option = <preferred_exec_option>, # Only valid for exec_option = "compute_engine" or "tensor_db"
... )
Parameters
  • embedding (Union[np.ndarray, List[float]], optional) – Embedding representation for performing the search. Defaults to None. The embedding_data and embedding cannot both be specified.

  • embedding_data – Data against which the search will be performed by embedding it using the embedding_function. Defaults to None. The embedding_data and embedding cannot both be specified.

  • embedding_function (callable, optional) – Function for converting embedding_data into embeddings. Only valid if embedding_data is specified.

  • k (int) – Number of elements to return after running query. Defaults to 4.

  • distance_metric (str) – Type of distance metric to use for sorting the data. Available options are: “L1”, “L2”, “COS”, “MAX”. Defaults to “COS”.

  • query (Optional[str]) – TQL Query string for direct evaluation, without application of additional filters or vector search.

  • filter (Union[Dict, Callable], optional) – Additional filter evaluated prior to the embedding search.

    • Dict - Key-value search on tensors of htype json, evaluated on an AND basis (a sample must satisfy all key-value filters to be True). Dict = {“tensor_name_1”: {“key”: value}, “tensor_name_2”: {“key”: value}}

    • Function - Any function that is compatible with deeplake.filter.

  • exec_option (Optional[str]) – Method for search execution. It can be either “python”, “compute_engine”, or “tensor_db”. Defaults to “python”.

    • python - Pure-python implementation that runs on the client and can be used for data stored anywhere. WARNING: using this option with big datasets is discouraged because it can lead to memory issues.

    • compute_engine - Performant C++ implementation of the Deep Lake Compute Engine that runs on the client and can be used for any data stored in or connected to Deep Lake. It cannot be used with in-memory or local datasets.

    • tensor_db - Performant, fully-hosted Managed Tensor Database that is responsible for storage and query execution. Only available for data stored in the Deep Lake Managed Database. Store datasets in this database by specifying runtime = {“tensor_db”: True} during dataset creation.

  • embedding_tensor (str) – Name of tensor with embeddings. Defaults to “embedding”.

  • return_tensors (Optional[List[str]]) – List of tensors to return data for. Defaults to None. If None, all tensors are returned.

  • return_view (bool) – Return a Deep Lake dataset view that satisfies the search parameters, instead of a dictionary with data. Defaults to False.

Raises

ValueError – When invalid parameters are specified.

Returns

Dictionary where keys are tensor names and values are the results of the search. If return_view is True, a Deep Lake dataset view is returned instead.

Return type

Union[Dict, Dataset]
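Putting the parameters together, the sketch below runs a filtered embedding search that limits the returned tensors. The embedding values, tensor names, and filter contents are illustrative assumptions; the embedding length must match the width of the embedding tensor being searched.

>>> results = vector_store.search(
...     embedding = [0.1, 0.2, 0.3],
...     k = 10,
...     distance_metric = "COS",
...     filter = {"metadata": {"source": "a.txt"}},  # only valid for exec_option = "python"
...     exec_option = "python",
...     return_tensors = ["text", "metadata"],
... )
>>> results["text"]  # data from the requested tensors, keyed by tensor name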

summary()

Prints a summary of the dataset

tensors()

Returns the list of tensors present in the dataset
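A quick inspection sketch combining these utilities with __len__:

>>> len(vector_store)       # number of samples in the Vector Store
>>> vector_store.summary()  # prints a summary of the underlying dataset
>>> vector_store.tensors()  # tensors present in the dataset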