deeplake

The deeplake package provides a database that stores data as compressed, chunked arrays which can be kept in any storage location and later streamed to deep learning models.

deeplake.dataset(path: str | Path, read_only: bool | None = None, overwrite: bool = False, public: bool = False, memory_cache_size: int = 256, local_cache_size: int = 0, creds: str | Dict | None = None, token: str | None = None, verbose: bool = True, access_method: str = 'stream')

Returns a Dataset object referencing either a new or existing dataset.

Examples

>>> ds = deeplake.dataset("hub://username/dataset")
>>> ds = deeplake.dataset("s3://mybucket/my_dataset")
>>> ds = deeplake.dataset("./datasets/my_dataset", overwrite=True)
Parameters:
  • path (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use ‘activeloop login’ from command line)

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • read_only (bool, optional) – Opens dataset in read-only mode if this is passed as True. Defaults to False. Datasets stored on Deep Lake cloud that your account does not have write access to will automatically open in read-only mode.

  • overwrite (bool) – If set to True this overwrites the dataset if it already exists. Defaults to False.

  • public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.

  • memory_cache_size (int) – The size of the memory cache to be used in MB.

  • local_cache_size (int) – The size of the local filesystem cache to be used in MB.

  • creds (dict, optional) –

    • A dictionary containing credentials used to access the dataset at the path.

    • If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.

    • It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

  • verbose (bool) – If True, logs will be printed. Defaults to True.

  • access_method (str) –

    The access method to use for the dataset. Can be:

    • ’stream’

      • Streams the data from the dataset i.e. only fetches data when required. This is the default value.

    • ’download’

      • Downloads the data to the local filesystem at the path specified in the environment variable DEEPLAKE_DOWNLOAD_PATH. This will overwrite the local copy at that path if the dataset was previously downloaded.

      • Raises an exception if DEEPLAKE_DOWNLOAD_PATH environment variable is not set or if the dataset does not exist.

      • The ‘download’ access method can be modified to specify num_workers and/or scheduler. For example: ‘download:2:processed’ will use 2 workers and use processed scheduler, while ‘download:3’ will use 3 workers and default scheduler (threaded), and ‘download:processed’ will use a single worker and use processed scheduler. A usage sketch is shown after the notes below.

    • ’local’

      • Downloads dataset if it doesn’t already exist, otherwise loads from local storage.

      • Raises an exception if DEEPLAKE_DOWNLOAD_PATH environment variable is not set or the dataset is not found in DEEPLAKE_DOWNLOAD_PATH.

      • The ‘local’ access method can be modified to specify num_workers and/or scheduler to be used in case dataset needs to be downloaded. If dataset needs to be downloaded, ‘local:2:processed’ will use 2 workers and use processed scheduler, while ‘local:3’ will use 3 workers and default scheduler (threaded), and ‘local:processed’ will use a single worker and use processed scheduler.

Returns:

Dataset created using the arguments provided.

Return type:

Dataset

Danger

Setting overwrite to True will delete all of your data if it exists! Be very careful when setting this parameter.

Warning

Setting access_method to download will overwrite the local copy of the dataset if it was previously downloaded.

Note

Any changes made to the dataset in download / local mode will only be made to the local copy and will not be reflected in the original dataset.
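
For instance, the modified ‘download’ access method described above, with 2 workers and the processed scheduler, can be requested as follows (the dataset path is a placeholder):

>>> ds = deeplake.dataset("hub://username/dataset", access_method="download:2:processed")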

deeplake.empty(path: str | Path, overwrite: bool = False, public: bool = False, memory_cache_size: int = 256, local_cache_size: int = 0, creds: dict | None = None, token: str | None = None, verbose: bool = True) → Dataset

Creates an empty dataset.
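
Examples

A minimal sketch; the local path and tensor names are placeholders:

>>> ds = deeplake.empty("./datasets/new_ds")
>>> ds.create_tensor("images", htype="image", sample_compression="jpeg")
>>> ds.create_tensor("labels", htype="class_label")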

Parameters:
  • path (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use ‘activeloop login’ from command line)

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • overwrite (bool) – If set to True this overwrites the dataset if it already exists. Defaults to False.

  • public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.

  • memory_cache_size (int) – The size of the memory cache to be used in MB.

  • local_cache_size (int) – The size of the local filesystem cache to be used in MB.

  • creds (dict, optional) –

    • A dictionary containing credentials used to access the dataset at the path.

    • If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.

    • It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

  • verbose (bool) – If True, logs will be printed. Defaults to True.

Returns:

Dataset created using the arguments provided.

Return type:

Dataset

Danger

Setting overwrite to True will delete all of your data if it exists! Be very careful when setting this parameter.

deeplake.like(dest: str | Path, src: str | Dataset | Path, tensors: List[str] | None = None, overwrite: bool = False, creds: dict | None = None, token: str | None = None, public: bool = False) → Dataset

Creates a new dataset by copying the source dataset’s structure to a new location. No samples are copied; only the meta/info for the dataset and its tensors.
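
Examples

A minimal sketch; the source and destination paths are placeholders:

>>> src_ds = deeplake.load("hub://username/source_ds")
>>> new_ds = deeplake.like("./datasets/same_structure", src_ds)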

Parameters:
  • dest – Empty Dataset or Path where the new dataset will be created.

  • src (Union[str, Dataset]) – Path or dataset object that will be used as the template for the new dataset.

  • tensors (List[str], optional) – Names of tensors (and groups) to be replicated. If not specified all tensors in source dataset are considered.

  • overwrite (bool) – If True and a dataset exists at destination, it will be overwritten. Defaults to False.

  • creds (dict, optional) –

    • A dictionary containing credentials used to access the dataset at the path.

    • If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.

    • It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

  • public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.

Returns:

New dataset object.

Return type:

Dataset

deeplake.ingest(src: str | Path, dest: str | Path, images_compression: str = 'auto', dest_creds: Dict | None = None, progressbar: bool = True, summary: bool = True, **dataset_kwargs) → Dataset

Ingests a dataset from a source and stores it as a structured dataset at the destination.
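
Examples

A minimal sketch for an image classification folder; the source and destination paths are placeholders:

>>> ds = deeplake.ingest(
...     src="./unstructured_data/animals",
...     dest="hub://username/animals",
...     images_compression="jpeg",
... )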

Parameters:
  • src (str, pathlib.Path) – Local path to where the unstructured dataset is stored or path to csv file.

  • dest (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use ‘activeloop login’ from command line)

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • images_compression (str) – For image classification datasets, this compression will be used for the images tensor. If images_compression is “auto”, compression will be automatically determined by the most common extension in the directory.

  • dest_creds (Optional[Dict]) – A dictionary containing credentials used to access the destination path of the dataset.

  • progressbar (bool) – Enables or disables ingestion progress bar. Defaults to True.

  • summary (bool) – If True, a summary of skipped files will be printed after completion. Defaults to True.

  • **dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function.

Returns:

New dataset object with structured dataset.

Return type:

Dataset

Note

  • Currently only local source paths and image classification datasets / csv files are supported for automatic ingestion.

  • Supported filetypes: png/jpeg/jpg/csv.

  • All files and sub-directories with unsupported filetypes are ignored.

  • Valid source directory structures for image classification look like:

    data/
        img0.jpg
        img1.jpg
        ...
    
  • or:

    data/
        class0/
            cat0.jpg
            ...
        class1/
            dog0.jpg
            ...
        ...
    
  • or:

    data/
        train/
            class0/
                img0.jpg
                ...
            ...
        val/
            class0/
                img0.jpg
                ...
            ...
        ...
    
  • Classes defined as sub-directories can be accessed at ds["test/labels"].info.class_names.

  • Support for train and test sub-directories is present under ds["train/images"], ds["train/labels"] and ds["test/images"], ds["test/labels"].

  • Mapping filenames to classes from an external file is currently not supported.

deeplake.ingest_kaggle(tag: str, src: str | Path, dest: str | Path, exist_ok: bool = False, images_compression: str = 'auto', dest_creds: Dict | None = None, kaggle_credentials: dict | None = None, progressbar: bool = True, summary: bool = True, **dataset_kwargs) → Dataset

Downloads and ingests a kaggle dataset and stores it as a structured dataset at the destination.
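
Examples

A minimal sketch using the tag from the parameter description below; the local download path, destination and credentials are placeholders:

>>> ds = deeplake.ingest_kaggle(
...     tag="coloradokb/dandelionimages",
...     src="./kaggle_data/dandelionimages",
...     dest="hub://username/dandelionimages",
...     kaggle_credentials={"username": "YOUR_USERNAME", "key": "YOUR_KEY"},
... )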

Parameters:
  • tag (str) – Kaggle dataset tag. Example: "coloradokb/dandelionimages" points to https://www.kaggle.com/coloradokb/dandelionimages

  • src (str, pathlib.Path) – Local path where the raw kaggle dataset will be downloaded.

  • dest (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use ‘activeloop login’ from command line)

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • exist_ok (bool) – If the kaggle dataset was already downloaded and exist_ok is True, ingestion will proceed without error.

  • images_compression (str) – For image classification datasets, this compression will be used for the images tensor. If images_compression is “auto”, compression will be automatically determined by the most common extension in the directory.

  • dest_creds (Optional[Dict]) – A dictionary containing credentials used to access the destination path of the dataset.

  • kaggle_credentials (dict) – A dictionary containing kaggle credentials {"username": "YOUR_USERNAME", "key": "YOUR_KEY"}. If None, environment variables/the kaggle.json file will be used if available.

  • progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.

  • summary (bool) – Generates ingestion summary. Set to True by default.

  • **dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.dataset().

Returns:

New dataset object with structured dataset.

Return type:

Dataset

Raises:

SamePathException – If the source and destination path are the same.

Note

Currently only local source paths and image classification datasets are supported for automatic ingestion.

deeplake.ingest_dataframe(src, dest: str | Path | Dataset, dest_creds: Dict | None = None, progressbar: bool = True, **dataset_kwargs)

Converts a pandas dataframe to a Deep Lake Dataset.
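
Examples

A minimal sketch; the dataframe contents and destination path are illustrative:

>>> import pandas as pd
>>> df = pd.DataFrame({"images": ["cat0.jpg", "dog0.jpg"], "labels": [0, 1]})
>>> ds = deeplake.ingest_dataframe(df, "./datasets/from_dataframe")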

Parameters:
  • src (pd.DataFrame) – The pandas dataframe to be converted.

  • dest (str, pathlib.Path, Dataset) –

    • A Dataset or The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use ‘activeloop login’ from command line)

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • dest_creds (Optional[Dict]) – A dictionary containing credentials used to access the destination path of the dataset.

  • progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.

  • **dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.dataset().

Returns:

New dataset created from the dataframe.

Return type:

Dataset

Raises:

Exception – If src is not a valid pandas dataframe object.

deeplake.ingest_huggingface(src, dest, use_progressbar=True) → Dataset

Converts Hugging Face datasets to Deep Lake format.
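
Examples

A minimal sketch; "mnist" and the destination path stand in for any Hugging Face dataset and target location:

>>> from datasets import load_dataset
>>> hf_ds = load_dataset("mnist", split="train")
>>> ds = deeplake.ingest_huggingface(hf_ds, "./datasets/mnist_deeplake")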

Parameters:
  • src (hfDataset, DatasetDict) – Hugging Face Dataset or DatasetDict to be converted. Data in different splits of a DatasetDict will be stored under respective tensor groups.

  • dest (Dataset, str, pathlib.Path) – Destination dataset or path to it.

  • use_progressbar (bool) – Defines whether a progress bar should be used to show conversion progress.

Returns:

The destination Deep Lake dataset.

Return type:

Dataset

Note

  • if DatasetDict looks like:

    >>> {
    ...    train: Dataset({
    ...        features: ['data']
    ...    }),
    ...    validation: Dataset({
    ...        features: ['data']
    ...    }),
    ...    test: Dataset({
    ...        features: ['data']
    ...    }),
    ... }
    

it will be converted to a Deep Lake Dataset with tensors ['train/data', 'validation/data', 'test/data'].

Features of the type Sequence(feature=Value(dtype='string')) are not supported. Columns of such type are skipped.

deeplake.load(path: str | Path, read_only: bool | None = None, memory_cache_size: int = 256, local_cache_size: int = 0, creds: dict | None = None, token: str | None = None, verbose: bool = True, access_method: str = 'stream') → Dataset

Loads an existing dataset.
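
Examples

The paths below are placeholders:

>>> ds = deeplake.load("hub://username/dataset")
>>> ds = deeplake.load("s3://mybucket/my_dataset", read_only=True)
>>> ds = deeplake.load("./datasets/my_dataset")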

Parameters:
  • path (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use ‘activeloop login’ from command line)

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • read_only (bool, optional) – Opens dataset in read-only mode if this is passed as True. Defaults to False. Datasets stored on Deep Lake cloud that your account does not have write access to will automatically open in read-only mode.

  • memory_cache_size (int) – The size of the memory cache to be used in MB.

  • local_cache_size (int) – The size of the local filesystem cache to be used in MB.

  • creds (dict, optional) –

    • A dictionary containing credentials used to access the dataset at the path.

    • If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.

    • It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

  • verbose (bool) – If True, logs will be printed. Defaults to True.

  • access_method (str) –

    The access method to use for the dataset. Can be:

    • ’stream’

      • Streams the data from the dataset i.e. only fetches data when required. This is the default value.

    • ’download’

      • Downloads the data to the local filesystem at the path specified in the environment variable DEEPLAKE_DOWNLOAD_PATH. This will overwrite the local copy at that path if the dataset was previously downloaded.

      • Raises an exception if DEEPLAKE_DOWNLOAD_PATH environment variable is not set or if the dataset does not exist.

      • The ‘download’ access method can be modified to specify num_workers and/or scheduler. For example: ‘download:2:processed’ will use 2 workers and use processed scheduler, while ‘download:3’ will use 3 workers and default scheduler (threaded), and ‘download:processed’ will use a single worker and use processed scheduler.

    • ’local’

      • Downloads dataset if it doesn’t already exist, otherwise loads from local storage.

      • Raises an exception if DEEPLAKE_DOWNLOAD_PATH environment variable is not set or the dataset is not found in DEEPLAKE_DOWNLOAD_PATH.

      • The ‘local’ access method can be modified to specify num_workers and/or scheduler to be used in case dataset needs to be downloaded. If dataset needs to be downloaded, ‘local:2:processed’ will use 2 workers and use processed scheduler, while ‘local:3’ will use 3 workers and default scheduler (threaded), and ‘local:processed’ will use a single worker and use processed scheduler.

Returns:

Dataset loaded using the arguments provided.

Return type:

Dataset

Warning

Setting access_method to download will overwrite the local copy of the dataset if it was previously downloaded.

Note

Any changes made to the dataset in download / local mode will only be made to the local copy and will not be reflected in the original dataset.

deeplake.delete(path: str | Path, force: bool = False, large_ok: bool = False, creds: dict | None = None, token: str | None = None, verbose: bool = False) → None

Deletes a dataset at a given path.
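
Examples

The paths below are placeholders:

>>> deeplake.delete("./datasets/old_ds")
>>> deeplake.delete("hub://username/big_ds", large_ok=True)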

Parameters:
  • path (str, pathlib.Path) – The path to the dataset to be deleted.

  • force (bool) – Delete data regardless of whether it looks like a deeplake dataset. All data at the path will be removed if set to True.

  • large_ok (bool) – Allow deletion of datasets larger than 1 GB. Disabled by default.

  • creds (dict, optional) –

    • A dictionary containing credentials used to access the dataset at the path.

    • If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.

    • It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

  • verbose (bool) – If True, logs will be printed. Defaults to False.

Raises:
  • DatasetHandlerError – If a Dataset does not exist at the given path and force = False.

  • NotImplementedError – When attempting to delete a managed view.

Warning

This is an irreversible operation. Data once deleted cannot be recovered.

deeplake.rename(old_path: str | Path, new_path: str | Path, creds: dict | None = None, token: str | None = None) → Dataset

Renames dataset at old_path to new_path.

Examples

>>> deeplake.rename("hub://username/image_ds", "hub://username/new_ds")
>>> deeplake.rename("s3://mybucket/my_ds", "s3://mybucket/renamed_ds")
Parameters:
  • old_path (str, pathlib.Path) – The path to the dataset to be renamed.

  • new_path (str, pathlib.Path) – Path to the dataset after renaming.

  • creds (dict, optional) –

    • A dictionary containing credentials used to access the dataset at the path.

    • This takes precedence over credentials present in the environment. Currently only works with s3 paths.

    • It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’ and ‘aws_region’ as keys.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

Returns:

The renamed Dataset.

Return type:

Dataset

Raises:

DatasetHandlerError – If a Dataset does not exist at the given path or if the new path points to a different directory.

deeplake.copy(src: str | Path | Dataset, dest: str | Path, tensors: List[str] | None = None, overwrite: bool = False, src_creds=None, src_token=None, dest_creds=None, dest_token=None, num_workers: int = 0, scheduler='threaded', progressbar=True)

Copies dataset at src to dest. Version control history is not included.
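
Examples

A minimal sketch; the source and destination paths are placeholders:

>>> new_ds = deeplake.copy("hub://username/source_ds", "s3://mybucket/copied_ds", num_workers=4)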

Parameters:
  • src (Union[str, Dataset, pathlib.Path]) – The Dataset or the path to the dataset to be copied.

  • dest (str, pathlib.Path) – Destination path to copy to.

  • tensors (List[str], optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.

  • overwrite (bool) – If True and a dataset exists at dest, it will be overwritten. Defaults to False.

  • src_creds (dict, optional) –

    • A dictionary containing credentials used to access the dataset at the path.

    • If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.

    • It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.

  • src_token (str, optional) – Activeloop token, used for fetching credentials to the dataset at src if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

  • dest_creds (dict, optional) – Credentials required to create/overwrite the dataset at dest.

  • dest_token (str, optional) – Token used for fetching credentials to dest.

  • num_workers (int) – The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.

  • scheduler (str) – The scheduler to be used for copying. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.

  • progressbar (bool) – Displays a progress bar if True (default).

Returns:

New dataset object.

Return type:

Dataset

Raises:

DatasetHandlerError – If a dataset already exists at destination path and overwrite is False.

deeplake.deepcopy(src: str | Path, dest: str | Path, tensors: List[str] | None = None, overwrite: bool = False, src_creds=None, src_token=None, dest_creds=None, dest_token=None, num_workers: int = 0, scheduler='threaded', progressbar=True, public: bool = False, verbose: bool = True)

Copies dataset at src to dest including version control history.
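
Examples

A minimal sketch; the source and destination paths are placeholders:

>>> new_ds = deeplake.deepcopy("hub://username/source_ds", "./datasets/full_copy", num_workers=2)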

Parameters:
  • src (str, pathlib.Path) – Path to the dataset to be copied.

  • dest (str, pathlib.Path) – Destination path to copy to.

  • tensors (List[str], optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.

  • overwrite (bool) – If True and a dataset exists at destination, it will be overwritten. Defaults to False.

  • src_creds (dict, optional) –

    • A dictionary containing credentials used to access the dataset at the path.

    • If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.

    • It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.

  • src_token (str, optional) – Activeloop token, used for fetching credentials to the dataset at src if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

  • dest_creds (dict, optional) – Credentials required to create/overwrite the dataset at dest.

  • dest_token (str, optional) – Token used for fetching credentials to dest.

  • num_workers (int) – The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.

  • scheduler (str) – The scheduler to be used for copying. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.

  • progressbar (bool) – Displays a progress bar if True (default).

  • public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.

  • verbose (bool) – If True, logs will be printed. Defaults to True.

Returns:

New dataset object.

Return type:

Dataset

Raises:

DatasetHandlerError – If a dataset already exists at destination path and overwrite is False.

deeplake.connect(src_path: str, creds_key: str, dest_path: str | None = None, org_id: str | None = None, ds_name: str | None = None, token: str | None = None) → Dataset

Connects dataset at src_path to Deep Lake via the provided path.

Examples

>>> # Connect an s3 dataset
>>> ds = deeplake.connect(src_path="s3://bucket/dataset", dest_path="hub://my_org/dataset", creds_key="my_managed_credentials_key")
>>> # or
>>> ds = deeplake.connect(src_path="s3://bucket/dataset", org_id="my_org", creds_key="my_managed_credentials_key")
Parameters:
  • src_path (str) – Cloud path to the source dataset. Can be an s3 path like s3://bucket/path/to/dataset or a gcs path like gcs://bucket/path/to/dataset.

  • creds_key (str) – The managed credentials to be used for accessing the source path.

  • dest_path (str, optional) – The full path to where the connected Deep Lake dataset will reside. Can be a Deep Lake path like hub://organization/dataset.

  • org_id (str, optional) – The organization to where the connected Deep Lake dataset will be added.

  • ds_name (str, optional) – The name of the connected Deep Lake dataset. Will be inferred from dest_path or src_path if not provided.

  • token (str, optional) – Activeloop token used to fetch the managed credentials.

Returns:

The connected Deep Lake dataset.

Return type:

Dataset

Raises:
  • InvalidSourcePathError – If the src_path is not a valid s3 or gcs path.

  • InvalidDestinationPathError – If dest_path, or org_id and ds_name do not form a valid Deep Lake path.

deeplake.list(workspace: str = '', token: str | None = None) → None

List all available Deep Lake cloud datasets.
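
Examples

The workspace name is a placeholder:

>>> deeplake.list("username")
>>> deeplake.list()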

Parameters:
  • workspace (str) – Specify user/organization name. If not given, returns a list of all datasets that can be accessed, regardless of what workspace they are in. Otherwise, lists all datasets in the given workspace.

  • token (str, optional) – Activeloop token, used for fetching credentials for Deep Lake datasets. This is optional; tokens are normally autogenerated.

Returns:

List of dataset names.

Return type:

List

deeplake.exists(path: str | Path, creds: dict | None = None, token: str | None = None) → bool

Checks if a dataset exists at the given path.
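
Examples

A minimal sketch; the path is a placeholder:

>>> if deeplake.exists("hub://username/dataset"):
...     ds = deeplake.load("hub://username/dataset")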

Parameters:
  • path (str, pathlib.Path) – the path which needs to be checked.

  • creds (dict, optional) – A dictionary containing credentials used to access the dataset at the path.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

Returns:

A boolean confirming whether the dataset exists or not at the given path.

deeplake.read(path: str | Path, verify: bool = False, creds: Dict | None = None, compression: str | None = None, storage: StorageProvider | None = None) → Sample

Utility that reads raw data from supported files into Deep Lake format.

  • Recompresses data into the format required by the tensor if permitted by the tensor htype.

  • Simply copies the data in the file if the file format matches the sample_compression of the tensor, thus maximizing upload speeds.

Examples

>>> ds.create_tensor("images", htype="image", sample_compression="jpeg")
>>> ds.images.append(deeplake.read("path/to/cat.jpg"))
>>> ds.images.shape
(1, 399, 640, 3)
>>> ds.create_tensor("videos", htype="video", sample_compression="mp4")
>>> ds.videos.append(deeplake.read("path/to/video.mp4"))
>>> ds.videos.shape
(1, 136, 720, 1080, 3)
>>> ds.create_tensor("images", htype="image", sample_compression="jpeg")
>>> ds.images.append(deeplake.read("https://picsum.photos/200/300"))
>>> ds.images[0].shape
(300, 200, 3)

Supported file types:

Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm"
Audio: "flac", "mp3", "wav"
Video: "mp4", "mkv", "avi"
Dicom: "dcm"
Parameters:
  • path (str) – Path to a supported file.

  • verify (bool) – If True, contents of the file are verified.

  • creds (optional, Dict) – Credentials for s3, gcp and http urls.

  • compression (optional, str) – Format of the file. Only required if path does not have an extension.

  • storage (optional, StorageProvider) – Storage provider to use to retrieve remote files. Useful if multiple files are being read from same storage to minimize overhead of creating a new provider.

Returns:

Sample object. Call sample.array to get the np.ndarray.

Return type:

Sample

Note

No data is actually loaded until you try to get a property of the returned Sample. This is useful for passing along to Tensor.append and Tensor.extend.

deeplake.link(path: str, creds_key: str | None = None) → LinkedSample

Utility that stores a link to raw data. Used to add data to a Deep Lake Dataset without copying it. See Link htype.

Supported file types:

Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm"
Audio: "flac", "mp3", "wav"
Video: "mp4", "mkv", "avi"
Dicom: "dcm"
Parameters:
  • path (str) – Path to a supported file.

  • creds_key (optional, str) – The credential key to use to read data for this sample. The actual credentials are fetched from the dataset.

Returns:

LinkedSample object that stores path and creds.

Return type:

LinkedSample

Examples

>>> ds = deeplake.dataset("test/test_ds")
>>> ds.create_tensor("images", htype="link[image]")
>>> ds.images.append(deeplake.link("https://picsum.photos/200/300"))

See the Link htype documentation for more examples.

deeplake.tiled(sample_shape: Tuple[int, ...], tile_shape: Tuple[int, ...] | None = None, dtype: str | dtype = dtype('uint8'))

Allocates an empty sample of shape sample_shape, broken into tiles of shape tile_shape (except for edge tiles).

Example

>>> with ds:
...    ds.create_tensor("image", htype="image", sample_compression="png")
...    ds.image.append(deeplake.tiled(sample_shape=(1003, 1103, 3), tile_shape=(10, 10, 3)))
...    ds.image[0][-217:, :212, 1:] = np.random.randint(0, 256, (217, 212, 2), dtype=np.uint8)
Parameters:
  • sample_shape (Tuple[int, ...]) – Full shape of the sample.

  • tile_shape (Optional, Tuple[int, ...]) – The sample will be stored as tiles, where each tile will have this shape (except edge tiles). If not specified, it will be computed such that each tile is close to half of the tensor’s max_chunk_size (after compression).

  • dtype (Union[str, np.dtype]) – Dtype for the sample array. Default uint8.

Returns:

A PartialSample instance which can be appended to a Tensor.

Return type:

PartialSample

deeplake.compute(fn, name: str | None = None) → Callable[[...], ComputeFunction]

Compute is a decorator for functions.

The decorated function should have at least two arguments; the first two correspond to sample_in and samples_out.

There can be as many other arguments as required.

The output should be appended/extended to the second argument using Deep Lake syntax.

Any value returned by the fn will be ignored.

Example:

@deeplake.compute
def my_fn(sample_in: Any, samples_out, my_arg0, my_arg1=0):
    samples_out.my_tensor.append(my_arg0 * my_arg1)

# This transform can be used via the eval method in one of these two ways:

# Directly evaluating the method
# here arg0 and arg1 correspond to the 3rd and 4th argument in my_fn
my_fn(arg0, arg1).eval(data_in, ds_out, scheduler="threaded", num_workers=5)

# As a part of a Transform pipeline containing other functions
pipeline = deeplake.compose([my_fn(a, b), another_function(x=2)])
pipeline.eval(data_in, ds_out, scheduler="processed", num_workers=2)

The eval method evaluates the pipeline/transform function.

It has the following arguments:

  • data_in: Input passed to the transform to generate the output dataset.

    • It should support __getitem__ and __len__. This can be a Deep Lake dataset.

  • ds_out (Dataset, optional): The dataset object to which the transform will be written.

    • If this is not provided, data_in will be overwritten if it is a Deep Lake dataset; otherwise an error will be raised.

    • It should already contain tensors for all keys being generated in the output.

    • Its initial state should be either:

      • Empty, i.e. all tensors have no samples. In this case all samples are added to the dataset.

      • All tensors are populated and have the same length. In this case new samples are appended to the dataset.

  • num_workers (int): The number of workers to use for performing the transform.

    • Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.

  • scheduler (str): The scheduler to be used to compute the transformation.

    • Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.

  • progressbar (bool): Displays a progress bar if True (default).

  • skip_ok (bool): If True, skips the check for output tensors generated.

    • This allows the user to skip certain tensors in the function definition.

    • This is especially useful for inplace transformations in which certain tensors are not modified. Defaults to False.

It raises the following errors:

  • InvalidInputDataError: If data_in passed to transform is invalid. It should support __getitem__ and __len__ operations. Using a scheduler other than ‘threaded’ when data_in is a Deep Lake dataset whose base storage is memory will also raise this.

  • InvalidOutputDatasetError: If the tensors of ds_out passed to transform do not all have the same length. Using a scheduler other than ‘threaded’ when ds_out is a Deep Lake dataset whose base storage is memory will also raise this.

  • TensorMismatchError: If one or more of the outputs generated during transform contain different tensors than the ones present in ‘ds_out’ provided to transform.

  • UnsupportedSchedulerError: If the scheduler passed is not recognized. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’.

  • TransformError: All other exceptions raised if there are problems while running the pipeline.

deeplake.compose(functions: List[ComputeFunction])

Takes a list of functions decorated using deeplake.compute() and creates a pipeline that can be evaluated using .eval().

Example:

pipeline = deeplake.compose([my_fn(a=3), another_function(b=2)])
pipeline.eval(data_in, ds_out, scheduler="processed", num_workers=2)

The eval method evaluates the pipeline/transform function.

It has the following arguments:

  • data_in: Input passed to the transform to generate the output dataset.

    • It should support __getitem__ and __len__. This can be a Deep Lake dataset.

  • ds_out (Dataset, optional): The dataset object to which the transform will be written.

    • If this is not provided, data_in will be overwritten if it is a Deep Lake dataset; otherwise an error will be raised.

    • It should already contain tensors for all keys being generated in the output.

    • Its initial state should be either:

      • Empty, i.e. all tensors have no samples. In this case all samples are added to the dataset.

      • All tensors are populated and have the same length. In this case new samples are appended to the dataset.

  • num_workers (int): The number of workers to use for performing the transform.

    • Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.

  • scheduler (str): The scheduler to be used to compute the transformation.

    • Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.

  • progressbar (bool): Displays a progress bar if True (default).

  • skip_ok (bool): If True, skips the check for output tensors generated.

    • This allows the user to skip certain tensors in the function definition.

    • This is especially useful for inplace transformations in which certain tensors are not modified. Defaults to False.

It raises the following errors:

  • InvalidInputDataError: If data_in passed to transform is invalid. It should support __getitem__ and __len__ operations. Using a scheduler other than ‘threaded’ when data_in is a Deep Lake dataset whose base storage is memory will also raise this.

  • InvalidOutputDatasetError: If the tensors of ds_out passed to transform do not all have the same length. Using a scheduler other than ‘threaded’ when ds_out is a Deep Lake dataset whose base storage is memory will also raise this.

  • TensorMismatchError: If one or more of the outputs generated during transform contain different tensors than the ones present in ‘ds_out’ provided to transform.

  • UnsupportedSchedulerError: If the scheduler passed is not recognized. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’.

  • TransformError: All other exceptions raised if there are problems while running the pipeline.