deeplake

The deeplake package provides a database which stores data as compressed chunked arrays that can be stored anywhere and later streamed to deep learning models.

deeplake.dataset(path: Union[str, Path], runtime: Optional[Dict] = None, read_only: Optional[bool] = None, overwrite: bool = False, public: bool = False, memory_cache_size: int = 2000, local_cache_size: int = 0, creds: Optional[Union[Dict, str]] = None, token: Optional[str] = None, org_id: Optional[str] = None, verbose: bool = True, access_method: str = 'stream', unlink: bool = False, reset: bool = False, check_integrity: Optional[bool] = False, lock_enabled: Optional[bool] = True, lock_timeout: Optional[int] = 0, index_params: Optional[Dict[str, Union[int, str]]] = None, indra: bool = False)

Returns a Dataset object referencing either a new or existing dataset.

Examples

>>> ds = deeplake.dataset("hub://username/dataset")
>>> ds = deeplake.dataset("s3://mybucket/my_dataset")
>>> ds = deeplake.dataset("./datasets/my_dataset", overwrite=True)

Loading to a specific version:

>>> ds = deeplake.dataset("hub://username/dataset@new_branch")
>>> ds = deeplake.dataset("hub://username/dataset@3e49cded62b6b335c74ff07e97f8451a37aca7b2")
>>> my_commit_id = "3e49cded62b6b335c74ff07e97f8451a37aca7b2"
>>> ds = deeplake.dataset(f"hub://username/dataset@{my_commit_id}")
Parameters
  • path (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are authenticated to Deep Lake (pass in a token using the ‘token’ parameter).

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

    • Loading to a specific version:

      • You can also specify a commit_id or branch to load the dataset to that version directly by using the @ symbol.

      • The path will then be of the form hub://username/dataset@{branch} or hub://username/dataset@{commit_id}.

      • See examples above.

  • runtime (dict) – Parameters for Activeloop DB Engine. Only applicable for hub:// paths.

  • read_only (bool, optional) – Opens dataset in read only mode if this is passed as True. Defaults to False. Datasets stored on Deep Lake cloud that your account does not have write access to will automatically open in read mode.

  • overwrite (bool) – If set to True this overwrites the dataset if it already exists. Defaults to False.

  • public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.

  • memory_cache_size (int) – The size of the memory cache to be used in MB.

  • local_cache_size (int) – The size of the local filesystem cache to be used in MB.

  • creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. - If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths. - It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys. - If ‘ENV’ is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying ‘ENV’ will override the credentials fetched from Activeloop and use local ones.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

  • org_id (str, Optional) – Organization id to be used for enabling high-performance features. Only applicable for local datasets.

  • verbose (bool) – If True, logs will be printed. Defaults to True.

  • access_method (str) –

    The access method to use for the dataset. Can be:

    • ’stream’

      • Streams the data from the dataset i.e. only fetches data when required. This is the default value.

    • ’download’

      • Downloads the data to the local filesystem to the path specified in environment variable DEEPLAKE_DOWNLOAD_PATH. This will overwrite DEEPLAKE_DOWNLOAD_PATH.

      • Raises an exception if DEEPLAKE_DOWNLOAD_PATH environment variable is not set or if the dataset does not exist.

      • The ‘download’ access method can be modified to specify num_workers and/or scheduler. For example: ‘download:2:processed’ will use 2 workers and use processed scheduler, while ‘download:3’ will use 3 workers and default scheduler (threaded), and ‘download:processed’ will use a single worker and use processed scheduler.

    • ’local’

      • Downloads the dataset if it doesn’t already exist, otherwise loads from local storage.

      • Raises an exception if DEEPLAKE_DOWNLOAD_PATH environment variable is not set.

      • The ‘local’ access method can be modified to specify num_workers and/or scheduler to be used in case dataset needs to be downloaded. If dataset needs to be downloaded, ‘local:2:processed’ will use 2 workers and use processed scheduler, while ‘local:3’ will use 3 workers and default scheduler (threaded), and ‘local:processed’ will use a single worker and use processed scheduler.

  • unlink (bool) – Downloads linked samples if set to True. Only applicable if access_method is download or local. Defaults to False.

  • reset (bool) – If the specified dataset cannot be loaded due to a corrupted HEAD state of the branch being loaded, setting reset=True will reset HEAD changes and load the previous version.

  • check_integrity (bool, Optional) – If None, an integrity check is performed only if the dataset has 20 or fewer tensors. Set to True to force the integrity check, or False to skip it.

  • lock_timeout (int) – Number of seconds to wait before throwing a LockException. If None, wait indefinitely

  • lock_enabled (bool) – If true, the dataset manages a write lock. NOTE: Only set to False if you are managing concurrent access externally

  • index_params (Optional[Dict[str, Union[int, str]]]) – Index parameters used while creating a vector store, passed down to the dataset.

  • indra (bool) – Flag indicating whether the indra api should be used to create the dataset. Defaults to False.

Returns

Dataset created using the arguments provided.

Return type

Dataset

Raises
  • AgreementError – When agreement is rejected

  • UserNotLoggedInException – When user is not authenticated

  • InvalidTokenException – If the specified token is invalid

  • TokenPermissionError – When there are permission or other errors related to token

  • CheckoutError – If version address specified in the path cannot be found

  • DatasetCorruptError – If loading the dataset failed due to corruption and reset is not True

  • ValueError – If a version is specified in the path when creating a dataset, or if the org id is provided but the dataset is not local

  • ReadOnlyModeError – If reset is attempted in read-only mode

  • LockedException – When attempting to open a dataset for writing when it is locked by another machine

  • DatasetHandlerError – If overwriting the dataset fails

  • Exception – Re-raises caught exception if reset cannot fix the issue

Danger

Setting overwrite to True will delete all of your data if it exists! Be very careful when setting this parameter.

Warning

Setting access_method to download will overwrite the local copy of the dataset if it was previously downloaded.

Note

Any changes made to the dataset in download / local mode will only be made to the local copy and will not be reflected in the original dataset.
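
The access_method strings described above combine the method name with an optional worker count and scheduler. A minimal sketch, assuming DEEPLAKE_DOWNLOAD_PATH points to a writable local directory (the hub path and directory below are placeholders):

>>> import os
>>> os.environ["DEEPLAKE_DOWNLOAD_PATH"] = "/tmp/deeplake_downloads"  # placeholder directory
>>> # Stream (default): fetch data only when it is accessed
>>> ds = deeplake.dataset("hub://username/dataset")
>>> # Download with 2 workers and the processed scheduler, then read from the local copy
>>> ds = deeplake.dataset("hub://username/dataset", access_method="download:2:processed")
>>> # Reuse a previously downloaded copy if present, otherwise download it first
>>> ds = deeplake.dataset("hub://username/dataset", access_method="local")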

deeplake.empty(path: Union[str, Path], runtime: Optional[dict] = None, overwrite: bool = False, public: bool = False, memory_cache_size: int = 2000, local_cache_size: int = 0, creds: Optional[Union[Dict, str]] = None, token: Optional[str] = None, org_id: Optional[str] = None, lock_enabled: Optional[bool] = True, lock_timeout: Optional[int] = 0, verbose: bool = True, index_params: Optional[Dict[str, Union[int, str]]] = None) Dataset

Creates an empty dataset

Parameters
  • path (str, pathlib.Path) –

    • The full path to the dataset. It can be:

    • a Deep Lake cloud path of the form hub://org_id/dataset_name. Requires registration with Deep Lake.

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • runtime (dict) – Parameters for creating a dataset in the Deep Lake Tensor Database. Only applicable for paths of the form hub://org_id/dataset_name and runtime must be {"tensor_db": True}.

  • overwrite (bool) – If set to True this overwrites the dataset if it already exists. Defaults to False.

  • public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.

  • memory_cache_size (int) – The size of the memory cache to be used in MB.

  • local_cache_size (int) – The size of the local filesystem cache to be used in MB.

  • creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. - If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths. - It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys. - If ‘ENV’ is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying ‘ENV’ will override the credentials fetched from Activeloop and use local ones.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

  • org_id (str, Optional) – Organization id to be used for enabling high-performance features. Only applicable for local datasets.

  • verbose (bool) – If True, logs will be printed. Defaults to True.

  • lock_timeout (int) – Number of seconds to wait before throwing a LockException. If None, wait indefinitely

  • lock_enabled (bool) – If true, the dataset manages a write lock. NOTE: Only set to False if you are managing concurrent access externally.

  • index_params (Optional[Dict[str, Union[int, str]]]) – Index parameters used while creating a vector store, passed down to the dataset.

Returns

Dataset created using the arguments provided.

Return type

Dataset

Raises

Danger

Setting overwrite to True will delete all of your data if it exists! Be very careful when setting this parameter.
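
A minimal sketch of a typical workflow (the local path, tensor names, and htypes are illustrative): create an empty dataset, then add tensors and samples to it.

>>> ds = deeplake.empty("./my_empty_dataset")  # placeholder local path
>>> with ds:
...     ds.create_tensor("images", htype="image", sample_compression="jpeg")
...     ds.create_tensor("labels", htype="class_label")
...     ds.images.append(deeplake.read("path/to/cat.jpg"))  # placeholder image file
...     ds.labels.append(0)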

deeplake.like(dest: Union[str, Path], src: Union[str, Dataset, Path], runtime: Optional[Dict] = None, tensors: Optional[List[str]] = None, overwrite: bool = False, creds: Optional[Union[dict, str]] = None, token: Optional[str] = None, org_id: Optional[str] = None, public: bool = False, verbose: bool = True) Dataset

Creates a new dataset by copying the source dataset’s structure to a new location. No samples are copied, only the meta/info for the dataset and its tensors.

Parameters
  • dest – Empty Dataset or Path where the new dataset will be created.

  • src (Union[str, Dataset]) – Path or dataset object that will be used as the template for the new dataset.

  • runtime (dict) – Parameters for Activeloop DB Engine. Only applicable for hub:// paths.

  • tensors (List[str], optional) – Names of tensors (and groups) to be replicated. If not specified all tensors in source dataset are considered.

  • overwrite (bool) – If True and a dataset exists at destination, it will be overwritten. Defaults to False.

  • creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. - If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths. - It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys. - If ‘ENV’ is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying ‘ENV’ will override the credentials fetched from Activeloop and use local ones.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

  • org_id (str, Optional) – Organization id to be used for enabling high-performance features. Only applicable for local datasets.

  • public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.

  • verbose (bool) – If True, logs will be printed. Defaults to True.

Returns

New dataset object.

Return type

Dataset

Raises

ValueError – If org_id is specified for a non-local dataset.
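
A brief sketch (paths are placeholders): replicate the structure of an existing dataset without copying any samples.

>>> src_ds = deeplake.load("hub://username/source_dataset")  # placeholder source
>>> new_ds = deeplake.like("./structured_copy", src_ds)      # same tensors and meta, zero samples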

deeplake.ingest_classification(src: Union[str, Path], dest: Union[str, Path], image_params: Optional[Dict] = None, label_params: Optional[Dict] = None, dest_creds: Optional[Union[Dict, str]] = None, progressbar: bool = True, summary: bool = True, num_workers: int = 0, shuffle: bool = True, token: Optional[str] = None, connect_kwargs: Optional[Dict] = None, indra: bool = False, **dataset_kwargs) Dataset

Ingest a dataset of images from a local folder to a Deep Lake Dataset. Images should be stored in subfolders by class name.

Parameters
  • src (str, pathlib.Path) – Local path to where the unstructured dataset of images is stored or path to csv file.

  • dest (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://org_id/datasetname. To write to Deep Lake cloud datasets, ensure that you are authenticated to Deep Lake (pass in a token using the ‘token’ parameter).

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • image_params (Optional[Dict]) – A dictionary containing parameters for the images tensor.

  • label_params (Optional[Dict]) – A dictionary containing parameters for the labels tensor.

  • dest_creds (Optional[Union[str, Dict]]) – The string ENV or a dictionary containing credentials used to access the destination path of the dataset.

  • progressbar (bool) – Enables or disables ingestion progress bar. Defaults to True.

  • summary (bool) – If True, a summary of skipped files will be printed after completion. Defaults to True.

  • num_workers (int) – The number of workers to use for ingestion. Set to 0 by default.

  • shuffle (bool) – Shuffles the input data prior to ingestion. Since data arranged in folders by class is highly non-random, shuffling is important in order to produce optimal results when training. Defaults to True.

  • token (Optional[str]) – The token to use for accessing the dataset.

  • connect_kwargs (Optional[Dict]) – If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to Dataset.connect.

  • indra (bool) – Flag indicating whether the indra api should be used to create the dataset. Defaults to False.

  • **dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.empty().

Returns

New dataset object with structured dataset.

Return type

Dataset

Raises

Note

  • Currently only local source paths and image classification datasets / csv files are supported for automatic ingestion.

  • Supported filetypes: png/jpeg/jpg/csv.

  • All files and sub-directories with unsupported filetypes are ignored.

  • Valid source directory structures for image classification look like:

    data/
        img0.jpg
        img1.jpg
        ...
    
  • or:

    data/
        class0/
            cat0.jpg
            ...
        class1/
            dog0.jpg
            ...
        ...
    
  • or:

    data/
        train/
            class0/
                img0.jpg
                ...
            ...
        val/
            class0/
                img0.jpg
                ...
            ...
        ...
    
  • Classes defined as sub-directories can be accessed at ds["test/labels"].info.class_names.

  • Support for train and test sub directories is present under ds["train/images"], ds["train/labels"] and ds["test/images"], ds["test/labels"].

  • Mapping filenames to classes from an external file is currently not supported.
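
As a hedged sketch (the source and destination paths are placeholders, and the tensor names depend on the folder layout), ingesting a folder organized by class sub-directories might look like:

>>> ds = deeplake.ingest_classification(
...     "./data",                             # local folder with class sub-directories
...     "hub://org_id/my_classification_ds",  # placeholder destination
...     num_workers=2,
... )
>>> ds.labels.info.class_names                # class names inferred from the sub-directory names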

deeplake.ingest_coco(images_directory: Union[str, Path], annotation_files: Union[str, Path, List[str]], dest: Union[str, Path], key_to_tensor_mapping: Optional[Dict] = None, file_to_group_mapping: Optional[Dict] = None, ignore_one_group: bool = True, ignore_keys: Optional[List[str]] = None, image_params: Optional[Dict] = None, image_creds_key: Optional[str] = None, src_creds: Optional[Union[Dict, str]] = None, dest_creds: Optional[Union[Dict, str]] = None, inspect_limit: int = 1000000, progressbar: bool = True, shuffle: bool = False, num_workers: int = 0, token: Optional[str] = None, connect_kwargs: Optional[Dict] = None, **dataset_kwargs) Dataset

Ingest images and annotations in COCO format to a Deep Lake Dataset. The source data can be stored locally or in the cloud.

Examples

>>> # Ingest local data in COCO format to a Deep Lake dataset stored in Deep Lake storage.
>>> ds = deeplake.ingest_coco(
>>>     "<path/to/images/directory>",
>>>     ["path/to/annotation/file1.json", "path/to/annotation/file2.json"],
>>>     dest="hub://org_id/dataset",
>>>     key_to_tensor_mapping={"category_id": "labels", "bbox": "boxes"},
>>>     file_to_group_mapping={"file1.json": "group1", "file2.json": "group2"},
>>>     ignore_keys=["area", "image_id", "id"],
>>>     num_workers=4,
>>> )
>>> # Ingest data from your cloud into another Deep Lake dataset in your cloud, and connect that dataset to the Deep Lake backend.
>>> ds = deeplake.ingest_coco(
>>>     "s3://bucket/images/directory",
>>>     "s3://bucket/annotation/file1.json",
>>>     dest="s3://bucket/dataset_name",
>>>     ignore_one_group=True,
>>>     ignore_keys=["area", "image_id", "id"],
>>>     image_params={"name": "images", "htype": "link[image]", "sample_compression": "jpeg"},
>>>     image_creds_key="my_s3_managed_credentials",
>>>     src_creds=aws_creds, # Can also be inferred from environment
>>>     dest_creds=aws_creds, # Can also be inferred from environment
>>>     connect_kwargs={"creds_key": "my_s3_managed_credentials", "org_id": "org_id"},
>>>     num_workers=4,
>>> )
Parameters
  • images_directory (str, pathlib.Path) – The path to the directory containing images.

  • annotation_files (str, pathlib.Path, List[str]) – Path to JSON annotation files in COCO format.

  • dest (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://org_id/datasetname. To write to Deep Lake cloud datasets, ensure that you are authenticated to Deep Lake (pass in a token using the ‘token’ parameter).

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • key_to_tensor_mapping (Optional[Dict]) – A one-to-one mapping between COCO keys and Dataset tensor names.

  • file_to_group_mapping (Optional[Dict]) – A one-to-one mapping between COCO annotation file names and Dataset group names.

  • ignore_one_group (bool) – Skip creation of group in case of a single annotation file. Set to True by default.

  • ignore_keys (List[str]) – A list of COCO keys to ignore.

  • image_params (Optional[Dict]) – A dictionary containing parameters for the images tensor.

  • image_creds_key (Optional[str]) – The name of the managed credentials to use for accessing the images in the linked tensor (if applicable).

  • src_creds (Optional[Union[str, Dict]]) – Credentials to access the source data. If not provided, will be inferred from the environment.

  • dest_creds (Optional[Union[str, Dict]]) – The string ENV or a dictionary containing credentials used to access the destination path of the dataset.

  • inspect_limit (int) – The maximum number of samples to inspect in the annotations json, in order to generate the set of COCO annotation keys. Set to 1000000 by default.

  • progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.

  • shuffle (bool) – Shuffles the input data prior to ingestion. Set to False by default.

  • num_workers (int) – The number of workers to use for ingestion. Set to 0 by default.

  • token (Optional[str]) – The token to use for accessing the dataset and/or connecting it to Deep Lake.

  • connect_kwargs (Optional[Dict]) – If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to Dataset.connect.

  • **dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.empty().

Returns

The Dataset created from images and COCO annotations.

Return type

Dataset

Raises

IngestionError – If either key_to_tensor_mapping or file_to_group_mapping are not one-to-one.

deeplake.ingest_yolo(data_directory: Union[str, Path], dest: Union[str, Path], class_names_file: Optional[Union[str, Path]] = None, annotations_directory: Optional[Union[str, Path]] = None, allow_no_annotation: bool = False, image_params: Optional[Dict] = None, label_params: Optional[Dict] = None, coordinates_params: Optional[Dict] = None, src_creds: Optional[Union[Dict, str]] = None, dest_creds: Optional[Union[Dict, str]] = None, image_creds_key: Optional[str] = None, inspect_limit: int = 1000, progressbar: bool = True, shuffle: bool = False, num_workers: int = 0, token: Optional[str] = None, connect_kwargs: Optional[Dict] = None, **dataset_kwargs) Dataset

Ingest images and annotations (bounding boxes or polygons) in YOLO format to a Deep Lake Dataset. The source data can be stored locally or in the cloud.

Examples

>>> # Ingest local data in YOLO format to a Deep Lake dataset stored in Deep Lake storage.
>>> ds = deeplake.ingest_yolo(
>>>     "path/to/data/directory",
>>>     dest="hub://org_id/dataset",
>>>     allow_no_annotation=True,
>>>     token="my_activeloop_token",
>>>     num_workers=4,
>>> )
>>> # Ingest data from your cloud into another Deep Lake dataset in your cloud, and connect that dataset to the Deep Lake backend.
>>> ds = deeplake.ingest_yolo(
>>>     "s3://bucket/data_directory",
>>>     dest="s3://bucket/dataset_name",
>>>     image_params={"name": "image_links", "htype": "link[image]"},
>>>     image_creds_key="my_s3_managed_credentials",
>>>     src_creds=aws_creds, # Can also be inferred from environment
>>>     dest_creds=aws_creds, # Can also be inferred from environment
>>>     connect_kwargs={"creds_key": "my_s3_managed_credentials", "org_id": "org_id"},
>>>     num_workers=4,
>>> )
Parameters
  • data_directory (str, pathlib.Path) – The path to the directory containing the data (image files and annotation files). See the ‘annotations_directory’ input for specifying annotations in a separate directory.

  • dest (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://org_id/datasetname. To write to Deep Lake cloud datasets, ensure that you are authenticated to Deep Lake (pass in a token using the ‘token’ parameter).

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • class_names_file (str, pathlib.Path, optional) – Path to the file containing the class names on separate lines. This is typically a file titled classes.names.

  • annotations_directory (Optional[Union[str, pathlib.Path]]) – Path to directory containing the annotations. If specified, the ‘data_directory’ will not be examined for annotations.

  • allow_no_annotation (bool) – Flag to determine whether missing annotations files corresponding to an image should be treated as empty annotations. Set to False by default.

  • image_params (Optional[Dict]) – A dictionary containing parameters for the images tensor.

  • label_params (Optional[Dict]) – A dictionary containing parameters for the labels tensor.

  • coordinates_params (Optional[Dict]) – A dictionary containing parameters for the coordinates tensor. This tensor either contains bounding boxes or polygons.

  • src_creds (Optional[Union[str, Dict]]) – Credentials to access the source data. If not provided, will be inferred from the environment.

  • dest_creds (Optional[Union[str, Dict]]) – The string ENV or a dictionary containing credentials used to access the destination path of the dataset.

  • image_creds_key (Optional[str]) – creds_key for linked tensors, applicable if the htype for the images tensor is specified as ‘link[image]’ in the ‘image_params’ input.

  • inspect_limit (int) – The maximum number of annotations to inspect, in order to infer whether they are bounding boxes or polygons. This input is ignored if the htype is specified in the ‘coordinates_params’.

  • progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.

  • shuffle (bool) – Shuffles the input data prior to ingestion. Set to False by default.

  • num_workers (int) – The number of workers to use for ingestion. Set to 0 by default.

  • token (Optional[str]) – The token to use for accessing the dataset and/or connecting it to Deep Lake.

  • connect_kwargs (Optional[Dict]) – If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to Dataset.connect.

  • **dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.empty().

Returns

The Dataset created from the images and YOLO annotations.

Return type

Dataset

Raises

IngestionError – If annotations are not found for all the images and ‘allow_no_annotation’ is False

deeplake.ingest_kaggle(tag: str, src: Union[str, Path], dest: Union[str, Path], exist_ok: bool = False, images_compression: str = 'auto', dest_creds: Optional[Union[Dict, str]] = None, kaggle_credentials: Optional[dict] = None, progressbar: bool = True, summary: bool = True, shuffle: bool = True, indra: bool = False, **dataset_kwargs) Dataset

Download and ingest a kaggle dataset and store it as a structured dataset at the destination.

Parameters
  • tag (str) – Kaggle dataset tag. Example: "coloradokb/dandelionimages" points to https://www.kaggle.com/coloradokb/dandelionimages

  • src (str, pathlib.Path) – Local path to where the raw kaggle dataset will be downloaded to.

  • dest (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are authenticated to Deep Lake (pass in a token using the ‘token’ parameter).

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • exist_ok (bool) – If the kaggle dataset was already downloaded and exist_ok is True, ingestion will proceed without error.

  • images_compression (str) – For image classification datasets, this compression will be used for the images tensor. If images_compression is “auto”, compression will be automatically determined by the most common extension in the directory.

  • dest_creds (Optional[Union[str, Dict]]) – The string ENV or a dictionary containing credentials used to access the destination path of the dataset.

  • kaggle_credentials (dict) – A dictionary containing kaggle credentials {“username”:”YOUR_USERNAME”, “key”: “YOUR_KEY”}. If None, environment variables/the kaggle.json file will be used if available.

  • progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.

  • summary (bool) – Generates ingestion summary. Set to True by default.

  • shuffle (bool) – Shuffles the input data prior to ingestion. Since data arranged in folders by class is highly non-random, shuffling is important in order to produce optimal results when training. Defaults to True.

  • indra (bool) – Flag indicating whether the indra api should be used to create the dataset. Defaults to False.

  • **dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.dataset().

Returns

New dataset object with structured dataset.

Return type

Dataset

Raises

SamePathException – If the source and destination path are same.

Note

Currently only local source paths and image classification datasets are supported for automatic ingestion.
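
A hedged sketch using the tag from the parameter description above (the local download path and destination are placeholders; kaggle credentials fall back to environment variables or kaggle.json if not passed):

>>> ds = deeplake.ingest_kaggle(
...     "coloradokb/dandelionimages",
...     src="./kaggle_raw/dandelionimages",   # where the raw kaggle files are downloaded
...     dest="./dandelion_deeplake",          # placeholder destination dataset
...     images_compression="jpeg",
...     exist_ok=True,
... )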

deeplake.ingest_dataframe(src, dest: Union[str, Path], column_params: Optional[Dict] = None, src_creds: Optional[Union[Dict, str]] = None, dest_creds: Optional[Union[Dict, str]] = None, creds_key: Optional[Dict] = None, progressbar: bool = True, token: Optional[str] = None, connect_kwargs: Optional[Dict] = None, indra: bool = False, **dataset_kwargs)

Convert a pandas DataFrame to a Deep Lake Dataset. The contents of the dataframe can be parsed literally, or can be treated as links to local or cloud files.

Examples

>>> # Ingest data from a DataFrame into a Deep Lake dataset stored in Deep Lake storage.
>>> ds = deeplake.ingest_dataframe(
>>>     df,
>>>     dest="hub://org_id/dataset",
>>> )
>>> # Ingest data from a DataFrame into a Deep Lake dataset stored in Deep Lake storage. The filenames in `df_column_with_cloud_paths` will be used as the filenames for loading data into the dataset.
>>> ds = deeplake.ingest_dataframe(
>>>     df,
>>>     dest="hub://org_id/dataset",
>>>     column_params={"df_column_with_cloud_paths": {"name": "images", "htype": "image"}},
>>>     src_creds=aws_creds
>>> )
>>> # Ingest data from a DataFrame into a Deep Lake dataset stored in Deep Lake storage. The filenames in `df_column_with_cloud_paths` will be used as the filenames for linked data in the dataset.
>>> ds = deeplake.ingest_dataframe(
>>>     df,
>>>     dest="hub://org_id/dataset",
>>>     column_params={"df_column_with_cloud_paths": {"name": "image_links", "htype": "link[image]"}},
>>>     creds_key="my_s3_managed_credentials"
>>> )
>>> # Ingest data from a DataFrame into a Deep Lake dataset stored in your cloud, and connect that dataset to the Deep Lake backend. The filenames in `df_column_with_cloud_paths` will be used as the filenames for linked data in the dataset.
>>> ds = deeplake.ingest_dataframe(
>>>     df,
>>>     dest="s3://bucket/dataset_name",
>>>     column_params={"df_column_with_cloud_paths": {"name": "image_links", "htype": "link[image]"}},
>>>     creds_key="my_s3_managed_credentials",
>>>     connect_kwargs={"creds_key": "my_s3_managed_credentials", "org_id": "org_id"},
>>> )
Parameters
  • src (pd.DataFrame) – The pandas dataframe to be converted.

  • dest (str, pathlib.Path) –

    • A Dataset or the full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are authenticated to Deep Lake (pass in a token using the ‘token’ parameter).

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

  • column_params (Optional[Dict]) – A dictionary containing parameters for the tensors corresponding to the dataframe columns.

  • src_creds (Optional[Union[str, Dict]]) – Credentials to access the source data. If not provided, will be inferred from the environment.

  • dest_creds (Optional[Union[str, Dict]]) – The string ENV or a dictionary containing credentials used to access the destination path of the dataset.

  • creds_key (Optional[str]) – creds_key for linked tensors, applicable if the htype of any tensor is specified as ‘link[…]’ in the ‘column_params’ input.

  • progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.

  • token (Optional[str]) – The token to use for accessing the dataset.

  • connect_kwargs (Optional[Dict]) – A dictionary containing arguments to be passed to the dataset connect method. See Dataset.connect().

  • indra (bool) – Flag indicating whether the indra api should be used to create the dataset. Defaults to False.

  • **dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.empty().

Returns

New dataset created from the dataframe.

Return type

Dataset

Raises

Exception – If src is not a valid pandas dataframe object.

deeplake.ingest_huggingface(src, dest, use_progressbar=True, token: Optional[str] = None, connect_kwargs: Optional[Dict] = None, **dataset_kwargs) Dataset

Converts Hugging Face datasets to Deep Lake format.

Parameters
  • src (hfDataset, DatasetDict) – Hugging Face Dataset or DatasetDict to be converted. Data in different splits of a DatasetDict will be stored under respective tensor groups.

  • dest (Dataset, str, pathlib.Path) – Destination dataset or path to it.

  • use_progressbar (bool) – Defines if progress bar should be used to show conversion progress.

  • token (Optional[str]) – The token to use for accessing the dataset and/or connecting it to Deep Lake.

  • connect_kwargs (Optional[Dict]) – If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to Dataset.connect.

  • **dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.empty().

Returns

The destination Deep Lake dataset.

Return type

Dataset

Raises

ValueError – If dest is not a path or a Deep Lake Dataset.

Note

  • If the DatasetDict looks like:

    >>> {
    ...    train: Dataset({
    ...        features: ['data']
    ...    }),
    ...    validation: Dataset({
    ...        features: ['data']
    ...    }),
    ...    test: Dataset({
    ...        features: ['data']
    ...    }),
    ... }
    

it will be converted to a Deep Lake Dataset with tensors ['train/data', 'validation/data', 'test/data'].

Features of the type Sequence(feature=Value(dtype='string')) are not supported. Columns of such type are skipped.
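
A minimal sketch, assuming the Hugging Face datasets package is installed (the dataset name and destination path are placeholders):

>>> from datasets import load_dataset
>>> hf_ds = load_dataset("mnist")                        # a DatasetDict with 'train' and 'test' splits
>>> ds = deeplake.ingest_huggingface(hf_ds, "./mnist_deeplake")
>>> list(ds.tensors)                                     # splits become tensor groups, as in the note above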

deeplake.load(path: Union[str, Path], read_only: Optional[bool] = None, memory_cache_size: int = 2000, local_cache_size: int = 0, creds: Optional[Union[dict, str]] = None, token: Optional[str] = None, org_id: Optional[str] = None, verbose: bool = True, access_method: str = 'stream', unlink: bool = False, reset: bool = False, indra: bool = False, check_integrity: Optional[bool] = None, lock_timeout: Optional[int] = 0, lock_enabled: Optional[bool] = True, index_params: Optional[Dict[str, Union[int, str]]] = None) Dataset

Loads an existing dataset

Examples

>>> ds = deeplake.load("hub://username/dataset")
>>> ds = deeplake.load("s3://mybucket/my_dataset")
>>> ds = deeplake.load("./datasets/my_dataset")

Loading to a specific version:

>>> ds = deeplake.load("hub://username/dataset@new_branch")
>>> ds = deeplake.load("hub://username/dataset@3e49cded62b6b335c74ff07e97f8451a37aca7b2")
>>> my_commit_id = "3e49cded62b6b335c74ff07e97f8451a37aca7b2"
>>> ds = deeplake.load(f"hub://username/dataset@{my_commit_id}")
Parameters
  • path (str, pathlib.Path) –

    • The full path to the dataset. Can be:

    • a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are authenticated to Deep Lake (pass in a token using the ‘token’ parameter).

    • an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.

    • a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.

    • a memory path of the form mem://path/to/dataset which doesn’t save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.

    • Loading to a specific version:

      • You can also specify a commit_id or branch to load the dataset to that version directly by using the @ symbol.

      • The path will then be of the form hub://username/dataset@{branch} or hub://username/dataset@{commit_id}.

      • See examples above.

  • read_only (bool, optional) – Opens dataset in read only mode if this is passed as True. Defaults to False. Datasets stored on Deep Lake cloud that your account does not have write access to will automatically open in read mode.

  • memory_cache_size (int) – The size of the memory cache to be used in MB.

  • local_cache_size (int) – The size of the local filesystem cache to be used in MB.

  • creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. - If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths. - It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys. - If ‘ENV’ is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying ‘ENV’ will override the credentials fetched from Activeloop and use local ones.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

  • org_id (str, Optional) – Organization id to be used for enabling high-performance features. Only applicable for local datasets.

  • verbose (bool) – If True, logs will be printed. Defaults to True.

  • access_method (str) –

    The access method to use for the dataset. Can be:

    • ’stream’

      • Streams the data from the dataset i.e. only fetches data when required. This is the default value.

    • ’download’

      • Downloads the data to the local filesystem to the path specified in environment variable DEEPLAKE_DOWNLOAD_PATH. This will overwrite DEEPLAKE_DOWNLOAD_PATH.

      • Raises an exception if DEEPLAKE_DOWNLOAD_PATH environment variable is not set or if the dataset does not exist.

      • The ‘download’ access method can be modified to specify num_workers and/or scheduler. For example: ‘download:2:processed’ will use 2 workers and use processed scheduler, while ‘download:3’ will use 3 workers and default scheduler (threaded), and ‘download:processed’ will use a single worker and use processed scheduler.

    • ’local’

      • Downloads the dataset if it doesn’t already exist, otherwise loads from local storage.

      • Raises an exception if DEEPLAKE_DOWNLOAD_PATH environment variable is not set.

      • The ‘local’ access method can be modified to specify num_workers and/or scheduler to be used in case dataset needs to be downloaded. If dataset needs to be downloaded, ‘local:2:processed’ will use 2 workers and use processed scheduler, while ‘local:3’ will use 3 workers and default scheduler (threaded), and ‘local:processed’ will use a single worker and use processed scheduler.

  • unlink (bool) – Downloads linked samples if set to True. Only applicable if access_method is download or local. Defaults to False.

  • reset (bool) – If the specified dataset cannot be loaded due to a corrupted HEAD state of the branch being loaded, setting reset=True will reset HEAD changes and load the previous version.

  • check_integrity (bool, Optional) – If None (the default), an integrity check is performed only if the dataset has 20 or fewer tensors. Set to True to force the integrity check, or False to skip it.

  • indra (bool) – Flag indicating whether the indra api should be used to create the dataset. Defaults to False.

Returns

Dataset loaded using the arguments provided.

Return type

Dataset

Raises
  • DatasetHandlerError – If a Dataset does not exist at the given path.

  • AgreementError – When agreement is rejected

  • UserNotLoggedInException – When user is not authenticated

  • InvalidTokenException – If the specified token is invalid

  • TokenPermissionError – When there are permission or other errors related to token

  • CheckoutError – If version address specified in the path cannot be found

  • DatasetCorruptError – If loading the dataset failed due to corruption and reset is not True

  • ReadOnlyModeError – If reset is attempted in read-only mode

  • LockedException – When attempting to open a dataset for writing when it is locked by another machine

  • ValueError – If org_id is specified for a non-local dataset

  • Exception – Re-raises caught exception if reset cannot fix the issue

Warning

Setting access_method to download will overwrite the local copy of the dataset if it was previously downloaded.

Note

Any changes made to the dataset in download / local mode will only be made to the local copy and will not be reflected in the original dataset.
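
For example, a hedged sketch of recovering from a corrupted HEAD state with reset (the path is a placeholder, and the exception import path is assumed to be deeplake.util.exceptions):

>>> from deeplake.util.exceptions import DatasetCorruptError
>>> try:
...     ds = deeplake.load("hub://username/dataset")
... except DatasetCorruptError:
...     # Discard uncommitted HEAD changes and load the previous committed version
...     ds = deeplake.load("hub://username/dataset", reset=True)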

deeplake.delete(path: Union[str, Path], force: bool = False, large_ok: bool = False, creds: Optional[Union[dict, str]] = None, token: Optional[str] = None, verbose: bool = False) None

Deletes a dataset at a given path.

Parameters
  • path (str, pathlib.Path) – The path to the dataset to be deleted.

  • force (bool) – Delete data regardless of whether it looks like a deeplake dataset. All data at the path will be removed if set to True.

  • large_ok (bool) – Delete datasets larger than 1GB. Disabled by default.

  • creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. - If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths. - It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys. - If ‘ENV’ is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying ‘ENV’ will override the credentials fetched from Activeloop and use local ones.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

  • verbose (bool) – If True, logs will be printed. Defaults to False.

Raises
  • DatasetHandlerError – If a Dataset does not exist at the given path and force = False.

  • UserNotLoggedInException – When user is not authenticated.

  • NotImplementedError – When attempting to delete a managed view.

  • ValueError – If version is specified in the path

Warning

This is an irreversible operation. Data once deleted cannot be recovered.
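
A short sketch guarding the deletion (the path is a placeholder):

>>> if deeplake.exists("./old_dataset"):
...     deeplake.delete("./old_dataset", large_ok=True)  # large_ok required for datasets over 1GB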

deeplake.rename(old_path: Union[str, Path], new_path: Union[str, Path], creds: Optional[Union[dict, str]] = None, token: Optional[str] = None) Dataset

Renames dataset at old_path to new_path.

Examples

>>> deeplake.rename("hub://username/image_ds", "hub://username/new_ds")
>>> deeplake.rename("s3://mybucket/my_ds", "s3://mybucket/renamed_ds")
Parameters
  • old_path (str, pathlib.Path) – The path to the dataset to be renamed.

  • new_path (str, pathlib.Path) – Path to the dataset after renaming.

  • creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. - If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths. - It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys. - If ‘ENV’ is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying ‘ENV’ will override the credentials fetched from Activeloop and use local ones.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

Returns

The renamed Dataset.

Return type

Dataset

Raises

DatasetHandlerError – If a Dataset does not exist at the given path or if new path is to a different directory.

deeplake.copy(src: Union[str, Path, Dataset], dest: Union[str, Path], runtime: Optional[dict] = None, tensors: Optional[List[str]] = None, overwrite: bool = False, src_creds=None, dest_creds=None, token=None, num_workers: int = 0, scheduler='threaded', progressbar=True, **kwargs)

Copies dataset at src to dest. Version control history is not included.

Parameters
  • src (str, Dataset, pathlib.Path) – The Dataset or the path to the dataset to be copied.

  • dest (str, pathlib.Path) – Destination path to copy to.

  • runtime (dict) – Parameters for Activeloop DB Engine. Only applicable for hub:// paths.

  • tensors (List[str], optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.

  • overwrite (bool) – If True and a dataset exists at dest, it will be overwritten. Defaults to False.

  • src_creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. - If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths. - It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys. - If ‘ENV’ is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying ‘ENV’ will override the credentials fetched from Activeloop and use local ones.

  • dest_creds (dict, optional) – creds required to create / overwrite datasets at dest.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

  • num_workers (int) – The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.

  • scheduler (str) – The scheduler to be used for copying. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.

  • progressbar (bool) – Displays a progress bar if True (default).

  • **kwargs (dict) – Additional keyword arguments

Returns

New dataset object.

Return type

Dataset

Raises
  • DatasetHandlerError – If a dataset already exists at destination path and overwrite is False.

  • UnsupportedParameterException – If a parameter that is no longer supported is specified.

  • DatasetCorruptError – If loading source dataset fails with DatasetCorruptedError.
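
For example, a hedged sketch copying only selected tensors (the paths and tensor names are placeholders):

>>> ds = deeplake.copy(
...     "hub://username/source_dataset",   # placeholder source
...     "./local_copy",                    # placeholder destination
...     tensors=["images", "labels"],      # omit to copy all tensors
...     num_workers=4,
...     scheduler="threaded",
... )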

deeplake.deepcopy(src: Union[str, Path, Dataset], dest: Union[str, Path], runtime: Optional[Dict] = None, tensors: Optional[List[str]] = None, overwrite: bool = False, src_creds=None, dest_creds=None, token=None, num_workers: int = 0, scheduler='threaded', progressbar=True, public: bool = False, verbose: bool = True, **kwargs)

Copies dataset at src to dest including version control history.

Parameters
  • src (str, pathlib.Path, Dataset) – The Dataset or the path to the dataset to be copied.

  • dest (str, pathlib.Path) – Destination path to copy to.

  • runtime (dict) – Parameters for Activeloop DB Engine. Only applicable for hub:// paths.

  • tensors (List[str], optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.

  • overwrite (bool) – If True and a dataset exists at destination, it will be overwritten. Defaults to False.

  • src_creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. - If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths. - It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys. - If ‘ENV’ is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying ‘ENV’ will override the credentials fetched from Activeloop and use local ones.

  • dest_creds (dict, optional) – creds required to create / overwrite datasets at dest.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

  • num_workers (int) – The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.

  • scheduler (str) – The scheduler to be used for copying. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.

  • progressbar (bool) – Displays a progress bar if True (default).

  • public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.

  • verbose (bool) – If True, logs will be printed. Defaults to True.

  • **kwargs – Additional keyword arguments

Returns

New dataset object.

Return type

Dataset

Raises
  • DatasetHandlerError – If a dataset already exists at destination path and overwrite is False.

  • TypeError – If source is not a dataset.

  • UnsupportedParameterException – If a parameter that is no longer supported is specified.

  • DatasetCorruptError – If loading the source dataset fails with DatasetCorruptedError.
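
A similar sketch for deepcopy, which also carries over the version control history (paths are placeholders):

>>> ds = deeplake.deepcopy(
...     "hub://username/source_dataset",   # placeholder source
...     "hub://username/full_copy",        # placeholder destination
...     overwrite=True,                    # replace any existing dataset at the destination
...     num_workers=4,
... )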

deeplake.connect(src_path: str, creds_key: str, dest_path: Optional[str] = None, org_id: Optional[str] = None, ds_name: Optional[str] = None, token: Optional[str] = None) Dataset

Connects dataset at src_path to Deep Lake via the provided path.

Examples

>>> # Connect an s3 dataset
>>> ds = deeplake.connect(src_path="s3://bucket/dataset", dest_path="hub://my_org/dataset", creds_key="my_managed_credentials_key", token="my_activeloop_token")
>>> # or
>>> ds = deeplake.connect(src_path="s3://bucket/dataset", org_id="my_org", creds_key="my_managed_credentials_key", token="my_activeloop_token")
Parameters
  • src_path (str) – Cloud path to the source dataset. Can be an s3 path like s3://bucket/path/to/dataset, a gcs path like gcs://bucket/path/to/dataset, or an azure path like az://account_name/container/path/to/dataset.

  • creds_key (str) – The managed credentials to be used for accessing the source path.

  • dest_path (str, optional) – The full path to where the connected Deep Lake dataset will reside. Can be: a Deep Lake path like hub://organization/dataset

  • org_id (str, optional) – The organization to where the connected Deep Lake dataset will be added.

  • ds_name (str, optional) – The name of the connected Deep Lake dataset. Will be inferred from dest_path or src_path if not provided.

  • token (str, optional) – Activeloop token used to fetch the managed credentials.

Returns

The connected Deep Lake dataset.

Return type

Dataset

Raises
  • InvalidSourcePathError – If the src_path is not a valid s3, gcs or azure path.

  • InvalidDestinationPathError – If dest_path, or org_id and ds_name do not form a valid Deep Lake path.

  • TokenPermissionError – If the user does not have permission to create a dataset in the specified organization.

deeplake.exists(path: Union[str, Path], creds: Optional[Union[Dict, str]] = None, token: Optional[str] = None) bool

Checks if a dataset exists at the given path.

Parameters
  • path (str, pathlib.Path) – the path which needs to be checked.

  • creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. - If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths. - It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys. - If ‘ENV’ is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying ‘ENV’ will override the credentials fetched from Activeloop and use local ones.

  • token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

Returns

A boolean confirming whether the dataset exists or not at the given path.

Raises

ValueError – If version is specified in the path
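
A minimal sketch (the path is a placeholder): load the dataset if it exists, otherwise create it.

>>> path = "hub://username/dataset"  # placeholder path
>>> ds = deeplake.load(path) if deeplake.exists(path) else deeplake.empty(path)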

deeplake.read(path: Union[str, Path], verify: bool = False, creds: Optional[Dict] = None, compression: Optional[str] = None, storage: Optional[StorageProvider] = None, timeout: Optional[float] = None) Sample

Utility that reads raw data from supported files into Deep Lake format.

  • Recompresses data into format required by the tensor if permitted by the tensor htype.

  • Simply copies the data in the file if file format matches sample_compression of the tensor, thus maximizing upload speeds.

Examples

>>> ds.create_tensor("images", htype="image", sample_compression="jpeg")
>>> ds.images.append(deeplake.read("path/to/cat.jpg"))
>>> ds.images.shape
(1, 399, 640, 3)
>>> ds.create_tensor("videos", htype="video", sample_compression="mp4")
>>> ds.videos.append(deeplake.read("path/to/video.mp4"))
>>> ds.videos.shape
(1, 136, 720, 1080, 3)
>>> ds.create_tensor("images", htype="image", sample_compression="jpeg")
>>> ds.images.append(deeplake.read("https://picsum.photos/200/300"))
>>> ds.images[0].shape
(300, 200, 3)

Supported file types:

Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm"
Audio: "flac", "mp3", "wav"
Video: "mp4", "mkv", "avi"
Dicom: "dcm"
Nifti: "nii", "nii.gz"
Parameters
  • path (str) – Path to a supported file.

  • verify (bool) – If True, contents of the file are verified.

  • creds (optional, Dict) – Credentials for s3, gcp and http urls.

  • compression (optional, str) – Format of the file. Only required if path does not have an extension.

  • storage (optional, StorageProvider) – Storage provider to use to retrieve remote files. Useful if multiple files are being read from the same storage, to minimize the overhead of creating a new provider for each file.

  • timeout (optional, float) – Timeout in seconds for reading the file. Applicable only for http(s) urls.

Returns

Sample object. Call sample.array to get the np.ndarray.

Return type

Sample

Note

No data is actually loaded until you try to get a property of the returned Sample. This is useful for passing along to Tensor.append and Tensor.extend.
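
A small sketch of this lazy behavior, reusing the cat.jpg example above:

>>> sample = deeplake.read("path/to/cat.jpg")  # nothing is decoded yet
>>> arr = sample.array                         # the file is read and decompressed here
>>> arr.shape
(399, 640, 3)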

deeplake.link(path: str, creds_key: Optional[str] = None) LinkedSample

Utility that stores a link to raw data. Used to add data to a Deep Lake Dataset without copying it. See Link htype.

Supported file types:

Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm"
Audio: "flac", "mp3", "wav"
Video: "mp4", "mkv", "avi"
Dicom: "dcm"
Nifti: "nii", "nii.gz"
Parameters
  • path (str) – Path to a supported file.

  • creds_key (optional, str) – The credential key to use to read data for this sample. The actual credentials are fetched from the dataset.

Returns

LinkedSample object that stores path and creds.

Return type

LinkedSample

Examples

>>> ds = deeplake.dataset("test/test_ds")
>>> ds.create_tensor("images", htype="link[image]", sample_compression="jpeg")
>>> ds.images.append(deeplake.link("https://picsum.photos/200/300"))

See more examples here.
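
A sketch of linking cloud-stored files with a credential key; the bucket and key names are placeholders, and add_creds_key / populate_creds are assumed to be available on the dataset for registering the credentials that creds_key refers to:

>>> ds.add_creds_key("my_s3_key")
>>> ds.populate_creds("my_s3_key", {"aws_access_key_id": "...", "aws_secret_access_key": "..."})
>>> ds.images.append(deeplake.link("s3://my_bucket/cat.jpg", creds_key="my_s3_key"))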

deeplake.link_tiled(path_array: np.ndarray, creds_key: Optional[str] = None) LinkedTiledSample

Utility that stores links to multiple images that act as tiles and together form one large image. These images must all have exactly the same dimensions. Used to add data to a Deep Lake Dataset without copying it. See Link htype.

Supported file types:

Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm"
Parameters
  • path_array (np.ndarray) – N dimensional array of paths to the data, with paths corresponding to respective tiles. The array must have dtype=object and have string values. Each string must point to an image file with the same dimensions.

  • creds_key (optional, str) – The credential key to use to read data for this sample. The actual credentials are fetched from the dataset.

Returns

LinkedTiledSample object that stores path_array and creds.

Return type

LinkedTiledSample

Examples

>>> ds = deeplake.dataset("test/test_ds")
>>> ds.create_tensor("images", htype="link[image]", sample_compression="jpeg")
>>> arr = np.empty((10, 10), dtype=object)
>>> for j, i in itertools.product(range(10), range(10)):
...     arr[j, i] = f"s3://my_bucket/my_image_{j}_{i}.jpeg"
...
>>> ds.images.append(deeplake.link_tiled(arr, creds_key="my_s3_key"))
>>> # If all images are 1000x1200x3, we now have a 10000x12000x3 image in our dataset.
deeplake.tiled(sample_shape: Tuple[int, ...], tile_shape: Optional[Tuple[int, ...]] = None, dtype: Union[str, dtype] = dtype('uint8'))

Allocates an empty sample of shape sample_shape, broken into tiles of shape tile_shape (except for edge tiles).

Example

>>> with ds:
...    ds.create_tensor("image", htype="image", sample_compression="png")
...    ds.image.append(deeplake.tiled(sample_shape=(1003, 1103, 3), tile_shape=(10, 10, 3)))
...    ds.image[0][-217:, :212, 1:] = np.random.randint(0, 256, (217, 212, 2), dtype=np.uint8)
Parameters
  • sample_shape (Tuple[int, ...]) – Full shape of the sample.

  • tile_shape (Optional, Tuple[int, ...]) – The sample will be stored as tiles, where each tile will have this shape (except edge tiles). If not specified, it will be computed such that each tile is close to half of the tensor’s max_chunk_size (after compression).

  • dtype (Union[str, np.dtype]) – Dtype for the sample array. Default uint8.

Returns

A PartialSample instance which can be appended to a Tensor.

Return type

PartialSample
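
As a possible follow-up to the example above (same tensor and shapes), regions can be written tile by tile and the assembled sample read back; unwritten regions are expected to stay at their initial zero values:

>>> ds.image[0][:500, :500, :] = np.random.randint(0, 256, (500, 500, 3), dtype=np.uint8)
>>> ds.image[0].numpy().shape
(1003, 1103, 3)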

deeplake.compute(fn, name: Optional[str] = None) Callable[[...], ComputeFunction]

Compute is a decorator for functions.

The function should have at least 2 arguments; the first two correspond to sample_in and samples_out.

There can be as many other arguments as required.

The output should be appended/extended to the second argument in a deeplake-like syntax.

Any value returned by the fn will be ignored.

Example:

@deeplake.compute
def my_fn(sample_in: Any, samples_out, my_arg0, my_arg1=0):
    samples_out.my_tensor.append(my_arg0 * my_arg1)

# This transform can be used via the eval method in one of these 2 ways:

# Directly evaluating the method
# here arg0 and arg1 correspond to the 3rd and 4th arguments of my_fn
my_fn(arg0, arg1).eval(data_in, ds_out, scheduler="threaded", num_workers=5)

# As a part of a Transform pipeline containing other functions
pipeline = deeplake.compose([my_fn(a, b), another_function(x=2)])
pipeline.eval(data_in, ds_out, scheduler="processed", num_workers=2)

The eval method evaluates the pipeline/transform function.

It has the following arguments:

  • data_in: Input passed to the transform to generate the output dataset.

    • It should support __getitem__ and __len__. This can be a Deep Lake dataset.

  • ds_out (Dataset, optional): The dataset object to which the transform output will be written.

    • If this is not provided, data_in will be overwritten if it is a Deep Lake dataset; otherwise an error will be raised.

    • All tensors being generated in the output should already be present in it.

    • Its initial state should be either:

      • Empty, i.e., all tensors have no samples. In this case, all samples are added to the dataset.

      • All tensors are populated and have the same length. In this case, new samples are appended to the dataset.

  • num_workers (int): The number of workers to use for performing the transform.

    • Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.

  • scheduler (str): The scheduler to be used to compute the transformation.

    • Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.

  • progressbar (bool): Displays a progress bar if True (default).

  • skip_ok (bool): If True, skips the check for output tensors generated.

    • This allows the user to skip certain tensors in the function definition.

    • This is especially useful for inplace transformations in which certain tensors are not modified. Defaults to False.

  • check_lengths (bool): If True, checks whether the tensors of ds_out initially have the same length.

  • pad_data_in (bool): If True, pads tensors of data_in to match the length of the largest tensor in data_in. Defaults to False.

  • ignore_errors (bool): If True, input samples that cause the transform to fail will be skipped and the errors will be ignored if possible.

Note

pad_data_in is only applicable if data_in is a Deep Lake dataset.

It raises the following errors:

  • InvalidInputDataError: If the data_in passed to the transform is invalid. It should support __getitem__ and __len__ operations. Using a scheduler other than “threaded” when data_in is a Deep Lake dataset whose base storage is memory will also raise this.

  • InvalidOutputDatasetError: If the tensors of ds_out passed to the transform don’t all have the same length. Using a scheduler other than “threaded” when ds_out is a Deep Lake dataset whose base storage is memory will also raise this.

  • TensorMismatchError: If one or more of the outputs generated during the transform contain different tensors than the ones present in ds_out provided to the transform.

  • UnsupportedSchedulerError: If the scheduler passed is not recognized. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’.

  • TransformError: Raised for any other exception encountered while running the pipeline.
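
A fuller, self-contained sketch of the workflow above; the dataset path and the images tensor are hypothetical, and the transform simply turns each integer input into a small random image:

import numpy as np
import deeplake

@deeplake.compute
def make_image(sample_in, samples_out, scale=1):
    # sample_in is one element of data_in (here, an integer seed);
    # append exactly one output sample per input to the images tensor.
    rng = np.random.default_rng(sample_in)
    img = (scale * rng.integers(0, 256, (64, 64, 3))).astype(np.uint8)
    samples_out.images.append(img)

data_in = list(range(100))                                            # any indexable input works
ds_out = deeplake.dataset("./datasets/transformed", overwrite=True)   # placeholder local path
ds_out.create_tensor("images", htype="image", sample_compression="png")

make_image(scale=1).eval(data_in, ds_out, scheduler="threaded", num_workers=4)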

deeplake.compose(functions: List[ComputeFunction])

Takes a list of functions decorated using deeplake.compute() and creates a pipeline that can be evaluated using .eval.

Example:

pipeline = deeplake.compose([my_fn(a=3), another_function(b=2)])
pipeline.eval(data_in, ds_out, scheduler="processed", num_workers=2)

The eval method evaluates the pipeline/transform function.

It has the following arguments:

  • data_in: Input passed to the transform to generate the output dataset.

    • It should support __getitem__ and __len__. This can be a Deep Lake dataset.

  • ds_out (Dataset, optional): The dataset object to which the transform output will be written.

    • If this is not provided, data_in will be overwritten if it is a Deep Lake dataset; otherwise an error will be raised.

    • All tensors being generated in the output should already be present in it.

    • Its initial state should be either:

      • Empty, i.e., all tensors have no samples. In this case, all samples are added to the dataset.

      • All tensors are populated and have the same length. In this case, new samples are appended to the dataset.

  • num_workers (int): The number of workers to use for performing the transform.

    • Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.

  • scheduler (str): The scheduler to be used to compute the transformation.

    • Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.

  • progressbar (bool): Displays a progress bar if True (default).

  • skip_ok (bool): If True, skips the check for output tensors generated.

    • This allows the user to skip certain tensors in the function definition.

    • This is especially useful for inplace transformations in which certain tensors are not modified. Defaults to False.

  • ignore_errors (bool): If True, input samples that cause the transform to fail will be skipped and the errors will be ignored if possible.

It raises the following errors:

  • InvalidInputDataError: If the data_in passed to the transform is invalid. It should support __getitem__ and __len__ operations. Using a scheduler other than “threaded” when data_in is a Deep Lake dataset whose base storage is memory will also raise this.

  • InvalidOutputDatasetError: If the tensors of ds_out passed to the transform don’t all have the same length. Using a scheduler other than “threaded” when ds_out is a Deep Lake dataset whose base storage is memory will also raise this.

  • TensorMismatchError: If one or more of the outputs generated during the transform contain different tensors than the ones present in ds_out provided to the transform.

  • UnsupportedSchedulerError: If the scheduler passed is not recognized. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’.

  • TransformError: Raised for any other exception encountered while running the pipeline.