deeplake¶
The deeplake package provides a database that stores data as compressed, chunked arrays, which can be kept in any storage location and later streamed to deep learning models.
- deeplake.dataset(path: str | Path, read_only: bool | None = None, overwrite: bool = False, public: bool = False, memory_cache_size: int = 256, local_cache_size: int = 0, creds: str | Dict | None = None, token: str | None = None, verbose: bool = True, access_method: str = 'stream')¶
Returns a Dataset object referencing either a new or existing dataset.
Examples
>>> ds = deeplake.dataset("hub://username/dataset")
>>> ds = deeplake.dataset("s3://mybucket/my_dataset")
>>> ds = deeplake.dataset("./datasets/my_dataset", overwrite=True)
- Parameters:
path (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from the command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required either in the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.
- a memory path of the form mem://path/to/dataset which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
read_only (bool, optional) – Opens dataset in read-only mode if this is passed as True. Defaults to False. Datasets stored on Deep Lake cloud that your account does not have write access to will automatically open in read mode.
overwrite (bool) – If set to True this overwrites the dataset if it already exists. Defaults to False.
public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.
memory_cache_size (int) – The size of the memory cache to be used in MB.
local_cache_size (int) – The size of the local filesystem cache to be used in MB.
creds (dict, optional) –
A dictionary containing credentials used to access the dataset at the path.
If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.
It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
verbose (bool) – If True, logs will be printed. Defaults to True.
access_method (str) –
The access method to use for the dataset. Can be:
- 'stream': Streams the data from the dataset, i.e. only fetches data when required. This is the default value.
- 'download': Downloads the data to the local filesystem to the path specified in environment variable DEEPLAKE_DOWNLOAD_PATH. This will overwrite DEEPLAKE_DOWNLOAD_PATH. Raises an exception if the DEEPLAKE_DOWNLOAD_PATH environment variable is not set or if the dataset does not exist. The 'download' access method can be modified to specify num_workers and/or scheduler. For example: 'download:2:processed' will use 2 workers and the processed scheduler, while 'download:3' will use 3 workers and the default scheduler (threaded), and 'download:processed' will use a single worker and the processed scheduler.
- 'local': Downloads the dataset if it doesn't already exist, otherwise loads from local storage. Raises an exception if the DEEPLAKE_DOWNLOAD_PATH environment variable is not set or the dataset is not found in DEEPLAKE_DOWNLOAD_PATH. The 'local' access method can be modified to specify num_workers and/or scheduler to be used in case the dataset needs to be downloaded. If the dataset needs to be downloaded, 'local:2:processed' will use 2 workers and the processed scheduler, while 'local:3' will use 3 workers and the default scheduler (threaded), and 'local:processed' will use a single worker and the processed scheduler.
- Returns:
Dataset created using the arguments provided.
- Return type:
Dataset
- Raises:
AgreementError – When agreement is rejected
UserNotLoggedInException – When user is not logged in
InvalidTokenException – If the specified token is invalid
TokenPermissionError – When there are permission or other errors related to token
Danger
Setting overwrite to True will delete all of your data if it exists! Be very careful when setting this parameter.
Warning
Setting access_method to 'download' will overwrite the local copy of the dataset if it was previously downloaded.
Note
Any changes made to the dataset in download / local mode will only be made to the local copy and will not be reflected in the original dataset.
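For instance, the num_workers/scheduler modifiers described above can be combined with access_method. A minimal sketch (the path below is illustrative, and DEEPLAKE_DOWNLOAD_PATH must be set):
>>> # download the dataset locally with 2 workers and the processed scheduler
>>> ds = deeplake.dataset("hub://username/dataset", access_method="download:2:processed")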
- deeplake.empty(path: str | Path, overwrite: bool = False, public: bool = False, memory_cache_size: int = 256, local_cache_size: int = 0, creds: dict | None = None, token: str | None = None, verbose: bool = True) Dataset ¶
Creates an empty dataset
- Parameters:
path (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from the command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required either in the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.
- a memory path of the form mem://path/to/dataset which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
overwrite (bool) – If set to True this overwrites the dataset if it already exists. Defaults to False.
public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.
memory_cache_size (int) – The size of the memory cache to be used in MB.
local_cache_size (int) – The size of the local filesystem cache to be used in MB.
creds (dict, optional) –
A dictionary containing credentials used to access the dataset at the path.
If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.
It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
verbose (bool) – If True, logs will be printed. Defaults to True.
- Returns:
Dataset created using the arguments provided.
- Return type:
Dataset
- Raises:
DatasetHandlerError – If a Dataset already exists at the given path and overwrite is False.
UserNotLoggedInException – When user is not logged in
InvalidTokenException – If the specified token is invalid
TokenPermissionError – When there are permission or other errors related to token
Danger
Setting overwrite to True will delete all of your data if it exists! Be very careful when setting this parameter.
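Example (a minimal sketch; the path and tensor name below are illustrative):
>>> ds = deeplake.empty("./my_new_dataset")
>>> ds.create_tensor("images", htype="image", sample_compression="jpeg")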
- deeplake.like(dest: str | Path, src: str | Dataset | Path, tensors: List[str] | None = None, overwrite: bool = False, creds: dict | None = None, token: str | None = None, public: bool = False) Dataset ¶
Creates a new dataset by copying the src dataset's structure to a new location. No samples are copied, only the meta/info for the dataset and its tensors.
- Parameters:
dest – Empty Dataset or Path where the new dataset will be created.
src (Union[str, Dataset]) – Path or dataset object that will be used as the template for the new dataset.
tensors (List[str], optional) – Names of tensors (and groups) to be replicated. If not specified all tensors in source dataset are considered.
overwrite (bool) – If True and a dataset exists at destination, it will be overwritten. Defaults to False.
creds (dict, optional) –
A dictionary containing credentials used to access the dataset at the path.
If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.
It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.
- Returns:
New dataset object.
- Return type:
Dataset
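Example (a minimal sketch; both paths are illustrative):
>>> new_ds = deeplake.like("./structure_only_copy", "hub://username/source_dataset")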
- deeplake.ingest(src: str | Path, dest: str | Path, images_compression: str = 'auto', dest_creds: Dict | None = None, progressbar: bool = True, summary: bool = True, **dataset_kwargs) Dataset ¶
Ingests a dataset from a source and stores it as a structured dataset to destination.
- Parameters:
src (str, pathlib.Path) – Local path to where the unstructured dataset is stored or path to csv file.
dest (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from the command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required either in the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.
- a memory path of the form mem://path/to/dataset which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
images_compression (str) – For image classification datasets, this compression will be used for the images tensor. If images_compression is "auto", compression will be automatically determined by the most common extension in the directory.
dest_creds (Optional[Dict]) – A dictionary containing credentials used to access the destination path of the dataset.
progressbar (bool) – Enables or disables ingestion progress bar. Defaults to True.
summary (bool) – If True, a summary of skipped files will be printed after completion. Defaults to True.
**dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function.
- Returns:
New dataset object with structured dataset.
- Return type:
Dataset
- Raises:
InvalidPathException – If the source directory does not exist.
SamePathException – If the source and destination path are same.
AutoCompressionError – If the source directory is empty or does not contain a valid extension.
InvalidFileExtension – If the most frequent file extension is found to be ‘None’ during auto-compression.
Note
Currently only local source paths and image classification datasets / csv files are supported for automatic ingestion.
Supported filetypes: png/jpeg/jpg/csv.
All files and sub-directories with unsupported filetypes are ignored.
Valid source directory structures for image classification look like:
data/
    img0.jpg
    img1.jpg
    ...
or:
data/
    class0/
        cat0.jpg
        ...
    class1/
        dog0.jpg
        ...
    ...
or:
data/
    train/
        class0/
            img0.jpg
            ...
        ...
    val/
        class0/
            img0.jpg
            ...
        ...
    ...
Classes defined as sub-directories can be accessed at ds["test/labels"].info.class_names.
Support for train and test sub-directories is present under ds["train/images"], ds["train/labels"] and ds["test/images"], ds["test/labels"].
Mapping filenames to classes from an external file is currently not supported.
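Example (a minimal sketch; the source directory and destination path are illustrative):
>>> ds = deeplake.ingest("./data", "hub://username/animals_classification", images_compression="auto")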
- deeplake.ingest_kaggle(tag: str, src: str | Path, dest: str | Path, exist_ok: bool = False, images_compression: str = 'auto', dest_creds: Dict | None = None, kaggle_credentials: dict | None = None, progressbar: bool = True, summary: bool = True, **dataset_kwargs) Dataset ¶
Download and ingest a kaggle dataset and store it as a structured dataset to destination.
- Parameters:
tag (str) – Kaggle dataset tag. Example: "coloradokb/dandelionimages" points to https://www.kaggle.com/coloradokb/dandelionimages
src (str, pathlib.Path) – Local path to where the raw kaggle dataset will be downloaded to.
dest (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from the command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required either in the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.
- a memory path of the form mem://path/to/dataset which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
exist_ok (bool) – If the kaggle dataset was already downloaded and exist_ok is True, ingestion will proceed without error.
images_compression (str) – For image classification datasets, this compression will be used for the images tensor. If images_compression is "auto", compression will be automatically determined by the most common extension in the directory.
dest_creds (Optional[Dict]) – A dictionary containing credentials used to access the destination path of the dataset.
kaggle_credentials (dict) – A dictionary containing kaggle credentials {"username": "YOUR_USERNAME", "key": "YOUR_KEY"}. If None, environment variables/the kaggle.json file will be used if available.
progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.
summary (bool) – Generates ingestion summary. Set to True by default.
**dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.dataset().
- Returns:
New dataset object with structured dataset.
- Return type:
Dataset
- Raises:
SamePathException – If the source and destination path are same.
Note
Currently only local source paths and image classification datasets are supported for automatic ingestion.
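Example (a minimal sketch using the tag from above; the src and dest paths are illustrative):
>>> ds = deeplake.ingest_kaggle("coloradokb/dandelionimages", src="./kaggle_downloads", dest="./dandelion_ds", exist_ok=True)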
- deeplake.ingest_dataframe(src, dest: str | Path | Dataset, dest_creds: Dict | None = None, progressbar: bool = True, **dataset_kwargs)¶
Converts a pandas dataframe to a Deep Lake Dataset.
- Parameters:
src (pd.DataFrame) – The pandas dataframe to be converted.
dest (str, pathlib.Path, Dataset) –
A Dataset or the full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from the command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required either in the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.
- a memory path of the form mem://path/to/dataset which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
dest_creds (Optional[Dict]) – A dictionary containing credentials used to access the destination path of the dataset.
progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.
**dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.dataset().
- Returns:
New dataset created from the dataframe.
- Return type:
Dataset
- Raises:
Exception – If src is not a valid pandas dataframe object.
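Example (a minimal sketch; the dataframe contents and destination path are illustrative):
>>> import pandas as pd
>>> df = pd.DataFrame({"label": [0, 1, 1], "score": [0.1, 0.9, 0.7]})
>>> ds = deeplake.ingest_dataframe(df, "./df_dataset")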
- deeplake.ingest_huggingface(src, dest, use_progressbar=True) Dataset ¶
Converts Hugging Face datasets to Deep Lake format.
- Parameters:
src (hfDataset, DatasetDict) – Hugging Face Dataset or DatasetDict to be converted. Data in different splits of a DatasetDict will be stored under respective tensor groups.
dest (Dataset, str, pathlib.Path) – Destination dataset or path to it.
use_progressbar (bool) – Defines if progress bar should be used to show conversion progress.
- Returns:
The destination Deep Lake dataset.
- Return type:
Dataset
Note
If a DatasetDict looks like:
>>> {
...     train: Dataset({
...         features: ['data']
...     }),
...     validation: Dataset({
...         features: ['data']
...     }),
...     test: Dataset({
...         features: ['data']
...     }),
... }
it will be converted to a Deep Lake Dataset with tensors ['train/data', 'validation/data', 'test/data'].
Features of the type Sequence(feature=Value(dtype='string')) are not supported. Columns of such type are skipped.
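Example (a minimal sketch; assumes the Hugging Face datasets package is installed and the dataset name is illustrative):
>>> from datasets import load_dataset
>>> hf_ds = load_dataset("beans", split="train")
>>> ds = deeplake.ingest_huggingface(hf_ds, "./beans_deeplake")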
- deeplake.load(path: str | Path, read_only: bool | None = None, memory_cache_size: int = 256, local_cache_size: int = 0, creds: dict | None = None, token: str | None = None, verbose: bool = True, access_method: str = 'stream') Dataset ¶
Loads an existing dataset
- Parameters:
path (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from the command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required either in the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset or ~/path/to/dataset or path/to/dataset.
- a memory path of the form mem://path/to/dataset which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
read_only (bool, optional) – Opens dataset in read-only mode if this is passed as True. Defaults to False. Datasets stored on Deep Lake cloud that your account does not have write access to will automatically open in read mode.
memory_cache_size (int) – The size of the memory cache to be used in MB.
local_cache_size (int) – The size of the local filesystem cache to be used in MB.
creds (dict, optional) –
A dictionary containing credentials used to access the dataset at the path.
If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.
It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
verbose (bool) – If True, logs will be printed. Defaults to True.
access_method (str) –
The access method to use for the dataset. Can be:
- 'stream': Streams the data from the dataset, i.e. only fetches data when required. This is the default value.
- 'download': Downloads the data to the local filesystem to the path specified in environment variable DEEPLAKE_DOWNLOAD_PATH. This will overwrite DEEPLAKE_DOWNLOAD_PATH. Raises an exception if the DEEPLAKE_DOWNLOAD_PATH environment variable is not set or if the dataset does not exist. The 'download' access method can be modified to specify num_workers and/or scheduler. For example: 'download:2:processed' will use 2 workers and the processed scheduler, while 'download:3' will use 3 workers and the default scheduler (threaded), and 'download:processed' will use a single worker and the processed scheduler.
- 'local': Downloads the dataset if it doesn't already exist, otherwise loads from local storage. Raises an exception if the DEEPLAKE_DOWNLOAD_PATH environment variable is not set or the dataset is not found in DEEPLAKE_DOWNLOAD_PATH. The 'local' access method can be modified to specify num_workers and/or scheduler to be used in case the dataset needs to be downloaded. If the dataset needs to be downloaded, 'local:2:processed' will use 2 workers and the processed scheduler, while 'local:3' will use 3 workers and the default scheduler (threaded), and 'local:processed' will use a single worker and the processed scheduler.
- Returns:
Dataset loaded using the arguments provided.
- Return type:
Dataset
- Raises:
DatasetHandlerError – If a Dataset does not exist at the given path.
AgreementError – When agreement is rejected
UserNotLoggedInException – When user is not logged in
InvalidTokenException – If the specified token is invalid
TokenPermissionError – When there are permission or other errors related to token
Warning
Setting access_method to 'download' will overwrite the local copy of the dataset if it was previously downloaded.
Note
Any changes made to the dataset in download / local mode will only be made to the local copy and will not be reflected in the original dataset.
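Example (a minimal sketch; the path is illustrative):
>>> ds = deeplake.load("hub://username/existing_dataset", read_only=True)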
- deeplake.delete(path: str | Path, force: bool = False, large_ok: bool = False, creds: dict | None = None, token: str | None = None, verbose: bool = False) None ¶
Deletes a dataset at a given path.
- Parameters:
path (str, pathlib.Path) – The path to the dataset to be deleted.
force (bool) – Delete data regardless of whether it looks like a deeplake dataset. All data at the path will be removed if set to True.
large_ok (bool) – Delete datasets larger than 1GB. Disabled by default.
creds (dict, optional) –
A dictionary containing credentials used to access the dataset at the path.
If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.
It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
verbose (bool) – If True, logs will be printed. Defaults to False.
- Raises:
DatasetHandlerError – If a Dataset does not exist at the given path and force = False.
NotImplementedError – When attempting to delete a managed view.
Warning
This is an irreversible operation. Data once deleted cannot be recovered.
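Example (a minimal sketch; the path is illustrative and the deletion is permanent):
>>> deeplake.delete("./obsolete_dataset", large_ok=True)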
- deeplake.rename(old_path: str | Path, new_path: str | Path, creds: dict | None = None, token: str | None = None) Dataset ¶
Renames dataset at old_path to new_path.
Examples
>>> deeplake.rename("hub://username/image_ds", "hub://username/new_ds")
>>> deeplake.rename("s3://mybucket/my_ds", "s3://mybucket/renamed_ds")
- Parameters:
old_path (str, pathlib.Path) – The path to the dataset to be renamed.
new_path (str, pathlib.Path) – Path to the dataset after renaming.
creds (dict, optional) –
A dictionary containing credentials used to access the dataset at the path.
This takes precedence over credentials present in the environment. Currently only works with s3 paths.
It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’ and ‘aws_region’ as keys.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
- Returns:
The renamed Dataset.
- Return type:
Dataset
- Raises:
DatasetHandlerError – If a Dataset does not exist at the given path or if new path is to a different directory.
- deeplake.copy(src: str | Path | Dataset, dest: str | Path, tensors: List[str] | None = None, overwrite: bool = False, src_creds=None, src_token=None, dest_creds=None, dest_token=None, num_workers: int = 0, scheduler='threaded', progressbar=True)¶
Copies dataset at src to dest. Version control history is not included.
- Parameters:
src (Union[str, Dataset, pathlib.Path]) – The Dataset or the path to the dataset to be copied.
dest (str, pathlib.Path) – Destination path to copy to.
tensors (List[str], optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.
overwrite (bool) – If True and a dataset exists at dest, it will be overwritten. Defaults to False.
src_creds (dict, optional) –
A dictionary containing credentials used to access the dataset at the path.
If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths.
It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys.
src_token (str, optional) – Activeloop token, used for fetching credentials to the dataset at src if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
dest_creds (dict, optional) – Creds required to create / overwrite datasets at dest.
dest_token (str, optional) – Token used for fetching credentials to dest.
num_workers (int) – The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
scheduler (str) – The scheduler to be used for copying. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.
progressbar (bool) – Displays a progress bar if True (default).
- Returns:
New dataset object.
- Return type:
Dataset
- Raises:
DatasetHandlerError – If a dataset already exists at destination path and overwrite is False.
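Example (a minimal sketch; both paths are illustrative):
>>> new_ds = deeplake.copy("hub://username/source_dataset", "./local_copy", num_workers=2)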
- deeplake.deepcopy(src: str | Path, dest: str | Path, tensors: List[str] | None = None, overwrite: bool = False, src_creds=None, src_token=None, dest_creds=None, dest_token=None, num_workers: int = 0, scheduler='threaded', progressbar=True, public: bool = False, verbose: bool = True)¶
Copies dataset at src to dest including version control history.
- Parameters:
src (str, pathlib.Path) – Path to the dataset to be copied.
dest (str, pathlib.Path) – Destination path to copy to.
tensors (List[str], optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.
overwrite (bool) – If True and a dataset exists at destination, it will be overwritten. Defaults to False.
src_creds (dict, optional) –
A dictionary containing credentials used to access the dataset at the path.
If ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’ are present, these take precedence over credentials present in the environment or in credentials file. Currently only works with s3 paths.
It supports ‘aws_access_key_id’, ‘aws_secret_access_key’, ‘aws_session_token’, ‘endpoint_url’, ‘aws_region’, ‘profile_name’ as keys.
src_token (str, optional) – Activeloop token, used for fetching credentials to the dataset at src if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
dest_creds (dict, optional) – Creds required to create / overwrite datasets at dest.
dest_token (str, optional) – Token used for fetching credentials to dest.
num_workers (int) – The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
scheduler (str) – The scheduler to be used for copying. Supported values include: 'serial', 'threaded', 'processed' and 'ray'. Defaults to 'threaded'.
progressbar (bool) – Displays a progress bar if True (default).
public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.
verbose (bool) – If True, logs will be printed. Defaults to True.
- Returns:
New dataset object.
- Return type:
Dataset
- Raises:
DatasetHandlerError – If a dataset already exists at destination path and overwrite is False.
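Example (a minimal sketch; both paths are illustrative):
>>> new_ds = deeplake.deepcopy("./source_dataset", "./copy_with_history", overwrite=True)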
- deeplake.connect(src_path: str, creds_key: str, dest_path: str | None = None, org_id: str | None = None, ds_name: str | None = None, token: str | None = None) Dataset ¶
Connects dataset at src_path to Deep Lake via the provided path.
Examples
>>> # Connect an s3 dataset
>>> ds = deeplake.connect(src_path="s3://bucket/dataset", dest_path="hub://my_org/dataset", creds_key="my_managed_credentials_key")
>>> # or
>>> ds = deeplake.connect(src_path="s3://bucket/dataset", org_id="my_org", creds_key="my_managed_credentials_key")
- Parameters:
src_path (str) – Cloud path to the source dataset. Can be: an s3 path like s3://bucket/path/to/dataset or a gcs path like gcs://bucket/path/to/dataset.
creds_key (str) – The managed credentials to be used for accessing the source path.
dest_path (str, optional) – The full path to where the connected Deep Lake dataset will reside. Can be: a Deep Lake path like hub://organization/dataset.
org_id (str, optional) – The organization to which the connected Deep Lake dataset will be added.
ds_name (str, optional) – The name of the connected Deep Lake dataset. Will be inferred from dest_path or src_path if not provided.
token (str, optional) – Activeloop token used to fetch the managed credentials.
- Returns:
The connected Deep Lake dataset.
- Return type:
Dataset
- Raises:
InvalidSourcePathError – If the src_path is not a valid s3 or gcs path.
InvalidDestinationPathError – If dest_path, or org_id and ds_name, do not form a valid Deep Lake path.
- deeplake.list(workspace: str = '', token: str | None = None) None ¶
List all available Deep Lake cloud datasets.
- Parameters:
workspace (str) – Specify user/organization name. If not given, returns a list of all datasets that can be accessed, regardless of what workspace they are in. Otherwise, lists all datasets in the given workspace.
token (str, optional) – Activeloop token, used for fetching credentials for Deep Lake datasets. This is optional, tokens are normally autogenerated.
- Returns:
List of dataset names.
- Return type:
List
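Example (a minimal sketch; the workspace name is illustrative):
>>> deeplake.list("my_workspace")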
- deeplake.exists(path: str | Path, creds: dict | None = None, token: str | None = None) bool ¶
Checks if a dataset exists at the given path.
- Parameters:
path (str, pathlib.Path) – the path which needs to be checked.
creds (dict, optional) – A dictionary containing credentials used to access the dataset at the path.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
- Returns:
A boolean confirming whether the dataset exists or not at the given path.
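Example (a minimal sketch; the path is illustrative):
>>> deeplake.exists("hub://username/some_dataset")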
- deeplake.read(path: str | Path, verify: bool = False, creds: Dict | None = None, compression: str | None = None, storage: StorageProvider | None = None) Sample ¶
Utility that reads raw data from supported files into Deep Lake format.
Recompresses data into format required by the tensor if permitted by the tensor htype.
Simply copies the data in the file if file format matches sample_compression of the tensor, thus maximizing upload speeds.
Examples
>>> ds.create_tensor("images", htype="image", sample_compression="jpeg") >>> ds.images.append(deeplake.read("path/to/cat.jpg")) >>> ds.images.shape (1, 399, 640, 3)
>>> ds.create_tensor("videos", htype="video", sample_compression="mp4") >>> ds.videos.append(deeplake.read("path/to/video.mp4")) >>> ds.videos.shape (1, 136, 720, 1080, 3)
>>> ds.create_tensor("images", htype="image", sample_compression="jpeg") >>> ds.images.append(deeplake.read("https://picsum.photos/200/300")) >>> ds.images[0].shape (300, 200, 3)
Supported file types:
Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm" Audio: "flac", "mp3", "wav" Video: "mp4", "mkv", "avi" Dicom: "dcm"
- Parameters:
path (str) – Path to a supported file.
verify (bool) – If True, contents of the file are verified.
creds (optional, Dict) – Credentials for s3, gcp and http urls.
compression (optional, str) – Format of the file. Only required if path does not have an extension.
storage (optional, StorageProvider) – Storage provider to use to retrieve remote files. Useful if multiple files are being read from same storage to minimize overhead of creating a new provider.
- Returns:
Sample object. Call sample.array to get the np.ndarray.
- Return type:
Sample
Note
No data is actually loaded until you try to get a property of the returned Sample. This is useful for passing along to Tensor.append and Tensor.extend.
- deeplake.link(path: str, creds_key: str | None = None) LinkedSample ¶
Utility that stores a link to raw data. Used to add data to a Deep Lake Dataset without copying it. See Link htype.
Supported file types:
Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm" Audio: "flac", "mp3", "wav" Video: "mp4", "mkv", "avi" Dicom: "dcm"
- Parameters:
path (str) – Path to a supported file.
creds_key (optional, str) – The credential key to use to read data for this sample. The actual credentials are fetched from the dataset.
- Returns:
LinkedSample object that stores path and creds.
- Return type:
LinkedSample
Examples
>>> ds = deeplake.dataset("test/test_ds") >>> ds.create_tensor("images", htype="link[image]") >>> ds.images.append(deeplake.link("https://picsum.photos/200/300"))
See more examples here.
- deeplake.tiled(sample_shape: Tuple[int, ...], tile_shape: Tuple[int, ...] | None = None, dtype: str | dtype = dtype('uint8'))¶
Allocates an empty sample of shape sample_shape, broken into tiles of shape tile_shape (except for edge tiles).
Example
>>> with ds:
...     ds.create_tensor("image", htype="image", sample_compression="png")
...     ds.image.append(deeplake.tiled(sample_shape=(1003, 1103, 3), tile_shape=(10, 10, 3)))
...     ds.image[0][-217:, :212, 1:] = np.random.randint(0, 256, (217, 212, 2), dtype=np.uint8)
- Parameters:
sample_shape (Tuple[int, ...]) – Full shape of the sample.
tile_shape (Optional, Tuple[int, ...]) – The sample will be stored as tiles where each tile will have this shape (except edge tiles). If not specified, it will be computed such that each tile is close to half of the tensor's max_chunk_size (after compression).
dtype (Union[str, np.dtype]) – Dtype for the sample array. Default uint8.
- Returns:
A PartialSample instance which can be appended to a Tensor.
- Return type:
PartialSample
- deeplake.compute(fn, name: str | None = None) Callable[[...], ComputeFunction] ¶
Compute is a decorator for functions.
The functions should have at least 2 arguments; the first two will correspond to sample_in and samples_out.
There can be as many other arguments as required.
The output should be appended/extended to the second argument in a deeplake-like syntax.
Any value returned by the fn will be ignored.
Example:
@deeplake.compute
def my_fn(sample_in: Any, samples_out, my_arg0, my_arg1=0):
    samples_out.my_tensor.append(my_arg0 * my_arg1)

# This transform can be used using the eval method in one of these 2 ways:

# Directly evaluating the method
# here arg0 and arg1 correspond to the 3rd and 4th argument in my_fn
my_fn(arg0, arg1).eval(data_in, ds_out, scheduler="threaded", num_workers=5)

# As a part of a Transform pipeline containing other functions
pipeline = deeplake.compose([my_fn(a, b), another_function(x=2)])
pipeline.eval(data_in, ds_out, scheduler="processed", num_workers=2)
The eval method evaluates the pipeline/transform function. It has the following arguments:
- data_in: Input passed to the transform to generate the output dataset. It should support __getitem__ and __len__. This can be a Deep Lake dataset.
- ds_out (Dataset, optional): The dataset object to which the transform will get written. If this is not provided, data_in will be overwritten if it is a Deep Lake dataset, otherwise an error will be raised. It should have all keys being generated in the output already present as tensors. Its initial state should be either:
  - Empty, i.e. all tensors have no samples. In this case all samples are added to the dataset.
  - All tensors are populated and have the same length. In this case new samples are appended to the dataset.
- num_workers (int): The number of workers to use for performing the transform. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
- scheduler (str): The scheduler to be used to compute the transformation. Supported values include: 'serial', 'threaded', 'processed' and 'ray'. Defaults to 'threaded'.
- progressbar (bool): Displays a progress bar if True (default).
- skip_ok (bool): If True, skips the check for output tensors generated. This allows the user to skip certain tensors in the function definition. This is especially useful for inplace transformations in which certain tensors are not modified. Defaults to False.
It raises the following errors:
- InvalidInputDataError: If data_in passed to transform is invalid. It should support __getitem__ and __len__ operations. Using a scheduler other than "threaded" with a deeplake dataset having base storage as memory as data_in will also raise this.
- InvalidOutputDatasetError: If all the tensors of ds_out passed to transform don't have the same length. Using a scheduler other than "threaded" with a deeplake dataset having base storage as memory as ds_out will also raise this.
- TensorMismatchError: If one or more of the outputs generated during transform contain different tensors than the ones present in 'ds_out' provided to transform.
- UnsupportedSchedulerError: If the scheduler passed is not recognized. Supported values include: 'serial', 'threaded', 'processed' and 'ray'.
- TransformError: All other exceptions raised if there are problems while running the pipeline.
- deeplake.compose(functions: List[ComputeFunction])¶
Takes a list of functions decorated using deeplake.compute() and creates a pipeline that can be evaluated using .eval
Example:
pipeline = deeplake.compose([my_fn(a=3), another_function(b=2)])
pipeline.eval(data_in, ds_out, scheduler="processed", num_workers=2)
The eval method evaluates the pipeline/transform function. It has the following arguments:
- data_in: Input passed to the transform to generate the output dataset. It should support __getitem__ and __len__. This can be a Deep Lake dataset.
- ds_out (Dataset, optional): The dataset object to which the transform will get written. If this is not provided, data_in will be overwritten if it is a Deep Lake dataset, otherwise an error will be raised. It should have all keys being generated in the output already present as tensors. Its initial state should be either:
  - Empty, i.e. all tensors have no samples. In this case all samples are added to the dataset.
  - All tensors are populated and have the same length. In this case new samples are appended to the dataset.
- num_workers (int): The number of workers to use for performing the transform. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
- scheduler (str): The scheduler to be used to compute the transformation. Supported values include: 'serial', 'threaded', 'processed' and 'ray'. Defaults to 'threaded'.
- progressbar (bool): Displays a progress bar if True (default).
- skip_ok (bool): If True, skips the check for output tensors generated. This allows the user to skip certain tensors in the function definition. This is especially useful for inplace transformations in which certain tensors are not modified. Defaults to False.
It raises the following errors:
- InvalidInputDataError: If data_in passed to transform is invalid. It should support __getitem__ and __len__ operations. Using a scheduler other than "threaded" with a deeplake dataset having base storage as memory as data_in will also raise this.
- InvalidOutputDatasetError: If all the tensors of ds_out passed to transform don't have the same length. Using a scheduler other than "threaded" with a deeplake dataset having base storage as memory as ds_out will also raise this.
- TensorMismatchError: If one or more of the outputs generated during transform contain different tensors than the ones present in 'ds_out' provided to transform.
- UnsupportedSchedulerError: If the scheduler passed is not recognized. Supported values include: 'serial', 'threaded', 'processed' and 'ray'.
- TransformError: All other exceptions raised if there are problems while running the pipeline.