deeplake
The deeplake package provides a database which stores data as compressed chunked arrays that can be stored anywhere and later streamed to deep learning models.
- deeplake.dataset(path: Union[str, Path], runtime: Optional[Dict] = None, read_only: Optional[bool] = None, overwrite: bool = False, public: bool = False, memory_cache_size: int = 2000, local_cache_size: int = 0, creds: Optional[Union[str, Dict]] = None, token: Optional[str] = None, org_id: Optional[str] = None, verbose: bool = True, access_method: str = 'stream', unlink: bool = False, reset: bool = False, check_integrity: bool = True, lock_enabled: Optional[bool] = True, lock_timeout: Optional[int] = 0)
Returns a Dataset object referencing either a new or existing dataset.
Examples
>>> ds = deeplake.dataset("hub://username/dataset") >>> ds = deeplake.dataset("s3://mybucket/my_dataset") >>> ds = deeplake.dataset("./datasets/my_dataset", overwrite=True)
Loading a specific version:
>>> ds = deeplake.dataset("hub://username/dataset@new_branch") >>> ds = deeplake.dataset("hub://username/dataset@3e49cded62b6b335c74ff07e97f8451a37aca7b2)
>>> my_commit_id = "3e49cded62b6b335c74ff07e97f8451a37aca7b2"
>>> ds = deeplake.dataset(f"hub://username/dataset@{my_commit_id}")
- Parameters
path (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset, ~/path/to/dataset, or path/to/dataset.
- a memory path of the form mem://path/to/dataset, which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
Loading a specific version: You can also specify a commit_id or branch to load the dataset at that version directly by using the @ symbol. The path will then be of the form hub://username/dataset@{branch} or hub://username/dataset@{commit_id}. See examples above.
runtime (dict) – Parameters for Activeloop DB Engine. Only applicable for hub:// paths.
read_only (bool, optional) – Opens dataset in read only mode if this is passed as True. Defaults to False. Datasets stored on Deep Lake cloud that your account does not have write access to will automatically open in read mode.
overwrite (bool) – If set to True this overwrites the dataset if it already exists. Defaults to False.
public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.
memory_cache_size (int) – The size of the memory cache to be used in MB.
local_cache_size (int) – The size of the local filesystem cache to be used in MB.
creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths. It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys. If 'ENV' is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying 'ENV' will override the credentials fetched from Activeloop and use local ones.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
org_id (str, Optional) – Organization id to be used for enabling enterprise features. Only applicable for local datasets.
verbose (bool) – If True, logs will be printed. Defaults to True.
access_method (str) –
The access method to use for the dataset. Can be:
'stream' – Streams the data from the dataset, i.e. only fetches data when required. This is the default value.
'download' – Downloads the data to the local filesystem at the path specified in the environment variable DEEPLAKE_DOWNLOAD_PATH. This will overwrite DEEPLAKE_DOWNLOAD_PATH. Raises an exception if the DEEPLAKE_DOWNLOAD_PATH environment variable is not set or if the dataset does not exist. The 'download' access method can be modified to specify num_workers and/or scheduler. For example: 'download:2:processed' will use 2 workers and the processed scheduler, 'download:3' will use 3 workers and the default scheduler (threaded), and 'download:processed' will use a single worker and the processed scheduler.
'local' – Downloads the dataset if it doesn't already exist, otherwise loads from local storage. Raises an exception if the DEEPLAKE_DOWNLOAD_PATH environment variable is not set. The 'local' access method can be modified to specify num_workers and/or scheduler to be used in case the dataset needs to be downloaded. If the dataset needs to be downloaded, 'local:2:processed' will use 2 workers and the processed scheduler, 'local:3' will use 3 workers and the default scheduler (threaded), and 'local:processed' will use a single worker and the processed scheduler. A usage sketch of these access methods is shown after this entry.
unlink (bool) – Downloads linked samples if set to True. Only applicable if access_method is 'download' or 'local'. Defaults to False.
reset (bool) – If the specified dataset cannot be loaded due to a corrupted HEAD state of the branch being loaded, setting reset=True will reset HEAD changes and load the previous version.
check_integrity (bool) – If True, an integrity check is performed during dataset loading; otherwise the check is skipped.
lock_timeout (int) – Number of seconds to wait before throwing a LockException. If None, wait indefinitely.
lock_enabled (bool) – If True, the dataset manages a write lock. NOTE: Only set to False if you are managing concurrent access externally.
- Returns
Dataset created using the arguments provided.
- Return type
- Raises
AgreementError – When agreement is rejected
UserNotLoggedInException – When user is not logged in
InvalidTokenException – If the specified token is invalid
TokenPermissionError – When there are permission or other errors related to token
CheckoutError – If version address specified in the path cannot be found
DatasetCorruptError – If loading the dataset failed due to corruption and reset is not True
ValueError – If a version is specified in the path when creating a dataset, or if the org id is provided but the dataset is not local
ReadOnlyModeError – If reset is attempted in read-only mode
LockedException – When attempting to open a dataset for writing when it is locked by another machine
Exception – Re-raises caught exception if reset cannot fix the issue
Danger
Setting overwrite to True will delete all of your data if it exists! Be very careful when setting this parameter.
Warning
Setting access_method to 'download' will overwrite the local copy of the dataset if it was previously downloaded.
Note
Any changes made to the dataset in download / local mode will only be made to the local copy and will not be reflected in the original dataset.
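Below is a brief sketch of the access_method values and modifiers described above; the dataset path is a placeholder, and the 'download' and 'local' variants assume the DEEPLAKE_DOWNLOAD_PATH environment variable is set.
>>> import deeplake
>>> # Default: stream chunks on demand over the network
>>> ds = deeplake.dataset("hub://username/dataset", access_method="stream")
>>> # Download the data first, using 2 workers and the processed scheduler
>>> ds = deeplake.dataset("hub://username/dataset", access_method="download:2:processed")
>>> # Reuse a previously downloaded copy, downloading only if it is missing
>>> ds = deeplake.dataset("hub://username/dataset", access_method="local")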
- deeplake.empty(path: Union[str, Path], runtime: Optional[dict] = None, overwrite: bool = False, public: bool = False, memory_cache_size: int = 2000, local_cache_size: int = 0, creds: Optional[Union[str, Dict]] = None, token: Optional[str] = None, org_id: Optional[str] = None, lock_enabled: Optional[bool] = True, lock_timeout: Optional[int] = 0, verbose: bool = True) Dataset
Creates an empty dataset
- Parameters
path (str, pathlib.Path) –
The full path to the dataset. It can be:
- a Deep Lake cloud path of the form hub://org_id/dataset_name. Requires registration with Deep Lake.
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset, ~/path/to/dataset, or path/to/dataset.
- a memory path of the form mem://path/to/dataset, which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
runtime (dict) – Parameters for creating a dataset in the Deep Lake Tensor Database. Only applicable for paths of the form hub://org_id/dataset_name, and runtime must be {"tensor_db": True}.
overwrite (bool) – If set to True this overwrites the dataset if it already exists. Defaults to False.
public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.
memory_cache_size (int) – The size of the memory cache to be used in MB.
local_cache_size (int) – The size of the local filesystem cache to be used in MB.
creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths. It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys. If 'ENV' is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying 'ENV' will override the credentials fetched from Activeloop and use local ones.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
org_id (str, Optional) – Organization id to be used for enabling enterprise features. Only applicable for local datasets.
verbose (bool) – If True, logs will be printed. Defaults to True.
lock_timeout (int) – Number of seconds to wait before throwing a LockException. If None, wait indefinitely
lock_enabled (bool) – If true, the dataset manages a write lock. NOTE: Only set to False if you are managing concurrent access externally.
- Returns
Dataset created using the arguments provided.
- Return type
- Raises
DatasetHandlerError – If a Dataset already exists at the given path and overwrite is False.
UserNotLoggedInException – When user is not logged in
InvalidTokenException – If the specified token is invalid
TokenPermissionError – When there are permission or other errors related to token
ValueError – If version is specified in the path
Danger
Setting overwrite to True will delete all of your data if it exists! Be very careful when setting this parameter.
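A minimal sketch of creating an empty dataset and populating a tensor; the local path, tensor name, and random image data are placeholders, not part of the API above.
>>> import numpy as np
>>> import deeplake
>>> ds = deeplake.empty("./datasets/new_ds", overwrite=True)  # overwrite deletes any existing data here
>>> with ds:
...     ds.create_tensor("images", htype="image", sample_compression="jpeg")
...     ds.images.extend(np.random.randint(0, 256, (4, 64, 64, 3), dtype=np.uint8))
>>> len(ds)
4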
- deeplake.like(dest: Union[str, Path], src: Union[str, Dataset, Path], runtime: Optional[Dict] = None, tensors: Optional[List[str]] = None, overwrite: bool = False, creds: Optional[Union[dict, str]] = None, token: Optional[str] = None, org_id: Optional[str] = None, public: bool = False, verbose: bool = True) Dataset
Creates a new dataset by copying the source dataset's structure to a new location. No samples are copied, only the meta/info for the dataset and its tensors.
- Parameters
dest – Empty Dataset or Path where the new dataset will be created.
src (Union[str, Dataset]) – Path or dataset object that will be used as the template for the new dataset.
runtime (dict) – Parameters for Activeloop DB Engine. Only applicable for hub:// paths.
tensors (List[str], optional) – Names of tensors (and groups) to be replicated. If not specified all tensors in source dataset are considered.
overwrite (bool) – If True and a dataset exists at destination, it will be overwritten. Defaults to False.
creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths. It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys. If 'ENV' is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying 'ENV' will override the credentials fetched from Activeloop and use local ones.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
org_id (str, Optional) – Organization id to be used for enabling enterprise features. Only applicable for local datasets.
public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.
verbose (bool) – If True, logs will be printed. Defaults to True.
- Returns
New dataset object.
- Return type
- Raises
ValueError – If org_id is specified for a non-local dataset.
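A minimal sketch, assuming an existing source dataset at a placeholder path; only the structure is replicated, so the new dataset starts empty.
>>> import deeplake
>>> # Copy tensor definitions and meta/info from the source into a new, empty dataset
>>> ds = deeplake.like("./datasets/structured_copy", "hub://username/source_dataset")
>>> len(ds)  # no samples are copied
0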
- deeplake.ingest_classification(src: Union[str, Path], dest: Union[str, Path], image_params: Optional[Dict] = None, label_params: Optional[Dict] = None, dest_creds: Optional[Union[str, Dict]] = None, progressbar: bool = True, summary: bool = True, num_workers: int = 0, shuffle: bool = True, token: Optional[str] = None, connect_kwargs: Optional[Dict] = None, **dataset_kwargs) Dataset
Ingest a dataset of images from a local folder to a Deep Lake Dataset. Images should be stored in subfolders by class name.
- Parameters
src (str, pathlib.Path) – Local path to where the unstructured dataset of images is stored or path to csv file.
dest (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://org_id/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset, ~/path/to/dataset, or path/to/dataset.
- a memory path of the form mem://path/to/dataset, which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
image_params (Optional[Dict]) – A dictionary containing parameters for the images tensor.
label_params (Optional[Dict]) – A dictionary containing parameters for the labels tensor.
dest_creds (Optional[Union[str, Dict]]) – The string ENV or a dictionary containing credentials used to access the destination path of the dataset.
progressbar (bool) – Enables or disables ingestion progress bar. Defaults to True.
summary (bool) – If True, a summary of skipped files will be printed after completion. Defaults to True.
num_workers (int) – The number of workers to use for ingestion. Set to 0 by default.
shuffle (bool) – Shuffles the input data prior to ingestion. Since data arranged in folders by class is highly non-random, shuffling is important in order to produce optimal results when training. Defaults to True.
token (Optional[str]) – The token to use for accessing the dataset.
connect_kwargs (Optional[Dict]) – If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to Dataset.connect.
**dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.empty().
- Returns
New dataset object with structured dataset.
- Return type
- Raises
InvalidPathException – If the source directory does not exist.
SamePathException – If the source and destination path are same.
AutoCompressionError – If the source directory is empty or does not contain a valid extension.
InvalidFileExtension – If the most frequent file extension is found to be ‘None’ during auto-compression.
Note
Currently only local source paths and image classification datasets / csv files are supported for automatic ingestion.
Supported filetypes: png/jpeg/jpg/csv.
All files and sub-directories with unsupported filetypes are ignored.
Valid source directory structures for image classification look like:
data/
    img0.jpg
    img1.jpg
    ...
or:
data/
    class0/
        cat0.jpg
        ...
    class1/
        dog0.jpg
        ...
    ...
or:
data/
    train/
        class0/
            img0.jpg
            ...
        ...
    val/
        class0/
            img0.jpg
            ...
        ...
    ...
Classes defined as sub-directories can be accessed at ds["test/labels"].info.class_names.
Support for train and test sub-directories is present under ds["train/images"], ds["train/labels"] and ds["test/images"], ds["test/labels"].
Mapping filenames to classes from an external file is currently not supported.
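A minimal sketch, assuming a local folder organized into class sub-directories as shown above; the paths and organization name are placeholders.
>>> import deeplake
>>> ds = deeplake.ingest_classification(
...     "./data",                                # folder structured as in the layouts above
...     "hub://org_id/animals_classification",   # destination Deep Lake dataset
...     num_workers=2,
...     shuffle=True,
... )
>>> ds["labels"].info.class_names  # class names are inferred from the folder names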
- deeplake.ingest_coco(images_directory: Union[str, Path], annotation_files: Union[str, Path, List[str]], dest: Union[str, Path], key_to_tensor_mapping: Optional[Dict] = None, file_to_group_mapping: Optional[Dict] = None, ignore_one_group: bool = True, ignore_keys: Optional[List[str]] = None, image_params: Optional[Dict] = None, image_creds_key: Optional[str] = None, src_creds: Optional[Union[str, Dict]] = None, dest_creds: Optional[Union[str, Dict]] = None, inspect_limit: int = 1000000, progressbar: bool = True, shuffle: bool = False, num_workers: int = 0, token: Optional[str] = None, connect_kwargs: Optional[Dict] = None, **dataset_kwargs) Dataset
Ingest images and annotations in COCO format to a Deep Lake Dataset. The source data can be stored locally or in the cloud.
Examples
>>> # Ingest local data in COCO format to a Deep Lake dataset stored in Deep Lake storage.
>>> ds = deeplake.ingest_coco(
>>>     "<path/to/images/directory>",
>>>     ["path/to/annotation/file1.json", "path/to/annotation/file2.json"],
>>>     dest="hub://org_id/dataset",
>>>     key_to_tensor_mapping={"category_id": "labels", "bbox": "boxes"},
>>>     file_to_group_mapping={"file1.json": "group1", "file2.json": "group2"},
>>>     ignore_keys=["area", "image_id", "id"],
>>>     num_workers=4,
>>> )
>>> # Ingest data from your cloud into another Deep Lake dataset in your cloud, and connect that dataset to the Deep Lake backend.
>>> ds = deeplake.ingest_coco(
>>>     "s3://bucket/images/directory",
>>>     "s3://bucket/annotation/file1.json",
>>>     dest="s3://bucket/dataset_name",
>>>     ignore_one_group=True,
>>>     ignore_keys=["area", "image_id", "id"],
>>>     image_params={"name": "images", "htype": "link[image]", "sample_compression": "jpeg"},
>>>     image_creds_key="my_s3_managed_credentials",
>>>     src_creds=aws_creds,  # Can also be inferred from environment
>>>     dest_creds=aws_creds,  # Can also be inferred from environment
>>>     connect_kwargs={"creds_key": "my_s3_managed_credentials", "org_id": "org_id"},
>>>     num_workers=4,
>>> )
- Parameters
images_directory (str, pathlib.Path) – The path to the directory containing images.
annotation_files (str, pathlib.Path, List[str]) – Path to JSON annotation files in COCO format.
dest (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://org_id/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from command line), or pass in a token using the 'token' parameter.
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset, ~/path/to/dataset, or path/to/dataset.
- a memory path of the form mem://path/to/dataset, which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
key_to_tensor_mapping (Optional[Dict]) – A one-to-one mapping between COCO keys and Dataset tensor names.
file_to_group_mapping (Optional[Dict]) – A one-to-one mapping between COCO annotation file names and Dataset group names.
ignore_one_group (bool) – Skip creation of a group in case of a single annotation file. Set to True by default.
ignore_keys (List[str]) – A list of COCO keys to ignore.
image_params (Optional[Dict]) – A dictionary containing parameters for the images tensor.
image_creds_key (Optional[str]) – The name of the managed credentials to use for accessing the images in the linked tensor (if applicable).
src_creds (Optional[Union[str, Dict]]) – Credentials to access the source data. If not provided, will be inferred from the environment.
dest_creds (Optional[Union[str, Dict]]) – The string ENV or a dictionary containing credentials used to access the destination path of the dataset.
inspect_limit (int) – The maximum number of samples to inspect in the annotations json, in order to generate the set of COCO annotation keys. Set to 1000000 by default.
progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.
shuffle (bool) – Shuffles the input data prior to ingestion. Set to False by default.
num_workers (int) – The number of workers to use for ingestion. Set to 0 by default.
token (Optional[str]) – The token to use for accessing the dataset and/or connecting it to Deep Lake.
connect_kwargs (Optional[Dict]) – If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to Dataset.connect.
**dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.empty().
- Returns
The Dataset created from images and COCO annotations.
- Return type
- Raises
IngestionError – If either key_to_tensor_mapping or file_to_group_mapping is not one-to-one.
- deeplake.ingest_yolo(data_directory: Union[str, Path], dest: Union[str, Path], class_names_file: Optional[Union[str, Path]] = None, annotations_directory: Optional[Union[str, Path]] = None, allow_no_annotation: bool = False, image_params: Optional[Dict] = None, label_params: Optional[Dict] = None, coordinates_params: Optional[Dict] = None, src_creds: Optional[Union[str, Dict]] = None, dest_creds: Optional[Union[str, Dict]] = None, image_creds_key: Optional[str] = None, inspect_limit: int = 1000, progressbar: bool = True, shuffle: bool = False, num_workers: int = 0, token: Optional[str] = None, connect_kwargs: Optional[Dict] = None, **dataset_kwargs) Dataset
Ingest images and annotations (bounding boxes or polygons) in YOLO format to a Deep Lake Dataset. The source data can be stored locally or in the cloud.
Examples
>>> # Ingest local data in YOLO format to a Deep Lake dataset stored in Deep Lake storage.
>>> ds = deeplake.ingest_yolo(
>>>     "path/to/data/directory",
>>>     dest="hub://org_id/dataset",
>>>     allow_no_annotation=True,
>>>     token="my_activeloop_token",
>>>     num_workers=4,
>>> )
>>> # Ingest data from your cloud into another Deep Lake dataset in your cloud, and connect that dataset to the Deep Lake backend.
>>> ds = deeplake.ingest_yolo(
>>>     "s3://bucket/data_directory",
>>>     dest="s3://bucket/dataset_name",
>>>     image_params={"name": "image_links", "htype": "link[image]"},
>>>     image_creds_key="my_s3_managed_credentials",
>>>     src_creds=aws_creds,  # Can also be inferred from environment
>>>     dest_creds=aws_creds,  # Can also be inferred from environment
>>>     connect_kwargs={"creds_key": "my_s3_managed_credentials", "org_id": "org_id"},
>>>     num_workers=4,
>>> )
- Parameters
data_directory (str, pathlib.Path) – The path to the directory containing the data (image files and annotation files; see the 'annotations_directory' input for specifying annotations in a separate directory).
dest (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://org_id/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from command line), or pass in a token using the 'token' parameter.
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset, ~/path/to/dataset, or path/to/dataset.
- a memory path of the form mem://path/to/dataset, which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
class_names_file – Path to the file containing the class names on separate lines. This is typically a file titled classes.names.
annotations_directory (Optional[Union[str, pathlib.Path]]) – Path to directory containing the annotations. If specified, the ‘data_directory’ will not be examined for annotations.
allow_no_annotation (bool) – Flag to determine whether missing annotation files corresponding to an image should be treated as empty annotations. Set to False by default.
label_params (Optional[Dict]) – A dictionary containing parameters for the labels tensor.
coordinates_params (Optional[Dict]) – A dictionary containing parameters for the coordinates tensor. This tensor contains either bounding boxes or polygons.
src_creds (Optional[Union[str, Dict]]) – Credentials to access the source data. If not provided, will be inferred from the environment.
dest_creds (Optional[Union[str, Dict]]) – The string ENV or a dictionary containing credentials used to access the destination path of the dataset.
image_creds_key (Optional[str]) – creds_key for linked tensors, applicable if the htype for the images tensor is specified as 'link[image]' in the 'image_params' input.
inspect_limit (int) – The maximum number of annotations to inspect, in order to infer whether they are bounding boxes or polygons. This input is ignored if the htype is specified in the 'coordinates_params'.
progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.
shuffle (bool) – Shuffles the input data prior to ingestion. Set to False by default.
num_workers (int) – The number of workers to use for ingestion. Set to 0 by default.
token (Optional[str]) – The token to use for accessing the dataset and/or connecting it to Deep Lake.
connect_kwargs (Optional[Dict]) – If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to Dataset.connect.
**dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.empty().
- Returns
The Dataset created from the images and YOLO annotations.
- Return type
- Raises
IngestionError – If annotations are not found for all the images and ‘allow_no_annotation’ is False
- deeplake.ingest_kaggle(tag: str, src: Union[str, Path], dest: Union[str, Path], exist_ok: bool = False, images_compression: str = 'auto', dest_creds: Optional[Union[str, Dict]] = None, kaggle_credentials: Optional[dict] = None, progressbar: bool = True, summary: bool = True, shuffle: bool = True, **dataset_kwargs) Dataset
Download and ingest a Kaggle dataset and store it as a structured dataset at the destination.
- Parameters
tag (str) – Kaggle dataset tag. Example: "coloradokb/dandelionimages" points to https://www.kaggle.com/coloradokb/dandelionimages
src (str, pathlib.Path) – Local path to where the raw Kaggle dataset will be downloaded.
dest (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset, ~/path/to/dataset, or path/to/dataset.
- a memory path of the form mem://path/to/dataset, which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
exist_ok (bool) – If the Kaggle dataset was already downloaded and exist_ok is True, ingestion will proceed without error.
images_compression (str) – For image classification datasets, this compression will be used for the images tensor. If images_compression is "auto", compression will be automatically determined by the most common extension in the directory.
dest_creds (Optional[Union[str, Dict]]) – The string ENV or a dictionary containing credentials used to access the destination path of the dataset.
kaggle_credentials (dict) – A dictionary containing Kaggle credentials {"username": "YOUR_USERNAME", "key": "YOUR_KEY"}. If None, environment variables/the kaggle.json file will be used if available.
progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.
summary (bool) – Generates ingestion summary. Set to True by default.
shuffle (bool) – Shuffles the input data prior to ingestion. Since data arranged in folders by class is highly non-random, shuffling is important in order to produce optimal results when training. Defaults to True.
**dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.dataset().
- Returns
New dataset object with structured dataset.
- Return type
- Raises
SamePathException – If the source and destination path are same.
Note
Currently only local source paths and image classification datasets are supported for automatic ingestion.
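A minimal sketch, assuming Kaggle credentials are available via environment variables or kaggle.json; the tag follows the example above and the local/destination paths are placeholders.
>>> import deeplake
>>> ds = deeplake.ingest_kaggle(
...     tag="coloradokb/dandelionimages",
...     src="./downloads/dandelionimages",     # where the raw Kaggle files are downloaded
...     dest="hub://org_id/dandelionimages",   # destination Deep Lake dataset
...     images_compression="auto",
... )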
- deeplake.ingest_dataframe(src, dest: Union[str, Path], column_params: Optional[Dict] = None, src_creds: Optional[Union[str, Dict]] = None, dest_creds: Optional[Union[str, Dict]] = None, creds_key: Optional[Dict] = None, progressbar: bool = True, token: Optional[str] = None, connect_kwargs: Optional[Dict] = None, **dataset_kwargs)
Convert pandas dataframe to a Deep Lake Dataset. The contents of the dataframe can be parsed literally, or can be treated as links to local or cloud files.
Examples
>>> # Ingest data from a DataFrame into a Deep Lake dataset stored in Deep Lake storage.
>>> ds = deeplake.ingest_dataframe(
>>>     df,
>>>     dest="hub://org_id/dataset",
>>> )
>>> # Ingest data from a DataFrame into a Deep Lake dataset stored in Deep Lake storage.
>>> # The filenames in `df_column_with_cloud_paths` will be used as the filenames for loading data into the dataset.
>>> ds = deeplake.ingest_dataframe(
>>>     df,
>>>     dest="hub://org_id/dataset",
>>>     column_params={"df_column_with_cloud_paths": {"name": "images", "htype": "image"}},
>>>     src_creds=aws_creds,
>>> )
>>> # Ingest data from a DataFrame into a Deep Lake dataset stored in Deep Lake storage.
>>> # The filenames in `df_column_with_cloud_paths` will be used as the filenames for linked data in the dataset.
>>> ds = deeplake.ingest_dataframe(
>>>     df,
>>>     dest="hub://org_id/dataset",
>>>     column_params={"df_column_with_cloud_paths": {"name": "image_links", "htype": "link[image]"}},
>>>     creds_key="my_s3_managed_credentials",
>>> )
>>> # Ingest data from a DataFrame into a Deep Lake dataset stored in your cloud, and connect that dataset to the Deep Lake backend.
>>> # The filenames in `df_column_with_cloud_paths` will be used as the filenames for linked data in the dataset.
>>> ds = deeplake.ingest_dataframe(
>>>     df,
>>>     dest="s3://bucket/dataset_name",
>>>     column_params={"df_column_with_cloud_paths": {"name": "image_links", "htype": "link[image]"}},
>>>     creds_key="my_s3_managed_credentials",
>>>     connect_kwargs={"creds_key": "my_s3_managed_credentials", "org_id": "org_id"},
>>> )
- Parameters
src (pd.DataFrame) – The pandas dataframe to be converted.
dest (str, pathlib.Path) –
A Dataset object or the full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset, ~/path/to/dataset, or path/to/dataset.
- a memory path of the form mem://path/to/dataset, which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
column_params (Optional[Dict]) – A dictionary containing parameters for the tensors corresponding to the dataframe columns.
src_creds (Optional[Union[str, Dict]]) – Credentials to access the source data. If not provided, will be inferred from the environment.
dest_creds (Optional[Union[str, Dict]]) – The string ENV or a dictionary containing credentials used to access the destination path of the dataset.
creds_key (Optional[str]) – creds_key for linked tensors, applicable if the htype of any tensor is specified as 'link[...]' in the 'column_params' input.
progressbar (bool) – Enables or disables ingestion progress bar. Set to True by default.
token (Optional[str]) – The token to use for accessing the dataset.
connect_kwargs (Optional[Dict]) – A dictionary containing arguments to be passed to the dataset connect method. See Dataset.connect().
**dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.empty().
- Returns
New dataset created from the dataframe.
- Return type
- Raises
Exception – If src is not a valid pandas dataframe object.
- deeplake.ingest_huggingface(src, dest, use_progressbar=True, token: Optional[str] = None, connect_kwargs: Optional[Dict] = None, **dataset_kwargs) Dataset
Converts Hugging Face datasets to Deep Lake format.
- Parameters
src (hfDataset, DatasetDict) – Hugging Face Dataset or DatasetDict to be converted. Data in different splits of a DatasetDict will be stored under respective tensor groups.
dest (Dataset, str, pathlib.Path) – Destination dataset or path to it.
use_progressbar (bool) – Defines if progress bar should be used to show conversion progress.
token (Optional[str]) – The token to use for accessing the dataset and/or connecting it to Deep Lake.
connect_kwargs (Optional[Dict]) – If specified, the dataset will be connected to Deep Lake, and connect_kwargs will be passed to Dataset.connect.
**dataset_kwargs – Any arguments passed here will be forwarded to the dataset creator function. See deeplake.empty().
- Returns
The destination Deep Lake dataset.
- Return type
- Raises
ValueError – If dest is not a path or a Deep Lake Dataset.
Note
If the DatasetDict looks like:
>>> {
...     train: Dataset({
...         features: ['data']
...     }),
...     validation: Dataset({
...         features: ['data']
...     }),
...     test: Dataset({
...         features: ['data']
...     }),
... }
it will be converted to a Deep Lake Dataset with tensors ['train/data', 'validation/data', 'test/data'].
Features of the type Sequence(feature=Value(dtype='string')) are not supported. Columns of such type are skipped.
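A minimal sketch, assuming the Hugging Face datasets package is installed; the dataset name and destination path are placeholders.
>>> import deeplake
>>> from datasets import load_dataset
>>> hf_ds = load_dataset("mnist")  # a DatasetDict with 'train' and 'test' splits
>>> ds = deeplake.ingest_huggingface(hf_ds, "./datasets/mnist_deeplake")
>>> list(ds.tensors)  # splits become tensor groups, e.g. 'train/image', 'train/label', ...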
- deeplake.load(path: Union[str, Path], read_only: Optional[bool] = None, memory_cache_size: int = 2000, local_cache_size: int = 0, creds: Optional[Union[dict, str]] = None, token: Optional[str] = None, org_id: Optional[str] = None, verbose: bool = True, access_method: str = 'stream', unlink: bool = False, reset: bool = False, check_integrity: bool = True, lock_timeout: Optional[int] = 0, lock_enabled: Optional[bool] = True) Dataset
Loads an existing dataset
Examples
>>> ds = deeplake.load("hub://username/dataset") >>> ds = deeplake.load("s3://mybucket/my_dataset") >>> ds = deeplake.load("./datasets/my_dataset", overwrite=True)
Loading a specific version:
>>> ds = deeplake.load("hub://username/dataset@new_branch") >>> ds = deeplake.load("hub://username/dataset@3e49cded62b6b335c74ff07e97f8451a37aca7b2)
>>> my_commit_id = "3e49cded62b6b335c74ff07e97f8451a37aca7b2"
>>> ds = deeplake.load(f"hub://username/dataset@{my_commit_id}")
- Parameters
path (str, pathlib.Path) –
The full path to the dataset. Can be:
- a Deep Lake cloud path of the form hub://username/datasetname. To write to Deep Lake cloud datasets, ensure that you are logged in to Deep Lake (use 'activeloop login' from command line).
- an s3 path of the form s3://bucketname/path/to/dataset. Credentials are required in either the environment or passed to the creds argument.
- a local file system path of the form ./path/to/dataset, ~/path/to/dataset, or path/to/dataset.
- a memory path of the form mem://path/to/dataset, which doesn't save the dataset but keeps it in memory instead. Should be used only for testing as it does not persist.
Loading a specific version: You can also specify a commit_id or branch to load the dataset at that version directly by using the @ symbol. The path will then be of the form hub://username/dataset@{branch} or hub://username/dataset@{commit_id}. See examples above.
read_only (bool, optional) – Opens dataset in read only mode if this is passed as True. Defaults to False. Datasets stored on Deep Lake cloud that your account does not have write access to will automatically open in read mode.
memory_cache_size (int) – The size of the memory cache to be used in MB.
local_cache_size (int) – The size of the local filesystem cache to be used in MB.
creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths. It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys. If 'ENV' is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying 'ENV' will override the credentials fetched from Activeloop and use local ones.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
org_id (str, Optional) – Organization id to be used for enabling enterprise features. Only applicable for local datasets.
verbose (bool) – If True, logs will be printed. Defaults to True.
access_method (str) –
.access_method (str) –
The access method to use for the dataset. Can be:
’stream’
Streams the data from the dataset i.e. only fetches data when required. This is the default value.
’download’
Downloads the data to the local filesystem to the path specified in environment variable
DEEPLAKE_DOWNLOAD_PATH
. This will overwriteDEEPLAKE_DOWNLOAD_PATH
.Raises an exception if
DEEPLAKE_DOWNLOAD_PATH
environment variable is not set or if the dataset does not exist.The ‘download’ access method can be modified to specify num_workers and/or scheduler. For example: ‘download:2:processed’ will use 2 workers and use processed scheduler, while ‘download:3’ will use 3 workers and default scheduler (threaded), and ‘download:processed’ will use a single worker and use processed scheduler.
’local’
Downloads the dataset if it doesn’t already exist, otherwise loads from local storage.
Raises an exception if
DEEPLAKE_DOWNLOAD_PATH
environment variable is not set.The ‘local’ access method can be modified to specify num_workers and/or scheduler to be used in case dataset needs to be downloaded. If dataset needs to be downloaded, ‘local:2:processed’ will use 2 workers and use processed scheduler, while ‘local:3’ will use 3 workers and default scheduler (threaded), and ‘local:processed’ will use a single worker and use processed scheduler.
unlink (bool) – Downloads linked samples if set to True. Only applicable if access_method is 'download' or 'local'. Defaults to False.
reset (bool) – If the specified dataset cannot be loaded due to a corrupted HEAD state of the branch being loaded, setting reset=True will reset HEAD changes and load the previous version.
check_integrity (bool) – If True, an integrity check is performed during dataset loading; otherwise the check is skipped.
- Returns
Dataset loaded using the arguments provided.
- Return type
- Raises
DatasetHandlerError – If a Dataset does not exist at the given path.
AgreementError – When agreement is rejected
UserNotLoggedInException – When user is not logged in
InvalidTokenException – If the specified token is invalid
TokenPermissionError – When there are permission or other errors related to token
CheckoutError – If version address specified in the path cannot be found
DatasetCorruptError – If loading the dataset failed due to corruption and reset is not True
ReadOnlyModeError – If reset is attempted in read-only mode
LockedException – When attempting to open a dataset for writing when it is locked by another machine
ValueError – If org_id is specified for a non-local dataset
Exception – Re-raises caught exception if reset cannot fix the issue
Warning
Setting access_method to 'download' will overwrite the local copy of the dataset if it was previously downloaded.
Note
Any changes made to the dataset in download / local mode will only be made to the local copy and will not be reflected in the original dataset.
- deeplake.delete(path: Union[str, Path], force: bool = False, large_ok: bool = False, creds: Optional[Union[dict, str]] = None, token: Optional[str] = None, verbose: bool = False) None
Deletes a dataset at a given path.
- Parameters
path (str, pathlib.Path) – The path to the dataset to be deleted.
force (bool) – Delete data regardless of whether it looks like a deeplake dataset. All data at the path will be removed if set to True.
large_ok (bool) – Delete datasets larger than 1GB. Disabled by default.
creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths. It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys. If 'ENV' is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying 'ENV' will override the credentials fetched from Activeloop and use local ones.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
verbose (bool) – If True, logs will be printed. Defaults to False.
- Raises
DatasetHandlerError – If a Dataset does not exist at the given path and force=False.
UserNotLoggedInException – When user is not logged in.
NotImplementedError – When attempting to delete a managed view.
ValueError – If version is specified in the path
Warning
This is an irreversible operation. Data once deleted cannot be recovered.
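A minimal sketch combining deeplake.exists() and deeplake.delete(); the path is a placeholder, and large_ok is only needed for datasets over 1GB.
>>> import deeplake
>>> if deeplake.exists("./datasets/scratch_ds"):
...     deeplake.delete("./datasets/scratch_ds", large_ok=True)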
- deeplake.rename(old_path: Union[str, Path], new_path: Union[str, Path], creds: Optional[Union[dict, str]] = None, token: Optional[str] = None) Dataset
Renames dataset at old_path to new_path.
Examples
>>> deeplake.rename("hub://username/image_ds", "hub://username/new_ds")
>>> deeplake.rename("s3://mybucket/my_ds", "s3://mybucket/renamed_ds")
- Parameters
old_path (str, pathlib.Path) – The path to the dataset to be renamed.
new_path (str, pathlib.Path) – Path to the dataset after renaming.
creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths. It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys. If 'ENV' is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying 'ENV' will override the credentials fetched from Activeloop and use local ones.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
- Returns
The renamed Dataset.
- Return type
- Raises
DatasetHandlerError – If a Dataset does not exist at the given path or if new path is to a different directory.
- deeplake.copy(src: Union[str, Path, Dataset], dest: Union[str, Path], runtime: Optional[dict] = None, tensors: Optional[List[str]] = None, overwrite: bool = False, src_creds=None, dest_creds=None, token=None, num_workers: int = 0, scheduler='threaded', progressbar=True, **kwargs)
Copies dataset at src to dest. Version control history is not included.
- Parameters
src (str, Dataset, pathlib.Path) – The Dataset or the path to the dataset to be copied.
dest (str, pathlib.Path) – Destination path to copy to.
runtime (dict) – Parameters for Activeloop DB Engine. Only applicable for hub:// paths.
tensors (List[str], optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.
overwrite (bool) – If True and a dataset exists at dest, it will be overwritten. Defaults to False.
src_creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths. It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys. If 'ENV' is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying 'ENV' will override the credentials fetched from Activeloop and use local ones.
dest_creds (dict, optional) – creds required to create / overwrite datasets at dest.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
num_workers (int) – The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
scheduler (str) – The scheduler to be used for copying. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.
progressbar (bool) – Displays a progress bar if True (default).
**kwargs (dict) – Additional keyword arguments
- Returns
New dataset object.
- Return type
- Raises
DatasetHandlerError – If a dataset already exists at destination path and overwrite is False.
UnsupportedParameterException – If a parameter that is no longer supported is specified.
DatasetCorruptError – If loading source dataset fails with DatasetCorruptedError.
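A minimal sketch; the source and destination paths are placeholders. Unlike deeplake.deepcopy() below, no version control history is carried over.
>>> import deeplake
>>> ds = deeplake.copy(
...     src="hub://username/source_dataset",
...     dest="./datasets/local_copy",
...     num_workers=4,
...     scheduler="threaded",
... )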
- deeplake.deepcopy(src: Union[str, Path, Dataset], dest: Union[str, Path], runtime: Optional[Dict] = None, tensors: Optional[List[str]] = None, overwrite: bool = False, src_creds=None, dest_creds=None, token=None, num_workers: int = 0, scheduler='threaded', progressbar=True, public: bool = False, verbose: bool = True, **kwargs)
Copies dataset at src to dest, including version control history.
- Parameters
src (str, pathlib.Path, Dataset) – The Dataset or the path to the dataset to be copied.
dest (str, pathlib.Path) – Destination path to copy to.
runtime (dict) – Parameters for Activeloop DB Engine. Only applicable for hub:// paths.
tensors (List[str], optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.
overwrite (bool) – If True and a dataset exists at destination, it will be overwritten. Defaults to False.
src_creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths. It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys. If 'ENV' is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying 'ENV' will override the credentials fetched from Activeloop and use local ones.
dest_creds (dict, optional) – creds required to create / overwrite datasets at dest.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
num_workers (int) – The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
scheduler (str) – The scheduler to be used for copying. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.
progressbar (bool) – Displays a progress bar if True (default).
public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.
verbose (bool) – If True, logs will be printed. Defaults to True.
**kwargs – Additional keyword arguments
- Returns
New dataset object.
- Return type
- Raises
DatasetHandlerError – If a dataset already exists at destination path and overwrite is False.
TypeError – If source is not a dataset.
UnsupportedParameterException – If a parameter that is no longer supported is specified.
DatasetCorruptError – If loading source dataset fails with DatasetCorruptedError
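A minimal sketch; the paths are placeholders. The only behavioral difference from the copy example above is that the full version control history is preserved.
>>> import deeplake
>>> ds = deeplake.deepcopy(
...     src="hub://username/source_dataset",
...     dest="./datasets/full_copy",
...     num_workers=4,
...     progressbar=True,
... )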
- deeplake.connect(src_path: str, creds_key: str, dest_path: Optional[str] = None, org_id: Optional[str] = None, ds_name: Optional[str] = None, token: Optional[str] = None) Dataset
Connects dataset at src_path to Deep Lake via the provided path.
Examples
>>> # Connect an s3 dataset
>>> ds = deeplake.connect(src_path="s3://bucket/dataset", dest_path="hub://my_org/dataset", creds_key="my_managed_credentials_key", token="my_activeloop_token")
>>> # or
>>> ds = deeplake.connect(src_path="s3://bucket/dataset", org_id="my_org", creds_key="my_managed_credentials_key", token="my_activeloop_token")
- Parameters
src_path (str) – Cloud path to the source dataset. Can be: an s3 path like s3://bucket/path/to/dataset, a gcs path like gcs://bucket/path/to/dataset, or an azure path like az://account_name/container/path/to/dataset.
creds_key (str) – The managed credentials to be used for accessing the source path.
dest_path (str, optional) – The full path to where the connected Deep Lake dataset will reside. Can be a Deep Lake path like hub://organization/dataset.
org_id (str, optional) – The organization to where the connected Deep Lake dataset will be added.
ds_name (str, optional) – The name of the connected Deep Lake dataset. Will be inferred from dest_path or src_path if not provided.
token (str, optional) – Activeloop token used to fetch the managed credentials.
- Returns
The connected Deep Lake dataset.
- Return type
- Raises
InvalidSourcePathError – If the src_path is not a valid s3, gcs or azure path.
InvalidDestinationPathError – If dest_path, or org_id and ds_name, do not form a valid Deep Lake path.
TokenPermissionError – If the user does not have permission to create a dataset in the specified organization.
- deeplake.exists(path: Union[str, Path], creds: Optional[Union[str, Dict]] = None, token: Optional[str] = None) bool
Checks if a dataset exists at the given path.
- Parameters
path (str, pathlib.Path) – the path which needs to be checked.
creds (dict, str, optional) – The string ENV or a dictionary containing credentials used to access the dataset at the path. If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently only works with s3 paths. It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys. If 'ENV' is passed, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets. For datasets connected to hub cloud, specifying 'ENV' will override the credentials fetched from Activeloop and use local ones.
token (str, optional) – Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.
- Returns
A boolean confirming whether the dataset exists or not at the given path.
- Raises
ValueError – If version is specified in the path
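A minimal sketch; the paths are placeholders and the printed results depend on whether datasets actually exist there.
>>> import deeplake
>>> deeplake.exists("hub://username/dataset")
True
>>> deeplake.exists("./datasets/nonexistent_ds")
False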
- deeplake.read(path: Union[str, Path], verify: bool = False, creds: Optional[Dict] = None, compression: Optional[str] = None, storage: Optional[StorageProvider] = None) Sample
Utility that reads raw data from supported files into Deep Lake format.
Recompresses data into format required by the tensor if permitted by the tensor htype.
Simply copies the data in the file if file format matches sample_compression of the tensor, thus maximizing upload speeds.
Examples
>>> ds.create_tensor("images", htype="image", sample_compression="jpeg") >>> ds.images.append(deeplake.read("path/to/cat.jpg")) >>> ds.images.shape (1, 399, 640, 3)
>>> ds.create_tensor("videos", htype="video", sample_compression="mp4") >>> ds.videos.append(deeplake.read("path/to/video.mp4")) >>> ds.videos.shape (1, 136, 720, 1080, 3)
>>> ds.create_tensor("images", htype="image", sample_compression="jpeg") >>> ds.images.append(deeplake.read("https://picsum.photos/200/300")) >>> ds.images[0].shape (300, 200, 3)
Supported file types:
Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm" Audio: "flac", "mp3", "wav" Video: "mp4", "mkv", "avi" Dicom: "dcm" Nifti: "nii", "nii.gz"
- Parameters
path (str) – Path to a supported file.
verify (bool) – If True, contents of the file are verified.
creds (optional, Dict) – Credentials for s3, gcp and http urls.
compression (optional, str) – Format of the file. Only required if path does not have an extension.
storage (optional, StorageProvider) – Storage provider to use to retrieve remote files. Useful if multiple files are being read from same storage to minimize overhead of creating a new provider.
- Returns
Sample object. Call sample.array to get the np.ndarray.
- Return type
Note
No data is actually loaded until you try to get a property of the returned Sample. This is useful for passing along to Tensor.append and Tensor.extend.
- deeplake.link(path: str, creds_key: Optional[str] = None) LinkedSample
Utility that stores a link to raw data. Used to add data to a Deep Lake Dataset without copying it. See Link htype.
Supported file types:
Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm" Audio: "flac", "mp3", "wav" Video: "mp4", "mkv", "avi" Dicom: "dcm" Nifti: "nii", "nii.gz"
- Parameters
path (str) – Path to a supported file.
creds_key (optional, str) – The credential key to use to read data for this sample. The actual credentials are fetched from the dataset.
- Returns
LinkedSample object that stores path and creds.
- Return type
Examples
>>> ds = deeplake.dataset("test/test_ds") >>> ds.create_tensor("images", htype="link[image]", sample_compression="jpeg") >>> ds.images.append(deeplake.link("https://picsum.photos/200/300"))
See more examples here.
- deeplake.link_tiled(path_array: ndarray, creds_key: Optional[str] = None) LinkedTiledSample
Utility that stores links to multiple images that act as tiles and together form a big image. These images must all have the exact same dimensions. Used to add data to a Deep Lake Dataset without copying it. See Link htype.
Supported file types:
Image: "bmp", "dib", "gif", "ico", "jpeg", "jpeg2000", "pcx", "png", "ppm", "sgi", "tga", "tiff", "webp", "wmf", "xbm"
- Parameters
path_array (np.ndarray) – N dimensional array of paths to the data, with paths corresponding to respective tiles. The array must have dtype=object and have string values. Each string must point to an image file with the same dimensions.
creds_key (optional, str) – The credential key to use to read data for this sample. The actual credentials are fetched from the dataset.
- Returns
LinkedTiledSample object that stores path_array and creds.
- Return type
LinkedTiledSample
Examples
>>> ds = deeplake.dataset("test/test_ds")
>>> ds.create_tensor("images", htype="link[image]", sample_compression="jpeg")
>>> arr = np.empty((10, 10), dtype=object)
>>> for j, i in itertools.product(range(10), range(10)):
...     arr[j, i] = f"s3://my_bucket/my_image_{j}_{i}.jpeg"
...
>>> ds.images.append(deeplake.link_tiled(arr, creds_key="my_s3_key"))
>>> # If all images are 1000x1200x3, we now have a 10000x12000x3 image in our dataset.
- deeplake.tiled(sample_shape: Tuple[int, ...], tile_shape: Optional[Tuple[int, ...]] = None, dtype: Union[str, dtype] = dtype('uint8'))
Allocates an empty sample of shape sample_shape, broken into tiles of shape tile_shape (except for edge tiles).
Example
>>> with ds:
...     ds.create_tensor("image", htype="image", sample_compression="png")
...     ds.image.append(deeplake.tiled(sample_shape=(1003, 1103, 3), tile_shape=(10, 10, 3)))
...     ds.image[0][-217:, :212, 1:] = np.random.randint(0, 256, (217, 212, 2), dtype=np.uint8)
- Parameters
sample_shape (Tuple[int, ...]) – Full shape of the sample.
tile_shape (Optional, Tuple[int, ...]) – The sample will be stored as tiles, where each tile will have this shape (except edge tiles). If not specified, it will be computed such that each tile is close to half of the tensor’s max_chunk_size (after compression).
dtype (Union[str, np.dtype]) – Dtype for the sample array. Default uint8.
- Returns
A PartialSample instance which can be appended to a Tensor.
- Return type
PartialSample
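A small sketch complementing the example above, with a hypothetical tensor name, dataset path and shapes; tile_shape is omitted so it is computed automatically from the tensor’s max_chunk_size:

import numpy as np
import deeplake

ds = deeplake.dataset("mem://tiled_demo")   # in-memory path, for illustration only
with ds:
    ds.create_tensor("heatmap", htype="image", sample_compression="png")
    # allocate an empty 4000x6000x3 sample; tiles are sized automatically
    ds.heatmap.append(deeplake.tiled(sample_shape=(4000, 6000, 3), dtype="uint8"))
    # write one region of the allocated sample
    ds.heatmap[0][:100, :100, :] = np.random.randint(0, 256, (100, 100, 3), dtype=np.uint8)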
- deeplake.compute(fn, name: Optional[str] = None) → Callable[[...], ComputeFunction]
Compute is a decorator for functions.
The function should have at least 2 arguments; the first two correspond to sample_in and samples_out. There can be as many other arguments as required.
The output should be appended/extended to the second argument in a Deep Lake-like syntax.
Any value returned by the fn will be ignored.
Example:
@deeplake.compute
def my_fn(sample_in: Any, samples_out, my_arg0, my_arg1=0):
    samples_out.my_tensor.append(my_arg0 * my_arg1)

# This transform can be used through the eval method in one of these 2 ways:

# Directly evaluating the method
# here arg0 and arg1 correspond to the 3rd and 4th argument in my_fn
my_fn(arg0, arg1).eval(data_in, ds_out, scheduler="threaded", num_workers=5)

# As a part of a Transform pipeline containing other functions
pipeline = deeplake.compose([my_fn(a, b), another_function(x=2)])
pipeline.eval(data_in, ds_out, scheduler="processed", num_workers=2)
The eval method evaluates the pipeline/transform function; a usage sketch follows the error list below. It has the following arguments:
data_in: Input passed to the transform to generate the output dataset. It should support __getitem__ and __len__. This can be a Deep Lake dataset.
ds_out (Dataset, optional): The dataset object to which the transform will be written. If this is not provided, data_in will be overwritten if it is a Deep Lake dataset; otherwise an error will be raised. All keys generated in the output should already be present in it as tensors. Its initial state should be either:
Empty, i.e. all tensors have no samples. In this case all samples are added to the dataset.
All tensors are populated and have the same length. In this case new samples are appended to the dataset.
num_workers (int): The number of workers to use for performing the transform. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
scheduler (str): The scheduler to be used to compute the transformation. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.
progressbar (bool): Displays a progress bar if True (default).
skip_ok (bool): If True, skips the check for generated output tensors. This allows the user to skip certain tensors in the function definition. This is especially useful for in-place transformations in which certain tensors are not modified. Defaults to False.
check_lengths (bool): If True, checks whether ds_out has tensors of the same length initially.
pad_data_in (bool): If True, pads tensors of data_in to match the length of the largest tensor in data_in. Defaults to False.
ignore_errors (bool): If True, input samples that cause the transform to fail will be skipped and the errors will be ignored if possible.
Note
pad_data_in is only applicable if data_in is a Deep Lake dataset.
It raises the following errors:
InvalidInputDataError: If data_in passed to the transform is invalid. It should support __getitem__ and __len__ operations. Using a scheduler other than "threaded" with a Deep Lake dataset whose base storage is memory as data_in will also raise this.
InvalidOutputDatasetError: If the tensors of ds_out passed to the transform don’t all have the same length. Using a scheduler other than "threaded" with a Deep Lake dataset whose base storage is memory as ds_out will also raise this.
TensorMismatchError: If one or more of the outputs generated during the transform contain different tensors than the ones present in ds_out provided to the transform.
UnsupportedSchedulerError: If the scheduler passed is not recognized. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’.
TransformError: All other exceptions raised if there are problems while running the pipeline.
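The usage sketch referenced above: because data_in only needs __getitem__ and __len__, a plain Python list can be used as input. The file paths, tensor name and dataset path are hypothetical; scheduler, num_workers, progressbar and ignore_errors are the eval arguments described in this section:

import deeplake

@deeplake.compute
def ingest(sample_in, samples_out):
    # sample_in is one element of data_in (here, a file path)
    samples_out.images.append(deeplake.read(sample_in))

file_paths = ["imgs/0.jpg", "imgs/1.jpg", "imgs/2.jpg"]   # hypothetical files
ds_out = deeplake.dataset("mem://ingest_demo")            # in-memory, for illustration only
ds_out.create_tensor("images", htype="image", sample_compression="jpeg")

ingest().eval(
    file_paths,
    ds_out,
    scheduler="threaded",
    num_workers=2,
    progressbar=True,
    ignore_errors=True,
)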
- deeplake.compose(functions: List[ComputeFunction])
Takes a list of functions decorated using deeplake.compute() and creates a pipeline that can be evaluated using .eval.
Example:
pipeline = deeplake.compose([my_fn(a=3), another_function(b=2)])
pipeline.eval(data_in, ds_out, scheduler="processed", num_workers=2)
The eval method evaluates the pipeline/transform function. It has the following arguments:
data_in: Input passed to the transform to generate the output dataset. It should support __getitem__ and __len__. This can be a Deep Lake dataset.
ds_out (Dataset, optional): The dataset object to which the transform will be written. If this is not provided, data_in will be overwritten if it is a Deep Lake dataset; otherwise an error will be raised. All keys generated in the output should already be present in it as tensors. Its initial state should be either:
Empty, i.e. all tensors have no samples. In this case all samples are added to the dataset.
All tensors are populated and have the same length. In this case new samples are appended to the dataset.
num_workers (int): The number of workers to use for performing the transform. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
scheduler (str): The scheduler to be used to compute the transformation. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.
progressbar (bool): Displays a progress bar if True (default).
skip_ok (bool): If True, skips the check for generated output tensors. This allows the user to skip certain tensors in the function definition. This is especially useful for in-place transformations in which certain tensors are not modified. Defaults to False.
ignore_errors (bool): If True, input samples that cause the transform to fail will be skipped and the errors will be ignored if possible.
It raises the following errors:
InvalidInputDataError: If data_in passed to the transform is invalid. It should support __getitem__ and __len__ operations. Using a scheduler other than "threaded" with a Deep Lake dataset whose base storage is memory as data_in will also raise this.
InvalidOutputDatasetError: If the tensors of ds_out passed to the transform don’t all have the same length. Using a scheduler other than "threaded" with a Deep Lake dataset whose base storage is memory as ds_out will also raise this.
TensorMismatchError: If one or more of the outputs generated during the transform contain different tensors than the ones present in ds_out provided to the transform.
UnsupportedSchedulerError: If the scheduler passed is not recognized. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’.
TransformError: All other exceptions raised if there are problems while running the pipeline.