deeplake.core.dataset

Dataset

class deeplake.core.dataset.Dataset

add_creds_key(creds_key: str, managed: bool = False)

Adds a new creds key to the dataset. These keys are used for tensors that are linked to external data.

Examples

>>> # create/load a dataset
>>> ds = deeplake.empty("path/to/dataset")
>>> # add a new creds key
>>> ds.add_creds_key("my_s3_key")

Parameters

creds_key (str) – The key to be added.
managed (bool) –
- If True, the creds corresponding to the key will be fetched from Activeloop platform.
- Defaults to False.

Raises

ValueError – If the dataset is not connected to Activeloop platform and managed is True.

Note

managed parameter is applicable only for datasets that are connected to Activeloop platform.

property allow_delete: bool: Returns True if dataset can be deleted from storage. Whether it can be deleted or not is stored in the database_meta.json and can be changed with allow_delete = True|False

append(sample: Dict[str, Any], skip_ok: bool = False, append_empty: bool = False)

Append samples to multiple tensors at once. This method expects all tensors being updated to be of the same length.

Parameters

sample (dict) – Dictionary with tensor names as keys and samples as values.
skip_ok (bool) – Skip tensors not in sample if set to True.
append_empty (bool) – Append empty samples to tensors not specified in sample if set to True. If True, skip_ok is ignored.

Raises

KeyError – If any tensor in the dataset is not a key in sample and skip_ok is False.
TensorDoesNotExistError – If tensor in sample does not exist.
ValueError – If the tensors being updated are not all of the same length.
NotImplementedError – If an error occurs while writing tiles.
Exception – Error while attempting to rollback appends.
SampleAppendingError – Error that occurs when someone tries to append a tensor value directly to the dataset without specifying tensor name.

Examples

>>> ds = deeplake.empty("../test/test_ds")
>>> ds.create_tensor('data')
Tensor(key='data')
>>> ds.create_tensor('labels')
Tensor(key='labels')
>>> ds.append({"data": [1, 2, 3, 4], "labels":[0, 1, 2, 3]})
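The interaction between skip_ok and append_empty can be sketched with plain Python lists standing in for tensors. This is a toy model of the rules documented above, not Deep Lake's implementation; append_row and TensorDoesNotExist are hypothetical names, and None stands in for an empty sample.

```python
from typing import Any, Dict, List


class TensorDoesNotExist(Exception):
    """Hypothetical stand-in for Deep Lake's TensorDoesNotExistError."""


def append_row(tensors: Dict[str, List[Any]], sample: Dict[str, Any],
               skip_ok: bool = False, append_empty: bool = False) -> None:
    """Toy model of the documented append rules (lists stand in for tensors)."""
    unknown = [name for name in sample if name not in tensors]
    if unknown:
        raise TensorDoesNotExist(f"tensor(s) do not exist: {unknown}")
    missing = [name for name in tensors if name not in sample]
    # append_empty takes precedence: when it is True, skip_ok is ignored.
    if missing and not (skip_ok or append_empty):
        raise KeyError(f"sample has no value for: {missing}")
    updated = list(tensors) if append_empty else list(sample)
    if len({len(tensors[name]) for name in updated}) > 1:
        raise ValueError("tensors being updated must have the same length")
    for name in updated:
        tensors[name].append(sample.get(name))  # None models an empty sample


ds = {"data": [], "labels": []}
append_row(ds, {"data": [1, 2, 3, 4], "labels": [0, 1, 2, 3]})
append_row(ds, {"data": [5, 6, 7, 8]}, append_empty=True)
print(ds["labels"])  # [[0, 1, 2, 3], None]
```

With append_empty=True every tensor grows by one row, so the tensors stay aligned. With skip_ok=True only the named tensors grow, which is why a later full append can fail the same-length check.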

property branch: str: The current branch of the dataset

property branches

Lists all the branches of the dataset.

Returns: List of branches.

checkout(address: str, create: bool = False, reset: bool = False) → Optional[str]

Checks out to a specific commit_id or branch. If create = True, creates a new branch with name address.

Parameters

address (str) – The commit_id or branch to checkout to.
create (bool) – If True, creates a new branch with name as address.
reset (bool) – If checkout fails due to a corrupted HEAD state of the branch, setting reset=True will reset HEAD changes and attempt the checkout again.

Returns

The commit_id of the dataset after checkout.

Return type

Optional[str]

Raises

CheckoutError – If address could not be found.
ReadOnlyModeError – If branch creation or reset is attempted in read-only mode.
DatasetCorruptError – If checkout failed due to dataset corruption and reset is not True.
Exception – If the dataset is a filtered view.

Examples

>>> ds = deeplake.empty("../test/test_ds")
>>> ds.create_tensor("abc")
Tensor(key='abc')
>>> ds.abc.append([1, 2, 3])
>>> first_commit = ds.commit()
>>> ds.checkout("alt", create=True)
'firstdbf9474d461a19e9333c2fd19b46115348f'
>>> ds.abc.append([4, 5, 6])
>>> ds.abc.numpy()
array([[1, 2, 3],
       [4, 5, 6]])
>>> ds.checkout(first_commit)
'firstdbf9474d461a19e9333c2fd19b46115348f'
>>> ds.abc.numpy()
array([[1, 2, 3]])

Note

Checkout from a head node in any branch that contains uncommitted data will lead to an automatic commit before the checkout.

clear_cache()

Flushes the contents of the cache layers, if any (see Dataset.flush()), and then deletes the contents of all of those layers.
This doesn’t delete data from the actual storage.
This is useful if you have multiple datasets with memory caches open, taking up too much RAM.
It is also useful when the local cache is no longer needed for certain datasets and is taking up storage space.

property client: Returns the client of the dataset.

commit(message: Optional[str] = None, allow_empty=False) → str

Stores a snapshot of the current state of the dataset.

Parameters

message (str, Optional) – Used to describe the commit.
allow_empty (bool) – If True, commit even if there are no changes.

Returns

The commit id of the saved commit, which can be used to access the snapshot.

Return type

str

Raises

Exception – If dataset is a filtered view.
EmptyCommitError – If there are no changes and the user has not forced a commit of unchanged data (allow_empty is False).

Note

Committing from a non-head node in any branch will lead to an automatic checkout to a new branch.
The same behaviour occurs if new samples are added or existing samples are updated from a non-head node.

property commit_id: Optional[str]: The last committed commit id of the dataset. If there are no commits, this returns None.

property commits: List[Dict]

Lists all the commits leading to the current dataset state.

Returns: List of dictionaries containing commit information.

connect(creds_key: str, dest_path: Optional[str] = None, org_id: Optional[str] = None, ds_name: Optional[str] = None, token: Optional[str] = None)

Connect a Deep Lake cloud dataset through a deeplake path.

Examples

>>> # create/load an s3 dataset
>>> s3_ds = deeplake.dataset("s3://bucket/dataset")
>>> ds = s3_ds.connect(dest_path="hub://my_org/dataset", creds_key="my_managed_credentials_key", token="my_activeloop_token")
>>> # or
>>> ds = s3_ds.connect(org_id="my_org", creds_key="my_managed_credentials_key", token="my_activeloop_token")

Parameters

creds_key (str) – The managed credentials to be used for accessing the source path.
dest_path (str, optional) – The full path to where the connected Deep Lake dataset will reside. Can be: a Deep Lake path like hub://organization/dataset
org_id (str, optional) – The organization to where the connected Deep Lake dataset will be added.
ds_name (str, optional) – The name of the connected Deep Lake dataset. Will be inferred from dest_path or src_path if not provided.
token (str, optional) – Activeloop token used to fetch the managed credentials.

Raises

InvalidSourcePathError – If the dataset’s path is not a valid s3, gcs or azure path.
InvalidDestinationPathError – If dest_path, or org_id and ds_name do not form a valid Deep Lake path.
TokenPermissionError – If the user does not have permission to create a dataset in the specified organization.

copy(dest: Union[str, Path], runtime: Optional[dict] = None, tensors: Optional[List[str]] = None, overwrite: bool = False, creds=None, token=None, num_workers: int = 0, scheduler='threaded', progressbar=True, public: bool = False)

Copies this dataset or dataset view to dest. Version control history is not included.

Parameters

dest (str, pathlib.Path) – Destination dataset or path to copy to. If a Dataset instance is provided, it is expected to be empty.
tensors (List[str], optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.
runtime (dict) – Parameters for Activeloop DB Engine. Only applicable for hub:// paths.
overwrite (bool) – If True and a dataset exists at destination, it will be overwritten. Defaults to False.
creds (dict, Optional) – creds required to create / overwrite datasets at dest.
token (str, Optional) – Token used for fetching credentials to dest.
num_workers (int) – The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
scheduler (str) – The scheduler to be used for copying. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.
progressbar (bool) – Displays a progress bar if True (default).
public (bool) – Defines if the dataset will have public access. Applicable only if Deep Lake cloud storage is used and a new Dataset is being created. Defaults to False.

Returns

New dataset object.

Return type

Dataset

Raises

DatasetHandlerError – If a dataset already exists at destination path and overwrite is False.

create_group(name: str, exist_ok=False) → Dataset

Creates a tensor group. Intermediate groups in the path are also created.

Parameters

name – The name of the group to create.
exist_ok – If True, the group is created if it does not exist. If False, an error is raised if the group already exists. Defaults to False.

Returns

The created group.

Raises

TensorGroupAlreadyExistsError – If the group already exists and exist_ok is False.

Examples

>>> ds.create_group("images")
>>> ds['images'].create_tensor("cats")

>>> ds.create_group("images/jpg/cats")
>>> ds["images"].create_tensor("png")
>>> ds["images/jpg"].create_group("dogs")
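How intermediate groups come into existence can be pictured with nested dictionaries. The sketch below is a toy model of the documented behaviour only: create_group here is a standalone function over dicts, and GroupExists is a hypothetical stand-in for TensorGroupAlreadyExistsError.

```python
from typing import Dict


class GroupExists(Exception):
    """Hypothetical stand-in for TensorGroupAlreadyExistsError."""


def create_group(root: Dict[str, dict], name: str, exist_ok: bool = False) -> Dict[str, dict]:
    """Create the group at `name`, creating missing intermediate groups on the way."""
    parts = [part for part in name.split("/") if part]
    node = root
    for depth, part in enumerate(parts):
        is_leaf = depth == len(parts) - 1
        if part in node:
            # Only the final path component is subject to the exist_ok check.
            if is_leaf and not exist_ok:
                raise GroupExists(name)
        else:
            node[part] = {}
        node = node[part]
    return node


root: Dict[str, dict] = {}
create_group(root, "images/jpg/cats")        # also creates "images" and "images/jpg"
create_group(root["images"]["jpg"], "dogs")  # relative to an existing group
print(root)  # {'images': {'jpg': {'cats': {}, 'dogs': {}}}}
```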

create_tensor(name: str, htype: str = 'unspecified', dtype: Union[str, dtype] = 'unspecified', sample_compression: str = 'unspecified', chunk_compression: str = 'unspecified', hidden: bool = False, create_sample_info_tensor: bool = True, create_shape_tensor: bool = True, create_id_tensor: bool = True, verify: bool = True, exist_ok: bool = False, verbose: bool = True, downsampling: Optional[Tuple[int, int]] = None, tiling_threshold: Optional[int] = None, **kwargs)

Creates a new tensor in the dataset.

Examples

>>> # create dataset
>>> ds = deeplake.dataset("path/to/dataset")

>>> # create tensors
>>> ds.create_tensor("images", htype="image", sample_compression="jpg")
>>> ds.create_tensor("videos", htype="video", sample_compression="mp4")
>>> ds.create_tensor("data")
>>> ds.create_tensor("point_clouds", htype="point_cloud")

>>> # append data
>>> ds.images.append(np.ones((400, 400, 3), dtype='uint8'))
>>> ds.videos.append(deeplake.read("videos/sample_video.mp4"))
>>> ds.data.append(np.zeros((100, 100, 2)))

Parameters

name (str) – The name of the tensor to be created.
htype (str) –
- The class of data for the tensor.
- The defaults for other parameters are determined in terms of this value.
- For example, htype="image" would have dtype default to uint8.
- These defaults can be overridden by explicitly passing any of the other parameters to this function.
- May also modify the defaults for other parameters.
dtype (str) – Optionally override this tensor’s dtype. All subsequent samples are required to have this dtype.
sample_compression (str) – All samples will be compressed in the provided format. If None, samples are uncompressed. For link[] tensors, sample_compression is used only for optimizing dataset views.
chunk_compression (str) – All chunks will be compressed in the provided format. If None, chunks are uncompressed. For link[] tensors, chunk_compression is used only for optimizing dataset views.
hidden (bool) – If True, the tensor will be hidden from ds.tensors but can still be accessed via ds[tensor_name].
create_sample_info_tensor (bool) – If True, meta data of individual samples will be saved in a hidden tensor. This data can be accessed via tensor[i].sample_info.
create_shape_tensor (bool) – If True, an associated tensor containing shapes of each sample will be created.
create_id_tensor (bool) – If True, an associated tensor containing unique ids for each sample will be created. This is useful for merge operations.
verify (bool) – Valid only for link htypes. If True, all links will be verified before they are added to the tensor. If False, links will be added without verification but note that create_shape_tensor and create_sample_info_tensor will be set to False.
exist_ok (bool) – If True, the tensor is created if it does not exist. If False, an error is raised if the tensor already exists.
verbose (bool) – Shows warnings if True.
downsampling (tuple[int, int]) – If not None, the tensor will be downsampled by the provided factors. For example, (2, 5) will downsample the tensor by a factor of 2 in both dimensions and create 5 layers of downsampled tensors. Only supported for image and mask htypes.
tiling_threshold (Optional, int) – In bytes. Tiles large images if their size exceeds this threshold. Set to -1 to disable tiling.
**kwargs –
- htype defaults can be overridden by passing any of the compatible parameters.
- To see all htypes and their correspondent arguments, check out Htypes.

Returns

The new tensor, which can be accessed by dataset[name] or dataset.name.

Return type

Tensor

Raises

TensorAlreadyExistsError – If the tensor already exists and exist_ok is False.
TensorGroupAlreadyExistsError – Duplicate tensor groups are not allowed.
InvalidTensorNameError – If name is in dataset attributes.
NotImplementedError – If trying to override chunk_compression.
TensorMetaInvalidHtype – If invalid htype is specified.
ValueError – If an illegal argument is specified.

create_tensor_like(name: str, source: Tensor, unlink: bool = False) → Tensor

Copies the source tensor’s meta information and creates a new tensor with it. No samples are copied, only the meta/info for the tensor is.

Examples

>>> ds.create_tensor_like("cats", ds["images"])

Parameters

name (str) – Name for the new tensor.
source (Tensor) – Tensor whose meta/info will be copied. May or may not be contained in the same dataset.
unlink (bool) – Whether to unlink linked tensors.

Returns

New Tensor object.

Return type

Tensor

dataloader(ignore_errors: bool = False, verbose: bool = False)

Returns a DeepLakeDataLoader object.

Parameters

ignore_errors (bool) – If True, the data loader will ignore errors that occur during data iteration; otherwise it will collect statistics and report the errors that occurred. Default value is False.
verbose (bool) – If True, the data loader will dump verbose logs of its steps. Default value is False.

Returns

A deeplake.enterprise.DeepLakeDataLoader object.

Return type

DeepLakeDataLoader

Examples

Creating a simple dataloader object which returns a batch of numpy arrays

>>> import deeplake
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> train_loader = ds_train.dataloader().numpy()
>>> for i, data in enumerate(train_loader):
...     # custom logic on data
...     pass

Creating dataloader with custom transformation and batch size

>>> import deeplake
>>> import torch
>>> from torchvision import datasets, transforms, models
>>>
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> tform = transforms.Compose([
...     transforms.ToPILImage(), # Must convert to PIL image for subsequent operations to run
...     transforms.RandomRotation(20), # Image augmentation
...     transforms.ToTensor(), # Must convert to pytorch tensor for subsequent operations to run
...     transforms.Normalize([0.5], [0.5]),
... ])
...
>>> batch_size = 32
>>> # create dataloader by chaining with transform function and batch size and returns batch of pytorch tensors
>>> train_loader = ds_train.dataloader()\
...     .transform({'images': tform, 'labels': None})\
...     .batch(batch_size)\
...     .shuffle()\
...     .pytorch()
...
>>> # loop over the elements
>>> for i, data in enumerate(train_loader):
...     # custom logic on data
...     pass

Creating dataloader and chaining with query

>>> ds = deeplake.load('hub://activeloop/coco-train')
>>> train_loader = ds.dataloader()\
...     .query("(select * where contains(categories, 'car') limit 1000) union (select * where contains(categories, 'motorcycle') limit 1000)")\
...     .pytorch()
...
>>> # loop over the elements
>>> for i, data in enumerate(train_loader):
...     # custom logic on data
...     pass

delete(large_ok=False)

Deletes the entire dataset from the cache layers (if any) and the underlying storage. This is an IRREVERSIBLE operation. Data once deleted cannot be recovered.

Parameters

large_ok (bool) – Delete datasets larger than 1 GB. Defaults to False.

Raises

DatasetTooLargeToDelete – If the dataset is larger than 1 GB and large_ok is False.
DatasetHandlerError – If the dataset is marked as allow_delete=False.

delete_branch(name: str) → None

Deletes the branch and cleans up any unneeded data. A branch can only be deleted if it has no sub-branches and has never been merged into another branch.

Parameters

name (str) – The branch to delete.

Raises

CommitError – If branch could not be found.
ReadOnlyModeError – If branch deletion is attempted in read-only mode.
Exception – If you have the given branch currently checked out.

Examples

>>> ds = deeplake.empty("../test/test_ds")
>>> ds.create_tensor("abc")
Tensor(key='abc')
>>> ds.abc.append([1, 2, 3])
>>> first_commit = ds.commit()
>>> ds.checkout("alt", create=True)
'firstdbf9474d461a19e9333c2fd19b46115348f'
>>> ds.abc.append([4, 5, 6])
>>> ds.abc.numpy()
array([[1, 2, 3],
       [4, 5, 6]])
>>> ds.checkout(first_commit)
'firstdbf9474d461a19e9333c2fd19b46115348f'
>>> ds.delete_branch("alt")

delete_group(name: str, large_ok: bool = False)

Delete a tensor group from the dataset.

Examples

>>> ds.delete_group("images/dogs")

Parameters

name (str) – The name of tensor group to be deleted.
large_ok (bool) – Delete tensor groups larger than 1 GB. Disabled by default.

Returns

None

Raises

TensorGroupDoesNotExistError – If tensor group of name name does not exist in the dataset.

delete_tensor(name: str, large_ok: bool = False)

Delete a tensor from the dataset.

Examples

>>> ds.delete_tensor("images/cats")

Parameters

name (str) – The name of tensor to be deleted.
large_ok (bool) – Delete tensors larger than 1 GB. Disabled by default.

Returns

None

Raises

TensorDoesNotExistError – If tensor of name name does not exist in the dataset.
TensorTooLargeToDelete – If the tensor is larger than 1 GB and large_ok is False.

delete_view(id: str)

Deletes the view with given view id.

Parameters: id (str) – Id of the view to delete.
Raises: KeyError – If the view with the given id does not exist.

diff(id_1: Optional[str] = None, id_2: Optional[str] = None, as_dict=False) → Optional[Dict]

Returns/displays the differences between commits/branches.

For each tensor this contains information about the sample indexes that were added/modified as well as whether the tensor was created.

Parameters

id_1 (str, Optional) – The first commit_id or branch name.
id_2 (str, Optional) – The second commit_id or branch name.
as_dict (bool, Optional) – If True, returns the diff as lists of commit wise dictionaries.

Returns

Optional[Dict]

Raises

ValueError – If id_1 is None and id_2 is not None.

Note

If both id_1 and id_2 are None, the differences between the current state and the previous commit will be calculated. If you’re at the head of the branch, this will show the uncommitted changes, if any.
If only id_1 is provided, the differences between the current state and id_1 will be calculated. If you’re at the head of the branch, this will take into account the uncommitted changes, if any.
If only id_2 is provided, a ValueError will be raised.
If both id_1 and id_2 are provided, the differences between id_1 and id_2 will be calculated.
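The four argument combinations can be summarised as a small decision function. This is only a restatement of the rules in the note above; diff_targets is a hypothetical helper, not part of the Deep Lake API, and the string labels are illustrative.

```python
from typing import Optional, Tuple


def diff_targets(id_1: Optional[str] = None, id_2: Optional[str] = None) -> Tuple[str, str]:
    """Return the (left, right) pair that the documented rules say is compared."""
    if id_1 is None and id_2 is not None:
        raise ValueError("id_2 cannot be given without id_1")
    if id_1 is None:
        return ("current state", "previous commit")
    if id_2 is None:
        return ("current state", id_1)
    return (id_1, id_2)


print(diff_targets())               # ('current state', 'previous commit')
print(diff_targets("main"))         # ('current state', 'main')
print(diff_targets("main", "alt"))  # ('main', 'alt')
```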

Note

A dictionary of the differences between the commits/branches is returned if as_dict is True. The dictionary will always have 2 keys, “dataset” and “tensors”. The values corresponding to these keys are detailed below:

If id_1 and id_2 are None, both the keys will have a single list as their value. This list will contain a dictionary describing changes compared to the previous commit.

If only id_1 is provided, both keys will have a tuple of 2 lists as their value. The lists will contain dictionaries describing commit-wise differences between commits. The 2 lists will range from the current state and from id_1, respectively, to the most recent common ancestor of the commits.

If only id_2 is provided, a ValueError will be raised.

If both id_1 and id_2 are provided, both keys will have a tuple of 2 lists as their value. The lists will contain dictionaries describing commit-wise differences between commits. The 2 lists will range from id_1 and from id_2, respectively, to the most recent common ancestor of the commits.

None is returned if as_dict is False.

extend(samples: Dict[str, Any], skip_ok: bool = False, append_empty: bool = False, ignore_errors: bool = False, progressbar: bool = False)

Appends multiple rows of samples to multiple tensors at once. This method expects all tensors being updated to be of the same length.

Parameters

samples (Dict[str, Any]) – Dictionary with tensor names as keys and samples as values.
skip_ok (bool) – Skip tensors not in samples if set to True.
append_empty (bool) – Append empty samples to tensors not specified in samples if set to True. If True, skip_ok is ignored.
ignore_errors (bool) – Skip samples that cause errors while extending, if set to True.
progressbar (bool) – Displays a progress bar if set to True.

Raises

KeyError – If any tensor in the dataset is not a key in samples and skip_ok is False.
TensorDoesNotExistError – If tensor in samples does not exist.
ValueError – If the tensors being updated are not all of the same length.
NotImplementedError – If an error occurs while writing tiles.
SampleExtendError – If the extend failed while appending a sample.
Exception – Error while attempting to rollback appends.
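The effect of ignore_errors can be illustrated with a row-wise toy model in which lists stand in for tensors. Everything here is an assumption made for illustration: extend_rows and the validate callback are hypothetical, and real Deep Lake errors arise from its own checks (and are reported as SampleExtendError) rather than from a user-supplied validator.

```python
from typing import Any, Callable, Dict, List


def extend_rows(tensors: Dict[str, List[Any]], samples: Dict[str, List[Any]],
                validate: Callable[[str, Any], None],
                ignore_errors: bool = False) -> int:
    """Append samples row by row; return the number of rows actually appended."""
    lengths = {len(values) for values in samples.values()}
    if len(lengths) > 1:
        raise ValueError("all sample lists must have the same length")
    n_rows = lengths.pop() if lengths else 0
    appended = 0
    for i in range(n_rows):
        row = {name: values[i] for name, values in samples.items()}
        try:
            for name, value in row.items():
                validate(name, value)  # hypothetical per-sample check
        except Exception:
            if ignore_errors:
                continue  # skip the whole row so the tensors stay aligned
            raise
        for name, value in row.items():
            tensors[name].append(value)
        appended += 1
    return appended


def validate(name: str, value: Any) -> None:
    if name == "labels" and not isinstance(value, int):
        raise TypeError("label must be an int")


ds = {"data": [], "labels": []}
n = extend_rows(ds, {"data": [[1], [2], [3]], "labels": [0, "bad", 2]},
                validate, ignore_errors=True)
print(n, ds)  # 2 {'data': [[1], [3]], 'labels': [0, 2]}
```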

filter(function: Union[Callable, str], num_workers: int = 0, scheduler: str = 'threaded', progressbar: bool = True, save_result: bool = False, result_path: Optional[str] = None, result_ds_args: Optional[dict] = None)

Filters the dataset according to the filter function f(x: sample) -> bool.

Parameters

function (Callable, str) – Filter function that takes a sample as its argument and returns True if the sample should be included in the result and False otherwise. Also supports simplified expression evaluations. See deeplake.core.query.query.DatasetQuery for more details.
num_workers (int) – Level of parallelization of filter evaluations. 0 indicates in-place for-loop evaluation, multiprocessing is used otherwise.
scheduler (str) – Scheduler to use for multiprocessing evaluation. “threaded” is default.
progressbar (bool) – Display progress bar while filtering. True is default.
save_result (bool) – If True, result of the filter will be saved to a dataset asynchronously.
result_path (Optional, str) – Path to save the filter result. Only applicable if save_result is True.
result_ds_args (Optional, dict) – Additional args for result dataset. Only applicable if save_result is True.

Returns

View of Dataset with elements that satisfy filter function.

Example

Return a dataset view in which all samples have a label equal to 2:

>>> dataset.filter(lambda sample: sample.labels.numpy() == 2)

Append one dataset onto another (only works if their structure is identical):

>>> @deeplake.compute
... def dataset_append(sample_in, sample_out):
...     sample_out.append(sample_in.tensors)
...     return sample_out
...
>>> dataset_append().eval(
...     ds_in,
...     ds_out,
...     num_workers=2,
... )
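The contract of the filter function, f(sample) -> bool, can be shown without Deep Lake at all. In this sketch a view is modelled as the list of indices of the samples that pass; that representation, the Sample alias, and filter_indices are assumptions made for illustration, not a description of how Deep Lake stores views.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]


def filter_indices(samples: List[Sample], function: Callable[[Sample], bool]) -> List[int]:
    """Return the indices of the samples for which `function` returns True."""
    return [i for i, sample in enumerate(samples) if function(sample)]


samples = [{"labels": 2}, {"labels": 0}, {"labels": 2}, {"labels": 1}]
view = filter_indices(samples, lambda sample: sample["labels"] == 2)
print(view)  # [0, 2]
```

Keeping only indices means the original samples are not copied; the view simply selects from them.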

fix_vc(): Rebuilds version control info. To be used when the version control info is corrupted.

flush(): Necessary operation after writes if caches are being used. Writes all the dirty data from the cache layers (if any) to the underlying storage. Here dirty data corresponds to data that has been changed/assigned and but hasn’t yet been sent to the underlying storage.

get_commit_details(commit_id) → Dict

Get details of a particular commit.

Parameters: commit_id (str) – commit id of the commit.
Returns: Dictionary of details with keys - commit, author, time, message.
Return type: Dict
Raises: KeyError – If the given commit_id was not found in the dataset.

get_creds_keys() → Set[str]: Returns the set of creds keys added to the dataset. These are used to fetch external data in linked tensors.

get_managed_creds_keys() → List[str]: Returns the list of creds keys added to the dataset that are managed by Activeloop platform. These are used to fetch external data in linked tensors.

get_view(id: str) → ViewEntry

Returns the dataset view corresponding to id.

Examples

>>> # save view
>>> ds[:100].save_view(id="first_100")
>>> # load view
>>> first_100 = ds.get_view("first_100").load()
>>> # 100
>>> print(len(first_100))

See Dataset.save_view() to learn more about saving views.

Parameters: id (str) – id of required view.
Returns: ViewEntry
Raises: KeyError – If no such view exists.

get_views(commit_id: Optional[str] = None) → List[ViewEntry]

Returns list of views stored in this Dataset.

Parameters

commit_id (str, optional) –

Commit from which views should be returned.
If not specified, views from all commits are returned.

Returns

List of ViewEntry instances.

Return type

List[ViewEntry]

property groups: Dict[str, Dataset]: All sub groups in this group

property has_head_changes: Returns True if currently at head node and uncommitted changes are present.

property info: Returns the information about the dataset.

property is_head_node: Returns True if the current commit is the head node of the branch and False otherwise.

property is_view: bool: Returns True if this dataset is a view and False otherwise.

load_view(id: str, optimize: Optional[bool] = False, tensors: Optional[List[str]] = None, num_workers: int = 0, scheduler: str = 'threaded', progressbar: Optional[bool] = True)

Loads the view and returns the Dataset by id. Equivalent to ds.get_view(id).load().

Parameters

id (str) – id of the view to be loaded.
optimize (bool) – If True, the dataset view is optimized by copying and rechunking the required data before loading. This is necessary to achieve fast streaming speeds when training models using the dataset view. The optimization process will take some time, depending on the size of the data.
tensors (Optional, List[str]) – Tensors to be copied if optimize is True. By default all tensors are copied.
num_workers (int) – Number of workers to be used for the optimization process. Only applicable if optimize=True. Defaults to 0.
scheduler (str) – The scheduler to be used for optimization. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Only applicable if optimize=True. Defaults to ‘threaded’.
progressbar (bool) – Whether to use progressbar for optimization. Only applicable if optimize=True. Defaults to True.

Returns

The loaded view.

Return type

Dataset

Raises

KeyError – If the view with the given id does not exist.

log(): Displays the details of all the past commits.

property max_len: Return the maximum length of the tensor.

property max_view

Returns a view of the dataset in which shorter tensors are padded with None values to have the same length as the longest tensor.

Example

Creating a dataset with 5 images and 4 labels. ds.max_view will return a view with labels tensor padded to have 5 samples.

>>> import deeplake
>>> ds = deeplake.dataset("../test/test_ds", overwrite=True)
>>> ds.create_tensor("images", htype="link[image]", sample_compression="jpg")
>>> ds.create_tensor("labels", htype="class_label")
>>> ds.images.extend([deeplake.link("https://picsum.photos/20/20") for _ in range(5)])
>>> ds.labels.extend([0, 1, 2, 1])
>>> len(ds.images)
5
>>> len(ds.labels)
4
>>> for i, sample in enumerate(ds.max_view):
...     print(sample["images"].shape, sample["labels"].numpy())
...
(20, 20, 3) [0]
(20, 20, 3) [1]
(20, 20, 3) [2]
(20, 20, 3) [1]
(20, 20, 3) [None]
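The padding rule can be reproduced with plain lists. This is a toy model of the behaviour described above, not Deep Lake code; max_view here is a standalone hypothetical function that returns rows as dictionaries.

```python
from typing import Any, Dict, List


def max_view(tensors: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one row per index up to the longest tensor, padding short tensors with None."""
    longest = max((len(values) for values in tensors.values()), default=0)
    return [
        {name: (values[i] if i < len(values) else None) for name, values in tensors.items()}
        for i in range(longest)
    ]


rows = max_view({"images": ["a", "b", "c", "d", "e"], "labels": [0, 1, 2, 1]})
print(len(rows), rows[-1])  # 5 {'images': 'e', 'labels': None}
```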

merge(target_id: str, conflict_resolution: Optional[str] = None, delete_removed_tensors: bool = False, force: bool = False)

Merges the target_id into the current dataset.

Parameters

target_id (str) – The commit_id or branch to merge.
conflict_resolution (str, Optional) –
- The strategy to use to resolve merge conflicts.
- Conflicts are scenarios where both the current dataset and the target id have made changes to the same sample/s since their common ancestor.
- Must be one of the following
  - None - this is the default value, will raise an exception if there are conflicts.
  - “ours” - during conflicts, values from the current dataset will be used.
  - “theirs” - during conflicts, values from target id will be used.
delete_removed_tensors (bool) – If True, deleted tensors will be deleted from the dataset.
force (bool) –
- Forces merge.
- force=True will have these effects in the following cases of merge conflicts:
  - If tensor is renamed on target but is missing from HEAD, renamed tensor will be registered as a new tensor on current branch.
  - If tensor is renamed on both target and current branch, tensor on target will be registered as a new tensor on current branch.
  - If tensor is renamed on target and a new tensor of the new name was created on the current branch, they will be merged.

Raises

Exception – If the dataset is a filtered view.
ValueError – If the conflict resolution strategy is not one of None, “ours”, or “theirs”.
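The three conflict_resolution settings can be illustrated with a toy three-way merge over dictionaries that map a sample id to its value. This is a sketch under simplifying assumptions: merge_samples is hypothetical, deletions and tensor renames are not modelled, and RuntimeError stands in for whatever exception Deep Lake raises on an unresolved conflict.

```python
from typing import Any, Dict, Optional

_MISSING = object()  # marks "sample not present" in a snapshot


def merge_samples(base: Dict[Any, Any], ours: Dict[Any, Any], theirs: Dict[Any, Any],
                  conflict_resolution: Optional[str] = None) -> Dict[Any, Any]:
    """Merge `theirs` into `ours`, using `base` as the common ancestor."""
    if conflict_resolution not in (None, "ours", "theirs"):
        raise ValueError("conflict_resolution must be None, 'ours' or 'theirs'")
    merged = dict(ours)
    for key, their_value in theirs.items():
        base_value = base.get(key, _MISSING)
        our_value = ours.get(key, _MISSING)
        if their_value == base_value:
            continue  # the target did not touch this sample
        ours_changed = our_value is not _MISSING and our_value != base_value
        if ours_changed and our_value != their_value:
            # Both sides changed the same sample since the common ancestor.
            if conflict_resolution is None:
                raise RuntimeError(f"merge conflict on sample {key!r}")
            if conflict_resolution == "ours":
                continue
        merged[key] = their_value
    return merged


base = {0: "cat", 1: "dog"}
ours = {0: "cat", 1: "wolf"}
theirs = {0: "lion", 1: "fox", 2: "owl"}
print(merge_samples(base, ours, theirs, "ours"))    # {0: 'lion', 1: 'wolf', 2: 'owl'}
print(merge_samples(base, ours, theirs, "theirs"))  # {0: 'lion', 1: 'fox', 2: 'owl'}
```

Only sample 1 is a conflict here: sample 0 was changed on the target alone and sample 2 was added on the target alone, so both are taken regardless of the strategy.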

property meta: DatasetMeta: Returns the metadata of the dataset.

property min_len: Return the minimum length of the tensor.

property min_view

Returns a view of the dataset in which all tensors are sliced to have the same length as the shortest tensor.

Example

Creating a dataset with 5 images and 4 labels. ds.min_view will return a view in which tensors are sliced to have 4 samples.

>>> import deeplake
>>> ds = deeplake.dataset("../test/test_ds", overwrite=True)
>>> ds.create_tensor("images", htype="link[image]", sample_compression="jpg")
>>> ds.create_tensor("labels", htype="class_label")
>>> ds.images.extend([deeplake.link("https://picsum.photos/20/20") for _ in range(5)])
>>> ds.labels.extend([0, 1, 2, 1])
>>> len(ds.images)
5
>>> len(ds.labels)
4
>>> for i, sample in enumerate(ds.min_view):
...     print(sample["images"].shape, sample["labels"].numpy())
...
(20, 20, 3) [0]
(20, 20, 3) [1]
(20, 20, 3) [2]
(20, 20, 3) [1]

property no_view_dataset: Returns the same dataset without slicing.

property num_samples: int: Returns the length of the smallest tensor. Ignores any applied indexing and returns the total length.

property parent: Returns the parent of this group. Returns None if this is the root dataset.

property pending_commit_id: str: The commit_id of the next commit that will be made to the dataset. If you’re not at the head of the current branch, this will be the same as the commit_id.

pop(index: Optional[int] = None)

Removes a sample from all the tensors of the dataset. For any tensor if the index >= len(tensor), the sample won’t be popped from it.

Parameters

index (int, Optional) – The index of the sample to be removed. If it is None, the index becomes the length of the longest tensor - 1.

Raises

ValueError – If duplicate indices are provided.
IndexError – If the index is out of range.
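The default index and the rule for shorter tensors can be modelled with lists. This is a toy sketch of the documented behaviour: pop_sample is a hypothetical name, and rejecting negative indices is a simplification of this sketch rather than a statement about Deep Lake.

```python
from typing import Any, Dict, List, Optional


def pop_sample(tensors: Dict[str, List[Any]], index: Optional[int] = None) -> None:
    """Remove the sample at `index` from every tensor that is long enough to have it."""
    longest = max((len(values) for values in tensors.values()), default=0)
    if index is None:
        index = longest - 1  # default: last sample of the longest tensor
    if index < 0 or index >= longest:
        raise IndexError(f"index {index} is out of range")
    for values in tensors.values():
        if index < len(values):  # shorter tensors are left untouched
            del values[index]


ds = {"images": ["a", "b", "c", "d", "e"], "labels": [0, 1, 2, 1]}
pop_sample(ds)     # index 4: only "images" is long enough
pop_sample(ds, 1)  # index 1: removed from both tensors
print(ds)  # {'images': ['a', 'c', 'd'], 'labels': [0, 2, 1]}
```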

populate_creds(creds_key: str, creds: Optional[dict] = None, from_environment: bool = False)

Populates the creds key added in add_creds_key with the given creds. These creds are used to fetch the external data. This needs to be done every time the dataset is reloaded for datasets that contain links to external data.

Examples

>>> # create/load a dataset
>>> ds = deeplake.dataset("path/to/dataset")
>>> # add a new creds key
>>> ds.add_creds_key("my_s3_key")
>>> # populate the creds
>>> ds.populate_creds("my_s3_key", {"aws_access_key_id": "my_access_key", "aws_secret_access_key": "my_secret_key"})
>>> # or
>>> ds.populate_creds("my_s3_key", from_environment=True)

pytorch(transform: Optional[Callable] = None, tensors: Optional[Sequence[str]] = None, num_workers: int = 1, batch_size: int = 1, drop_last: bool = False, collate_fn: Optional[Callable] = None, pin_memory: bool = False, shuffle: bool = False, buffer_size: int = 2048, use_local_cache: bool = False, progressbar: bool = False, return_index: bool = True, pad_tensors: bool = False, transform_kwargs: Optional[Dict[str, Any]] = None, decode_method: Optional[Dict[str, str]] = None, cache_size: int = 32000000, *args, **kwargs)

Converts the dataset into a pytorch Dataloader.

Parameters

*args – Additional args to be passed to torch_dataset
**kwargs – Additional kwargs to be passed to torch_dataset
transform (Callable, Optional) – Transformation function to be applied to each sample.
tensors (List, Optional) – Optionally provide a list of tensor names in the ordering that your training script expects. For example, if your dataset has “image” and “label” tensors and tensors=["image", "label"], your training script should expect each batch to be provided as a tuple of (image, label).
num_workers (int) – The number of workers to use for fetching data in parallel.
batch_size (int) – Number of samples per batch to load. Default value is 1.
drop_last (bool) – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the dataset is not divisible by the batch size, then the last batch will be smaller. Default value is False. Read torch.utils.data.DataLoader docs for more details.
collate_fn (Callable, Optional) – Merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset. Read torch.utils.data.DataLoader docs for more details.
pin_memory (bool) – If True, the data loader will copy Tensors into CUDA pinned memory before returning them. Default value is False. Read torch.utils.data.DataLoader docs for more details.
shuffle (bool) – If True, the data loader will shuffle the data indices. Default value is False. Details about how Deep Lake shuffles data can be found at Shuffling in ds.pytorch().
buffer_size (int) – The size of the buffer used to shuffle the data in MBs. Defaults to 2048 MB. Increasing the buffer_size will increase the extent of shuffling.
use_local_cache (bool) – If True, the data loader will use a local cache to store data. The default cache location is ~/.activeloop/cache, but it can be changed by setting the LOCAL_CACHE_PREFIX environment variable. This is useful when the dataset can fit on the machine and we don’t want to fetch the data multiple times for each iteration. Default value is False.
progressbar (bool) – If True, tqdm will be wrapped around the returned dataloader. Default value is False.
return_index (bool) – If True, the returned dataloader will have a key “index” that contains the index of the sample(s) in the original dataset. Default value is True.
pad_tensors (bool) – If True, shorter tensors will be padded to the length of the longest tensor. Default value is False.
transform_kwargs (optional, Dict[str, Any]) – Additional kwargs to be passed to transform.
decode_method (Dict[str, str], Optional) –
The method for decoding the Deep Lake tensor data, the result of which is passed to the transform. Decoding occurs outside of the transform so that it can be performed in parallel and as rapidly as possible as per Deep Lake optimizations.
- Supported decode methods are:
  - 'numpy': Default behaviour. Returns samples as numpy arrays, the same as ds.tensor[i].numpy().
  - 'tobytes': Returns raw bytes of the samples, the same as ds.tensor[i].tobytes().
  - 'data': Returns a dictionary with keys and values depending on htype, the same as ds.tensor[i].data().
  - 'pil': Returns samples as PIL images. Especially useful when the transform uses torchvision transforms that require PIL images as input. Only supported for tensors with sample_compression='jpeg' or 'png'.
cache_size (int) – The size of the cache per tensor in MBs. Defaults to max(maximum chunk size of tensor, 32 MB).

Returns: A torch.utils.data.DataLoader object.
Raises: EmptyTensorError – If one or more tensors being passed to pytorch are empty.
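
The statement that a larger buffer_size increases the extent of shuffling can be illustrated with a generic buffer-based shuffle. The sketch below is not Deep Lake's shuffling code: buffered_shuffle is a hypothetical helper, and its buffer_size counts items, whereas the parameter above is measured in MB.

```python
import random
from typing import Iterable, Iterator, List, Optional, TypeVar

T = TypeVar("T")


def buffered_shuffle(
    stream: Iterable[T], buffer_size: int, seed: Optional[int] = None
) -> Iterator[T]:
    """Yield items from `stream` in a locally shuffled order."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in stream:
        if len(buffer) < buffer_size:
            buffer.append(item)
            continue
        # Buffer is full: emit a random resident and reuse its slot.
        slot = rng.randrange(buffer_size)
        yield buffer[slot]
        buffer[slot] = item
    # Drain whatever is left once the stream is exhausted.
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=0))
```

With a buffer of one item the input order is preserved, and an item can only be emitted early if it has already entered the buffer, so a small buffer mixes samples only locally.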

Note

PyTorch does not support uint16, uint32, uint64 dtypes. These are implicitly cast to int32, int64 and int64 respectively. This method spins up its own workers to fetch data.
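
The casts in the note above differ in whether the target dtype can hold every source value. The following sketch compares the integer ranges in pure Python. The helper names are hypothetical, and how out-of-range uint64 values are actually handled is not stated here.

```python
# Source dtype -> dtype it is cast to, as listed in the note above.
CASTS = {"uint16": "int32", "uint32": "int64", "uint64": "int64"}
BITS = {"uint16": 16, "uint32": 32, "uint64": 64, "int32": 32, "int64": 64}


def max_value(dtype: str) -> int:
    """Largest value representable by the named integer dtype."""
    bits = BITS[dtype]
    # Signed integers spend one bit on the sign.
    return 2 ** (bits - 1) - 1 if dtype.startswith("int") else 2 ** bits - 1


def cast_is_lossless(src: str) -> bool:
    """True if every value of `src` fits in the dtype it is cast to."""
    return max_value(src) <= max_value(CASTS[src])


for src, dst in CASTS.items():
    print(f"{src} -> {dst}: lossless={cast_is_lossless(src)}")
```

Only the uint64 case is not range-preserving: values above 2**63 - 1 have no int64 representation, so it is worth checking the value range of such tensors before training.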

query(query_string: str, runtime: Optional[Dict] = None, return_data: bool = False)

Returns a sliced Dataset with given query results.

It allows running SQL-like queries on the dataset and extracting the results. See the Tensor Query Language documentation for the supported keywords.

Parameters

query_string (str) – An SQL string, extended with additional functionality, to run on the given Dataset object.
runtime (Optional[Dict]) – Runtime parameters for query execution. Supported keys: {“tensor_db”: True or False}.
return_data (bool) – Defaults to False. Whether to return raw data along with the view.

Raises

ValueError – If return_data is True and runtime is not {“tensor_db”: True}.

Returns

A Dataset object.

Return type

Dataset

Examples

Query all samples in the dataset with labels other than 5.

>>> import deeplake
>>> ds = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> query_ds = ds.query("select * where labels != 5")

Query the first 1000 samples whose categories contain car and the first 1000 samples whose categories contain motorcycle.

>>> ds_train = deeplake.load('hub://activeloop/coco-train')
>>> query_ds_train = ds_train.query("(select * where contains(categories, 'car') limit 1000) union (select * where contains(categories, 'motorcycle') limit 1000)")
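
Union queries like the one above can get long, so it may help to assemble the string programmatically. The sketch below only builds the query text in the same shape as the example. union_query is a hypothetical helper, not part of deeplake, and it refuses values containing quotes because TQL quoting rules are not covered here.

```python
from typing import Sequence


def union_query(tensor: str, values: Sequence[str], limit: int) -> str:
    """Build a TQL string that takes up to `limit` samples per value."""
    parts = []
    for value in values:
        if "'" in value:
            # Quoting rules are not covered here, so refuse rather than guess.
            raise ValueError(f"unsupported character in value: {value!r}")
        parts.append(
            f"(select * where contains({tensor}, '{value}') limit {limit})"
        )
    return " union ".join(parts)


query_string = union_query("categories", ["car", "motorcycle"], limit=1000)
# query_string can then be passed to ds_train.query(...)
```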

random_split(lengths: Sequence[Union[int, float]])

Splits the dataset into non-overlapping Dataset objects of given lengths. If a list of fractions that sum up to 1 is given, the lengths will be computed automatically as floor(frac * len(dataset)) for each fraction provided.

After computing the lengths, if there are any remainders, 1 count will be distributed in round-robin fashion to the lengths until there are no remainders left.
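
The length computation described above can be sketched in plain Python. This follows the description only (floor of each fraction, then round-robin distribution of the remainder). split_lengths is a hypothetical helper, and the exact validation performed by random_split may differ.

```python
import math
from typing import List, Sequence, Union


def split_lengths(lengths: Sequence[Union[int, float]], total: int) -> List[int]:
    """Turn fractions (or absolute lengths) into integer split sizes."""
    if all(isinstance(x, int) for x in lengths):
        sizes = list(lengths)
    else:
        if any(x < 0 or x > 1 for x in lengths):
            raise ValueError("fractions must be between 0 and 1")
        if not math.isclose(sum(lengths), 1.0):
            raise ValueError("fractions must sum to 1")
        sizes = [math.floor(frac * total) for frac in lengths]
        # Hand out the leftover samples one at a time, round-robin.
        for i in range(total - sum(sizes)):
            sizes[i % len(sizes)] += 1
    if sum(sizes) != total:
        raise ValueError("lengths must sum to the dataset length")
    return sizes


print(split_lengths([0.8, 0.2], 5))
print(split_lengths([0.5, 0.5], 5))
```

In the second call each fraction floors to 2, which leaves one sample over. It goes to the first split.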

Example

>>> import deeplake
>>> ds = deeplake.dataset("../test/test_ds", overwrite=True)
>>> ds.create_tensor("labels", htype="class_label")
>>> ds.labels.extend([0, 1, 2, 1, 3])
>>> len(ds)
5
>>> train_ds, val_ds = ds.random_split([0.8, 0.2])
>>> len(train_ds)
4
>>> len(val_ds)
1
>>> train_ds, val_ds = ds.random_split([3, 2])
>>> len(train_ds)
3
>>> len(val_ds)
2
>>> train_loader = train_ds.pytorch(batch_size=2, shuffle=True)
>>> val_loader = val_ds.pytorch(batch_size=2, shuffle=False)

Parameters

lengths (Sequence[Union[int, float]]) – lengths or fractions of splits to be produced.

Returns

a tuple of datasets of the given lengths.

Return type

Tuple[Dataset, …]

Raises

ValueError – If the sum of the lengths is not equal to the length of the dataset.
ValueError – If the dataset has variable length tensors.
ValueError – If lengths are floats and one or more of them are not between 0 and 1.

property read_only: Returns True if dataset is in read-only mode and False otherwise.

rechunk(tensors: Optional[Union[str, List[str]]] = None, num_workers: int = 0, scheduler: str = 'threaded', progressbar: bool = True)

Rewrites the underlying chunks to make their sizes optimal. This is usually needed in cases where a lot of updates have been made to the data.

Parameters

tensors (str, List[str], Optional) – Name/names of the tensors to rechunk. If None, all tensors in the dataset are rechunked.
num_workers (int) – The number of workers to use for rechunking. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
scheduler (str) – The scheduler to be used for rechunking. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.
progressbar (bool) – Displays a progress bar if True (default).
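
The interaction between num_workers and scheduler can be stated as a one-line rule. The sketch below only restates the documented behaviour. effective_scheduler is a hypothetical helper and is not part of deeplake.

```python
def effective_scheduler(num_workers: int, scheduler: str = "threaded") -> str:
    """Return the scheduler that would actually run for the given settings."""
    if num_workers < 0:
        raise ValueError("num_workers must be non-negative")
    # Documented rule: zero workers always means serial processing.
    return "serial" if num_workers == 0 else scheduler


print(effective_scheduler(0, "ray"))
print(effective_scheduler(4, "processed"))
```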

rename(path: Union[str, Path])

Renames the dataset to path.

Example

>>> ds = deeplake.load("hub://username/dataset")
>>> ds.rename("hub://username/renamed_dataset")

Parameters: path (str, pathlib.Path) – New path to the dataset.
Raises: RenameError – If path points to a different directory.

rename_group(name: str, new_name: str) → None

Renames the group named name to new_name.

Parameters

name (str) – Name of group to be renamed.
new_name (str) – New name of group.

Raises

TensorGroupDoesNotExistError – If tensor group of name name does not exist in the dataset.
TensorAlreadyExistsError – Duplicate tensors are not allowed.
TensorGroupAlreadyExistsError – Duplicate tensor groups are not allowed.
InvalidTensorGroupNameError – If name is in dataset attributes.
RenameError – If new_name points to a group different from name.

rename_tensor(name: str, new_name: str) → Tensor

Renames the tensor named name to new_name.

Parameters

name (str) – Name of tensor to be renamed.
new_name (str) – New name of tensor.

Returns

Renamed tensor.

Return type

Tensor

Raises

TensorDoesNotExistError – If tensor of name name does not exist in the dataset.
TensorAlreadyExistsError – Duplicate tensors are not allowed.
TensorGroupAlreadyExistsError – Duplicate tensor groups are not allowed.
InvalidTensorNameError – If new_name is in dataset attributes.
RenameError – If new_name points to a group different from name.
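
The RenameError condition for rename_group and rename_tensor amounts to requiring that the old and new names share the same parent group. A minimal sketch of that check, assuming group paths are separated by "/": check_same_group is a hypothetical helper, and ValueError stands in for RenameError.

```python
import posixpath


def check_same_group(name: str, new_name: str) -> None:
    """Raise if renaming would move the tensor or group to another parent."""
    if posixpath.dirname(name) != posixpath.dirname(new_name):
        raise ValueError(  # stands in for RenameError
            f"cannot rename {name!r} to {new_name!r}: parent groups differ"
        )


check_same_group("images/train", "images/training")  # same parent, accepted
check_same_group("labels", "targets")  # both at the root, accepted
```

Under this rule, renaming "images/train" to "labels/train" would be rejected because it is a move between groups rather than a rename.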

reset(force: bool = False): Resets the uncommitted changes present in the branch.

Note

The uncommitted data is deleted from the underlying storage. This operation is not reversible.

property root: Returns the root dataset of a group.

sample_by(weights: Union[str, list, tuple], replace: Optional[bool] = True, size: Optional[int] = None)

Returns a sliced Dataset with given weighted sampler applied.

Parameters

weights (Union[str, list, tuple]) – If a string, it is run as a TQL expression to calculate the weights. A list or tuple is treated as the list of weights per sample.
replace (Optional[bool]) – If True, samples can be repeated in the resulting view. Defaults to True.
size (Optional[int]) – The length of the resulting view. Defaults to the length of the dataset.

Returns

A deeplake.Dataset object.

Return type

Dataset

Examples

Sample the dataset with labels == 5 weighted twice as heavily as labels == 6.

>>> import deeplake
>>> ds = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> sampled_ds = ds.sample_by("max_weight(labels == 5: 10, labels == 6: 5)")

Sample the dataset treating labels tensor as weights.

>>> import deeplake
>>> ds = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> sampled_ds = ds.sample_by("labels")

Sample the dataset with the given weights.

>>> ds = deeplake.load('hub://activeloop/coco-train')
>>> weights = list()
>>> for i in range(len(ds)):
...     weights.append(i % 5)
...
>>> sampled_ds = ds.sample_by(weights, replace=False)
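
What a per-sample weight list means can be illustrated with the standard library. The sketch below draws indices with replacement in proportion to the weights, which corresponds to replace=True and the default size. It is not Deep Lake's sampler, weighted_sample_indices is a hypothetical helper, and the replace=False case is not modelled.

```python
import random
from typing import List, Optional, Sequence


def weighted_sample_indices(
    weights: Sequence[float], size: Optional[int] = None, seed: Optional[int] = None
) -> List[int]:
    """Draw sample indices with replacement, proportionally to `weights`."""
    if size is None:
        size = len(weights)  # documented default: length of the dataset
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=size)


weights = [i % 5 for i in range(20)]  # same pattern as the example above
indices = weighted_sample_indices(weights, seed=0)
```

Indices whose weight is 0 never appear in the result, and indices with weight 4 are drawn about four times as often as those with weight 1.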

property sample_indices: Returns all the indices pointed to by this dataset view.

save_view(message: Optional[str] = None, path: Optional[Union[str, Path]] = None, id: Optional[str] = None, optimize: bool = False, tensors: Optional[List[str]] = None, num_workers: int = 0, scheduler: str = 'threaded', verbose: bool = True, ignore_errors: bool = False, **ds_args) → str

Saves a dataset view as a virtual dataset (VDS)

Examples

>>> # Save to specified path
>>> vds_path = ds[:10].save_view(path="views/first_10", id="first_10")
>>> vds_path
views/first_10

>>> # Path unspecified
>>> vds_path = ds[:100].save_view(id="first_100", message="first 100 samples")
>>> # vds_path = path/to/dataset

>>> # Random id
>>> vds_path = ds[:100].save_view()
>>> # vds_path = path/to/dataset/.queries/92f41922ed0471ec2d27690b7351fc96bea060e6c5ee22b14f7ffa5f291aa068

See Dataset.get_view() to learn how to load views by id. These virtual datasets can also be loaded from their path like normal datasets.

Parameters

message (Optional, str) – Custom user message.
path (Optional, str, pathlib.Path) –
- The VDS will be saved as a standalone dataset at the specified path.
- If not specified, the VDS is saved under .queries subdirectory of the source dataset’s storage.
id (Optional, str) – Unique id for this view. Random id will be generated if not specified.
optimize (bool) –
- If True, the dataset view will be optimized by copying and rechunking the required data. This is necessary to achieve fast streaming speeds when training models using the dataset view. The optimization process will take some time, depending on the size of the data.
- You can also choose to optimize the saved view later by calling its ViewEntry.optimize() method.
tensors (List, optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.
num_workers (int) – Number of workers to be used for optimization process. Applicable only if optimize=True. Defaults to 0.
scheduler (str) – The scheduler to be used for optimization. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Only applicable if optimize=True. Defaults to ‘threaded’.
verbose (bool) – If True, logs will be printed. Defaults to True.
ignore_errors (bool) – Skip samples that cause errors while saving views. Only applicable if optimize=True. Defaults to False.
ds_args (dict) – Additional args for creating VDS when path is specified. (See documentation for deeplake.dataset())

Returns

Path to the saved VDS.

Return type

str

Raises

ReadOnlyModeError – When attempting to save a view inplace and the user doesn’t have write access.
DatasetViewSavingError – If HEAD node has uncommitted changes.
TypeError – If id is not of type str.

Note

Specifying path makes the view external. External views cannot be accessed using the parent dataset’s Dataset.get_view(), Dataset.load_view(), Dataset.delete_view() methods. They have to be loaded using deeplake.load().

set_token(new_token: str): Method to set a new token

size_approx(): Estimates the size in bytes of the dataset. Includes only content, so will generally return an under-estimate.

summary(force: bool = False)

Prints a summary of the dataset.

Parameters: force (bool) – Dataset views with more than 10000 samples might take a long time to summarize. If force=True, the summary will be printed regardless. An error will be raised otherwise.
Raises: ValueError – If the dataset view might take a long time to summarize and force=False

tensorflow(tensors: Optional[Sequence[str]] = None, tobytes: Union[bool, Sequence[str]] = False, fetch_chunks: bool = True)

Converts the dataset into a tensorflow compatible format.

See https://www.tensorflow.org/api_docs/python/tf/data/Dataset

Parameters

tensors (List, Optional) – Optionally provide a list of tensor names in the ordering that your training script expects. For example, if your dataset has “image” and “label” tensors and tensors=["image", "label"], your training script should expect each batch to be provided as a tuple of (image, label).
tobytes (bool, Sequence[str]) – If True, samples will not be decompressed and their raw bytes will be returned instead of numpy arrays. Can also be a list of tensors, in which case those tensors alone will not be decompressed.
fetch_chunks (bool) – See the fetch_chunks argument in deeplake.core.tensor.Tensor.numpy().

Returns

tf.data.Dataset object that can be used for tensorflow training.

property tensors: Dict[str, Tensor]: All tensors belonging to this group, including those within sub groups. Always returns the sliced tensors.

property token: Get attached token of the dataset

update(sample: Dict[str, Any])

Update existing samples in the dataset with new values.

Examples

>>> ds[0].update({"images": deeplake.read("new_image.png"), "labels": 1})

>>> new_images = [deeplake.read(f"new_image_{i}.png") for i in range(3)]
>>> ds[:3].update({"images": new_images, "labels": [1, 2, 3]})

Parameters

sample (dict) – Dictionary with tensor names as keys and samples as values.

Raises

ValueError – If partial update of a sample is attempted.
Exception – Error while attempting to rollback updates.
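
The rollback mentioned under Raises suggests an all-or-nothing update across tensors. The idea can be sketched on plain lists as below. This is an illustration of the pattern only, not Deep Lake's code. update_sample and the dict-of-lists layout are hypothetical.

```python
from typing import Any, Dict, List


def update_sample(
    tensors: Dict[str, List[Any]], index: int, sample: Dict[str, Any]
) -> None:
    """Update one row across several tensors, restoring old values on failure."""
    previous: Dict[str, Any] = {}
    try:
        for name, value in sample.items():
            old = tensors[name][index]  # KeyError/IndexError if target is missing
            tensors[name][index] = value
            previous[name] = old
    except Exception:
        # Roll back the tensors that were already written, then re-raise.
        for name, old in previous.items():
            tensors[name][index] = old
        raise


data = {"images": ["a", "b"], "labels": [0, 1]}
update_sample(data, 0, {"images": "z", "labels": 9})
```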

update_creds_key(creds_key: str, new_creds_key: Optional[str] = None, managed: Optional[bool] = None)

Updates the name and/or management status of a creds key.

Parameters

creds_key (str) – The key whose name and/or management status is to be changed.
new_creds_key (str, optional) – The new key to replace the old key. If not provided, the old key will be used.
managed (bool) – The target management status. If True, the creds corresponding to the key will be fetched from Activeloop platform.

Raises

ValueError – If the dataset is not connected to Activeloop platform.
ValueError – If both new_creds_key and managed are None.
KeyError – If the creds key is not present in the dataset.

Examples

>>> # create/load a dataset
>>> ds = deeplake.dataset("path/to/dataset")
>>> # add a new creds key
>>> ds.add_creds_key("my_s3_key")
>>> # Populate the name added with creds dictionary
>>> # These creds are only present temporarily and will have to be repopulated on every reload
>>> ds.populate_creds("my_s3_key", {})
>>> # Rename the key and change the management status of the key to True. Before doing this, ensure that the creds have been created on Activeloop platform
>>> # Now, this key will no longer use the credentials populated in the previous step but will instead fetch them from Activeloop platform
>>> # These creds don't have to be populated again on every reload and will be fetched every time the dataset is loaded
>>> ds.update_creds_key("my_s3_key", "my_managed_key", True)
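
The argument rules listed under Raises can be sketched on a plain dictionary that maps each creds key to its managed flag. This is a hypothetical model of the bookkeeping only: rename_or_toggle_key is not a deeplake function, and the check for a connection to Activeloop platform is left out.

```python
from typing import Dict, Optional


def rename_or_toggle_key(
    keys: Dict[str, bool],
    creds_key: str,
    new_creds_key: Optional[str] = None,
    managed: Optional[bool] = None,
) -> None:
    """Rename a creds key and/or change its managed flag in `keys`."""
    if new_creds_key is None and managed is None:
        raise ValueError("provide new_creds_key, managed, or both")
    if creds_key not in keys:
        raise KeyError(creds_key)
    current = keys.pop(creds_key)
    # Keep the old name or flag when the corresponding argument is omitted.
    keys[new_creds_key or creds_key] = current if managed is None else managed


keys = {"my_s3_key": False}
rename_or_toggle_key(keys, "my_s3_key", "my_managed_key", True)
```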

visualize(width: Optional[Union[int, str]] = None, height: Optional[Union[int, str]] = None)

Visualizes the dataset in the Jupyter notebook.

Parameters

width (Union[int, str, None]) – Optional width of the visualizer canvas.
height (Union[int, str, None]) – Optional height of the visualizer canvas.

Raises

Exception – If the dataset is not a Deep Lake cloud dataset and the visualization is attempted in Colab.

DeepLakeCloudDataset

class deeplake.core.dataset.DeepLakeCloudDataset

Bases: Dataset

Subclass of Dataset. Deep Lake cloud datasets are datasets that are stored on or connected to Activeloop servers. Their paths look like hub://username/dataset_name.

add_creds_key(creds_key: str, managed: bool = False)

Adds a new creds key to the dataset. These keys are used for tensors that are linked to external data.

Examples

>>> # create/load a dataset
>>> ds = deeplake.dataset("hub://username/dataset")
>>> # add a new creds key
>>> ds.add_creds_key("my_s3_key")

Parameters

creds_key (str) – The key to be added.
managed (bool) – If True, the creds corresponding to the key will be fetched from Activeloop platform. Note, this is only applicable for datasets that are connected to Activeloop platform. Defaults to False.

property client: Returns the client of the dataset.

connect(*args, **kwargs)

Connect a Deep Lake cloud dataset through a deeplake path.

Examples

>>> # create/load an s3 dataset
>>> s3_ds = deeplake.dataset("s3://bucket/dataset")
>>> ds = s3_ds.connect(dest_path="hub://my_org/dataset", creds_key="my_managed_credentials_key", token="my_activeloop_token")
>>> # or
>>> ds = s3_ds.connect(org_id="my_org", creds_key="my_managed_credentials_key", token="my_activeloop_token")

Parameters

creds_key (str) – The managed credentials to be used for accessing the source path.
dest_path (str, optional) – The full path where the connected Deep Lake dataset will reside. It can be a Deep Lake path like hub://organization/dataset.
org_id (str, optional) – The organization to which the connected Deep Lake dataset will be added.
ds_name (str, optional) – The name of the connected Deep Lake dataset. Will be inferred from dest_path or src_path if not provided.
token (str, optional) – Activeloop token used to fetch the managed credentials.

Raises

InvalidSourcePathError – If the dataset’s path is not a valid s3, gcs or azure path.
InvalidDestinationPathError – If dest_path, or org_id and ds_name do not form a valid Deep Lake path.
TokenPermissionError – If the user does not have permission to create a dataset in the specified organization.
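
How dest_path, org_id and ds_name combine into a destination can be sketched as follows. This is a guess at the resolution order based only on the parameter descriptions above. resolve_dest_path is a hypothetical helper, and ValueError stands in for InvalidDestinationPathError.

```python
from typing import Optional


def resolve_dest_path(
    src_path: str,
    dest_path: Optional[str] = None,
    org_id: Optional[str] = None,
    ds_name: Optional[str] = None,
) -> str:
    """Return the hub:// path the connected dataset would live at."""
    if dest_path is not None:
        # Expect exactly hub://organization/dataset.
        if not dest_path.startswith("hub://") or dest_path.count("/") != 3:
            raise ValueError(f"not a hub://organization/dataset path: {dest_path}")
        return dest_path
    if org_id is None:
        raise ValueError("either dest_path or org_id is required")
    if ds_name is None:
        # Fall back to the last component of the source path.
        ds_name = src_path.rstrip("/").rsplit("/", 1)[-1]
    return f"hub://{org_id}/{ds_name}"


print(resolve_dest_path("s3://bucket/dataset", org_id="my_org"))
```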

delete(large_ok=False)

Deletes the entire dataset from the cache layers (if any) and the underlying storage. This is an IRREVERSIBLE operation. Data once deleted cannot be recovered.

Parameters

large_ok (bool) – Delete datasets larger than 1 GB. Defaults to False.

Raises

DatasetTooLargeToDelete – If the dataset is larger than 1 GB and large_ok is False.
DatasetHandlerError – If the dataset is marked as allow_delete=False.

get_managed_creds_keys() → Set[str]: Returns the set of creds keys added to the dataset that are managed by Activeloop platform. These are used to fetch external data in linked tensors.

property is_actually_cloud: bool: Datasets that are connected to Deep Lake cloud can still technically be stored anywhere. If a dataset is in Deep Lake cloud but stored without hub:// prefix, it should only be used for testing.

rename(path)

Renames the dataset to path.

Example

>>> ds = deeplake.load("hub://username/dataset")
>>> ds.rename("hub://username/renamed_dataset")

Parameters: path (str, pathlib.Path) – New path to the dataset.
Raises: RenameError – If path points to a different directory.

property token: Get attached token of the dataset

update_creds_key(creds_key: str, new_creds_key: Optional[str] = None, managed: Optional[bool] = None)

Updates the name and/or management status of a creds key.

Parameters

creds_key (str) – The key whose name and/or management status is to be changed.
new_creds_key (str, optional) – The new key to replace the old key. If not provided, the old key will be used.
managed (bool) – The target management status. If True, the creds corresponding to the key will be fetched from Activeloop platform.

Raises

ValueError – If the dataset is not connected to Activeloop platform.
ValueError – If both new_creds_key and managed are None.
KeyError – If the creds key is not present in the dataset.
Exception – All other errors such as during population of managed creds.

Examples

>>> # create/load a dataset
>>> ds = deeplake.dataset("path/to/dataset")
>>> # add a new creds key
>>> ds.add_creds_key("my_s3_key")
>>> # Populate the name added with creds dictionary
>>> # These creds are only present temporarily and will have to be repopulated on every reload
>>> ds.populate_creds("my_s3_key", {})
>>> # Rename the key and change the management status of the key to True. Before doing this, ensure that the creds have been created on Activeloop platform
>>> # Now, this key will no longer use the credentials populated in the previous step but will instead fetch them from Activeloop platform
>>> # These creds don't have to be populated again on every reload and will be fetched every time the dataset is loaded
>>> ds.update_creds_key("my_s3_key", "my_managed_key", True)

visualize(width: Optional[Union[int, str]] = None, height: Optional[Union[int, str]] = None)

Visualizes the dataset in the Jupyter notebook.

Parameters

width (Union[int, str, None]) – Optional width of the visualizer canvas.
height (Union[int, str, None]) – Optional height of the visualizer canvas.

Raises

Exception – If the dataset is not a Deep Lake cloud dataset and the visualization is attempted in Colab.

ViewEntry

class deeplake.core.dataset.ViewEntry

Represents a view saved inside a dataset.

delete(): Deletes the view.

property id: str: Returns id of the view.

load(verbose=True)

Loads the view and returns the Dataset.

Parameters: verbose (bool) – If True, logs will be printed. Defaults to True.
Returns: Loaded dataset view.
Return type: Dataset

property message: str: Returns the message with which the view was saved.

optimize(tensors: Optional[List[str]] = None, unlink=True, num_workers=0, scheduler='threaded', progressbar=True)

Optimizes the dataset view by copying and rechunking the required data. This is necessary to achieve fast streaming speeds when training models using the dataset view. The optimization process will take some time, depending on the size of the data.

Example

>>> # save view
>>> ds[:10].save_view(id="first_10")
>>> # optimize view
>>> ds.get_view("first_10").optimize()
>>> # load optimized view
>>> ds.load_view("first_10")

Parameters

tensors (List[str]) – Tensors required in the optimized view. By default all tensors are copied.
unlink (bool) –
- If True, this unlinks linked tensors (if any) by copying data from the links to the view.
- This does not apply to linked videos. Set deeplake.constants._UNLINK_VIDEOS to True to change this behavior.
num_workers (int) – Number of workers to be used for the optimization process. Defaults to 0.
scheduler (str) – The scheduler to be used for optimization. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.
progressbar (bool) – Whether to display a progressbar.

Returns

ViewEntry

Raises

Exception – When query view cannot be optimized.