hub.core.dataset
Dataset
- class hub.core.dataset.Dataset
- add_creds_key(creds_key: str, managed: bool = False)
Adds a new creds key to the dataset. These keys are used for tensors that are linked to external data.
Examples
>>> # create/load a dataset
>>> ds = hub.empty("path/to/dataset")
>>> # add a new creds key
>>> ds.add_creds_key("my_s3_key")
- Parameters
creds_key (str) – The key to be added.
managed (bool) – If True, the creds corresponding to the key will be fetched from Activeloop platform. Defaults to False.
- Raises
ValueError – If the dataset is not connected to Activeloop platform and managed is True.
Note
The managed parameter is applicable only for datasets that are connected to Activeloop platform.
- append(sample: Dict[str, Any], skip_ok: bool = False, append_empty: bool = False)
Appends samples to multiple tensors at once. This method expects all tensors being updated to be of the same length.
- Parameters
sample (dict) – Dictionary with tensor names as keys and samples as values.
skip_ok (bool) – Skip tensors not in sample if set to True.
append_empty (bool) – Append empty samples to tensors not specified in sample if set to True. If True, skip_ok is ignored.
- Raises
KeyError – If any tensor in the dataset is not a key in sample and skip_ok is False.
TensorDoesNotExistError – If a tensor in sample does not exist.
ValueError – If the tensors being updated are not all of the same length.
NotImplementedError – If an error occurs while writing tiles.
Exception – Error while attempting to rollback appends.
SampleAppendingError – Error that occurs when someone tries to append a tensor value directly to the dataset without specifying tensor name.
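The key and length checks listed above can be illustrated with plain Python lists standing in for tensors. This is only a sketch under stated assumptions: `append_row` is a hypothetical helper and not part of Hub, it raises `KeyError` where Hub would raise `TensorDoesNotExistError`, it uses `None` as the "empty sample", and it leaves out rollback and tiling entirely.

```python
from typing import Any, Dict, List


def append_row(
    tensors: Dict[str, List[Any]],
    sample: Dict[str, Any],
    skip_ok: bool = False,
    append_empty: bool = False,
) -> None:
    """Append one row to several list-backed 'tensors' at once (illustrative only)."""
    # Every key in the sample must name an existing tensor.
    for name in sample:
        if name not in tensors:
            raise KeyError(f"tensor {name!r} does not exist")

    # Tensors without a value are an error unless skip_ok or append_empty is set.
    missing = [name for name in tensors if name not in sample]
    if missing and not (skip_ok or append_empty):
        raise KeyError(f"sample has no value for tensors: {missing}")

    # All tensors that will receive a value must have the same length beforehand.
    # When append_empty is set, skip_ok is ignored and the missing tensors count too.
    updated = list(sample) + (missing if append_empty else [])
    if len({len(tensors[name]) for name in updated}) > 1:
        raise ValueError("tensors being updated are not of the same length")

    for name, value in sample.items():
        tensors[name].append(value)
    if append_empty:
        for name in missing:
            tensors[name].append(None)  # None stands in for an empty sample
```

Validating before mutating keeps the lists unchanged when a check fails, which is the effect the rollback mentioned above is meant to guarantee in the real method.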
Examples
>>> ds = hub.empty("../test/test_ds")
>>> ds.create_tensor('data')
Tensor(key='data')
>>> ds.create_tensor('labels')
Tensor(key='labels')
>>> ds.append({"data": [1, 2, 3, 4], "labels":[0, 1, 2, 3]})
- property branch: str
The current branch of the dataset
- property branches
Lists all the branches of the dataset.
- Returns
List of branches.
- change_creds_management(creds_key: str, managed: bool)
Changes the management status of the creds key.
- Parameters
creds_key (str) – The key whose management status is to be changed.
managed (bool) – The target management status. If True, the creds corresponding to the key will be fetched from Activeloop platform.
- Raises
ValueError – If the dataset is not connected to activeloop platform.
KeyError – If the creds key is not present in the dataset.
Examples
>>> # create/load a dataset
>>> ds = hub.dataset("path/to/dataset")
>>> # add a new creds key
>>> ds.add_creds_key("my_s3_key")
>>> # Populate the name added with creds dictionary
>>> # These creds are only present temporarily and will have to be repopulated on every reload
>>> ds.populate_creds("my_s3_key", {})
>>> # Change the management status of the key to True. Before doing this, ensure that the creds have been created on activeloop platform
>>> # Now, this key will no longer use the credentials populated in the previous step but will instead fetch them from activeloop platform
>>> # These creds don't have to be populated again on every reload and will be fetched every time the dataset is loaded
>>> ds.change_creds_management("my_s3_key", True)
- checkout(address: str, create: bool = False) Optional[str]
Checks out to a specific commit_id or branch. If create = True, creates a new branch with name address.
- Parameters
address (str) – The commit_id or branch to checkout to.
create (bool) – If True, creates a new branch named address.
- Returns
The commit_id of the dataset after checkout.
- Return type
Optional[str]
- Raises
Exception – If the dataset is a filtered view.
Examples
>>> ds = hub.empty("../test/test_ds")
>>> ds.create_tensor("abc")
Tensor(key='abc')
>>> ds.abc.append([1, 2, 3])
>>> first_commit = ds.commit()
>>> ds.checkout("alt", create=True)
'firstdbf9474d461a19e9333c2fd19b46115348f'
>>> ds.abc.append([4, 5, 6])
>>> ds.abc.numpy()
array([[1, 2, 3], [4, 5, 6]])
>>> ds.checkout(first_commit)
'firstdbf9474d461a19e9333c2fd19b46115348f'
>>> ds.abc.numpy()
array([[1, 2, 3]])
Note
Checkout from a head node in any branch that contains uncommitted data will lead to an automatic commit before the checkout.
- clear_cache()
Flushes (see Dataset.flush()) the contents of the cache layers (if any) and then deletes the contents of all of those layers. This doesn’t delete data from the actual storage.
This is useful if you have multiple datasets with memory caches open, taking up too much RAM.
Also useful when local cache is no longer needed for certain datasets and is taking up storage space.
- property client
Returns the client of the dataset.
- commit(message: Optional[str] = None, allow_empty=False) str
Stores a snapshot of the current state of the dataset.
- Parameters
message (str, Optional) – Used to describe the commit.
allow_empty (bool) – If True, commit even if there are no changes.
- Returns
the commit id of the saved commit that can be used to access the snapshot.
- Return type
str
- Raises
Exception – If dataset is a filtered view.
EmptyCommitError – If there are no changes and the user has not forced a commit of unchanged data.
Note
Committing from a non-head node in any branch will lead to an automatic checkout to a new branch.
This same behaviour will happen if new samples are added or existing samples are updated from a non-head node.
- property commit_id: Optional[str]
The last committed commit id of the dataset. If there are no commits, this returns None.
- property commits: List[Dict]
Lists all the commits leading to the current dataset state.
- Returns
List of dictionaries containing commit information.
- copy(dest: Union[str, Path], tensors: Optional[List[str]] = None, overwrite: bool = False, creds=None, token=None, num_workers: int = 0, scheduler='threaded', progressbar=True, public: bool = False)
Copies this dataset or dataset view to dest. Version control history is not included.
- Parameters
dest (str, pathlib.Path) – Destination dataset or path to copy to. If a Dataset instance is provided, it is expected to be empty.
tensors (List[str], optional) – Names of tensors (and groups) to be copied. If not specified all tensors are copied.
overwrite (bool) – If True and a dataset exists at destination, it will be overwritten. Defaults to False.
creds (dict, Optional) – creds required to create / overwrite datasets at dest.
token (str, Optional) – token used for fetching credentials to dest.
num_workers (int) – The number of workers to use for copying. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
scheduler (str) – The scheduler to be used for copying. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.
progressbar (bool) – Displays a progress bar if True (default).
public (bool) – Defines if the dataset will have public access. Applicable only if Hub cloud storage is used and a new Dataset is being created. Defaults to False.
- Returns
New dataset object.
- Return type
Dataset
- Raises
DatasetHandlerError – If a dataset already exists at destination path and overwrite is False.
- create_group(name: str, exist_ok=False) Dataset
Creates a tensor group. Intermediate groups in the path are also created.
- Parameters
name – The name of the group to create.
exist_ok – If True, the group is created if it does not exist. If False, an error is raised if the group already exists. Defaults to False.
- Returns
The created group.
- Raises
TensorGroupAlreadyExistsError – If the group already exists and exist_ok is False.
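The statement that intermediate groups in the path are also created can be pictured as expanding a slash-separated name into every prefix that has to exist. The sketch below is purely illustrative: `group_paths` is a hypothetical helper, and how Hub tracks groups internally is not shown here.

```python
from typing import List


def group_paths(name: str) -> List[str]:
    """Return every group path implied by `name`, outermost first."""
    parts = [part for part in name.split("/") if part]
    return ["/".join(parts[: i + 1]) for i in range(len(parts))]


print(group_paths("images/jpg/cats"))
# ['images', 'images/jpg', 'images/jpg/cats']
```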
Examples
>>> ds.create_group("images")
>>> ds['images'].create_tensor("cats")
>>> ds.create_group("images/jpg/cats")
>>> ds["images"].create_tensor("png")
>>> ds["images/jpg"].create_group("dogs")
- create_tensor(name: str, htype: str = 'unspecified', dtype: Union[str, dtype] = 'unspecified', sample_compression: str = 'unspecified', chunk_compression: str = 'unspecified', hidden: bool = False, create_sample_info_tensor: bool = True, create_shape_tensor: bool = True, create_id_tensor: bool = True, verify: bool = False, exist_ok: bool = False, **kwargs)
Creates a new tensor in the dataset.
Examples
>>> import numpy as np
>>> import hub
>>> # create dataset
>>> ds = hub.dataset("path/to/dataset")
>>> # create tensors
>>> ds.create_tensor("images", htype="image", sample_compression="jpg")
>>> ds.create_tensor("videos", htype="video", sample_compression="mp4")
>>> ds.create_tensor("data")
>>> ds.create_tensor("point_clouds", htype="point_cloud")
>>> # append data
>>> ds.images.append(np.ones((400, 400, 3), dtype='uint8'))
>>> ds.videos.append(hub.read("videos/sample_video.mp4"))
>>> ds.data.append(np.zeros((100, 100, 2)))
- Parameters
name (str) – The name of the tensor to be created.
htype (str) –
The class of data for the tensor.
The defaults for other parameters are determined in terms of this value.
For example, htype="image" would have dtype default to uint8.
These defaults can be overridden by explicitly passing any of the other parameters to this function.
May also modify the defaults for other parameters.
dtype (str) – Optionally override this tensor’s dtype. All subsequent samples are required to have this dtype.
sample_compression (str) – All samples will be compressed in the provided format. If None, samples are uncompressed.
chunk_compression (str) – All chunks will be compressed in the provided format. If None, chunks are uncompressed.
hidden (bool) – If True, the tensor will be hidden from ds.tensors but can still be accessed via ds[tensor_name].
create_sample_info_tensor (bool) – If True, metadata of individual samples will be saved in a hidden tensor. This data can be accessed via tensor[i].sample_info.
create_shape_tensor (bool) – If True, an associated tensor containing shapes of each sample will be created.
create_id_tensor (bool) – If True, an associated tensor containing unique ids for each sample will be created. This is useful for merge operations.
verify (bool) – Valid only for link htypes. If True, all links will be verified before they are added to the tensor.
exist_ok (bool) – If True, the tensor is created if it does not exist. If False, an error is raised if the tensor already exists.
**kwargs – htype defaults can be overridden by passing any of the compatible parameters. To see all htypes and their corresponding arguments, check out Htypes.
- Returns
The new tensor, which can be accessed by dataset[name] or dataset.name.
- Return type
Tensor
- Raises
TensorAlreadyExistsError – If the tensor already exists and exist_ok is False.
TensorGroupAlreadyExistsError – Duplicate tensor groups are not allowed.
InvalidTensorNameError – If name is in dataset attributes.
NotImplementedError – If trying to override chunk_compression.
TensorMetaInvalidHtype – If invalid htype is specified.
ValueError – If an illegal argument is specified.
- create_tensor_like(name: str, source: Tensor, unlink: bool = False) Tensor
Copies the source tensor’s meta information and creates a new tensor with it. No samples are copied, only the meta/info for the tensor is.
Examples
>>> ds.create_tensor_like("cats", ds["images"])
- delete(large_ok=False)
Deletes the entire dataset from the cache layers (if any) and the underlying storage. This is an IRREVERSIBLE operation. Data once deleted can not be recovered.
- Parameters
large_ok (bool) – Delete datasets larger than 1 GB. Defaults to False.
- delete_group(name: str, large_ok: bool = False)
Delete a tensor group from the dataset.
Examples
>>> ds.delete_group("images/dogs")
- Parameters
name (str) – The name of tensor group to be deleted.
large_ok (bool) – Delete tensor groups larger than 1 GB. Disabled by default.
- Returns
None
- Raises
TensorGroupDoesNotExistError – If no tensor group with the given name exists in the dataset.
- delete_tensor(name: str, large_ok: bool = False)
Delete a tensor from the dataset.
Examples
>>> ds.delete_tensor("images/cats")
- Parameters
name (str) – The name of tensor to be deleted.
large_ok (bool) – Delete tensors larger than 1 GB. Disabled by default.
- Returns
None
- Raises
TensorDoesNotExistError – If no tensor with the given name exists in the dataset.
- delete_view(id: str)
Deletes the view with given view id.
- Parameters
id (str) – Id of the view to delete.
- Raises
KeyError – if view with given id does not exist.
- diff(id_1: Optional[str] = None, id_2: Optional[str] = None, as_dict=False) Optional[Dict]
Returns/displays the differences between commits/branches.
For each tensor this contains information about the sample indexes that were added/modified as well as whether the tensor was created.
- Parameters
id_1 (str, Optional) – The first commit_id or branch name.
id_2 (str, Optional) – The second commit_id or branch name.
as_dict (bool, Optional) – If True, returns a dictionary of the differences instead of printing them. This dictionary will have two keys, “tensor” and “dataset”, which represent tensor-level and dataset-level changes, respectively. Defaults to False.
- Returns
Optional[Dict]
- Raises
ValueError – If id_1 is None and id_2 is not None.
Note
If both id_1 and id_2 are None, the differences between the current state and the previous commit will be calculated. If you’re at the head of the branch, this will show the uncommitted changes, if any.
If only id_1 is provided, the differences between the current state and id_1 will be calculated. If you’re at the head of the branch, this will take into account the uncommitted changes, if any.
If only id_2 is provided, a ValueError will be raised.
If both id_1 and id_2 are provided, the differences between id_1 and id_2 will be calculated.
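The four cases in this note amount to a small decision table over id_1 and id_2. The sketch below encodes only that table: `diff_targets` and the two placeholder strings are hypothetical names made up for illustration, and nothing here touches a real dataset.

```python
from typing import Optional, Tuple

# Placeholder labels for this sketch; they are not Hub identifiers.
CURRENT = "<current state>"
PREVIOUS = "<previous commit>"


def diff_targets(id_1: Optional[str] = None, id_2: Optional[str] = None) -> Tuple[str, str]:
    """Return the pair of states that would be compared for the given arguments."""
    if id_1 is None and id_2 is not None:
        raise ValueError("id_2 cannot be provided without id_1")
    if id_1 is None:
        return CURRENT, PREVIOUS  # neither id given
    if id_2 is None:
        return CURRENT, id_1      # only id_1 given
    return id_1, id_2             # both given
```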
Note
A dictionary of the differences between the commits/branches is returned if as_dict is True.
If id_1 and id_2 are None, a dictionary containing the differences between the current state and the previous commit will be returned.
If only id_1 is provided, a dictionary containing the differences in the current state and id_1 respectively will be returned.
If only id_2 is provided, a ValueError will be raised.
If both id_1 and id_2 are provided, a dictionary containing the differences in id_1 and id_2 respectively will be returned.
None is returned if as_dict is False.
Example of a dict returned:
>>> {
...     "image": {"data_added": [3, 6], "data_updated": {0, 2}, "created": False, "info_updated": False, "data_transformed_in_place": False},
...     "label": {"data_added": [0, 3], "data_updated": {}, "created": True, "info_updated": False, "data_transformed_in_place": False},
...     "other/stuff" : {"data_added": [3, 3], "data_updated": {1, 2}, "created": True, "info_updated": False, "data_transformed_in_place": False},
... }
Here, “data_added” is a range of sample indexes that were added to the tensor.
For example [3, 6] means that sample 3, 4 and 5 were added.
Another example [3, 3] means that no samples were added as the range is empty.
“data_updated” is a set of sample indexes that were updated.
For example {0, 2} means that sample 0 and 2 were updated.
“created” is a boolean that is True if the tensor was created.
“info_updated” is a boolean that is True if the info of the tensor was updated.
“data_transformed_in_place” is a boolean that is True if the data of the tensor was transformed in place.
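Since “data_added” is a half-open range rather than a list of indexes, it can help to expand it explicitly when reading such a dictionary. The following sketch works on a hand-written dictionary shaped like the example above; `added_indexes` is a hypothetical helper, not a Hub function.

```python
from typing import Any, Dict, List


def added_indexes(tensor_diff: Dict[str, Any]) -> List[int]:
    """Expand the [start, stop) range stored under "data_added" into indexes."""
    start, stop = tensor_diff["data_added"]
    return list(range(start, stop))


diff = {
    "image": {"data_added": [3, 6], "data_updated": {0, 2}, "created": False},
    "other/stuff": {"data_added": [3, 3], "data_updated": {1, 2}, "created": True},
}

for name, changes in diff.items():
    print(name, added_indexes(changes), sorted(changes["data_updated"]))
# image [3, 4, 5] [0, 2]
# other/stuff [] [1, 2]
```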
- extend(samples: Dict[str, Any], skip_ok: bool = False)
Appends multiple rows of samples to multiple tensors at once. This method expects all tensors being updated to be of the same length.
- Parameters
samples (Dict[str, Any]) – Dictionary with tensor names as keys and samples as values.
skip_ok (bool) – Skip tensors not in samples if set to True.
- Raises
KeyError – If any tensor in the dataset is not a key in samples and skip_ok is False.
TensorDoesNotExistError – If a tensor in samples does not exist.
ValueError – If the tensors being updated are not all of the same length.
NotImplementedError – If an error occurs while writing tiles.
Exception – Error while attempting to rollback appends.
- filter(function: Union[Callable, str], num_workers: int = 0, scheduler: str = 'threaded', progressbar: bool = True, save_result: bool = False, result_path: Optional[str] = None, result_ds_args: Optional[dict] = None)
Filters the dataset according to the filter function f(x: sample) -> bool.
- Parameters
function (Callable, str) – Filter function that takes a sample as argument and returns True or False depending on whether the sample should be included in the result. Also supports simplified expression evaluations. See hub.core.query.query.DatasetQuery for more details.
num_workers (int) – Level of parallelization of filter evaluations. 0 indicates in-place for-loop evaluation, multiprocessing is used otherwise.
scheduler (str) – Scheduler to use for multiprocessing evaluation. “threaded” is default.
progressbar (bool) – Display progress bar while filtering. True is default.
save_result (bool) – If True, result of the filter will be saved to a dataset asynchronously.
result_path (Optional, str) – Path to save the filter result. Only applicable if save_result is True.
result_ds_args (Optional, dict) – Additional args for result dataset. Only applicable if save_result is True.
- Returns
View of Dataset with elements that satisfy filter function.
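To make the callable form concrete, the sketch below mimics the idea with plain dictionaries: the predicate is evaluated once per sample and the indices of matching samples are kept, which is roughly what a view represents. The `filter_indices` helper and the sample layout are assumptions for illustration and are not how Hub implements filtering.

```python
from typing import Any, Callable, Dict, List, Sequence


def filter_indices(
    samples: Sequence[Dict[str, Any]],
    function: Callable[[Dict[str, Any]], bool],
) -> List[int]:
    """Return the indices of samples for which `function` returns True."""
    return [i for i, sample in enumerate(samples) if function(sample)]


samples = [{"labels": 2}, {"labels": 0}, {"labels": 2}]
print(filter_indices(samples, lambda sample: sample["labels"] == 2))  # [0, 2]
```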
Example
The following filters are identical and return a dataset view in which every sample has a label equal to 2.
>>> dataset.filter(lambda sample: sample.labels.numpy() == 2)
>>> dataset.filter('labels == 2')
- flush()
Necessary operation after writes if caches are being used. Writes all the dirty data from the cache layers (if any) to the underlying storage. Here, dirty data is data that has been changed or assigned but hasn’t yet been sent to the underlying storage.
- get_commit_details(commit_id) Dict
Get details of a particular commit.
- Parameters
commit_id (str) – commit id of the commit.
- Returns
Dictionary of details with keys - commit, author, time, message.
- Return type
Dict
- Raises
KeyError – If the given commit_id was not found in the dataset.
- get_creds_keys() List[str]
Returns the list of creds keys added to the dataset. These are used to fetch external data in linked tensors
- get_view(id: str) ViewEntry
Returns the dataset view corresponding to id.
Examples
>>> # save view
>>> ds[:100].save_view(id="first_100")
>>> # load view
>>> first_100 = ds.get_view("first_100").load()
>>> # 100
>>> print(len(first_100))
See Dataset.save_view() to learn more about saving views.
- Parameters
id (str) – id of required view.
- Returns
ViewEntry
- Raises
KeyError – If no such view exists.
- get_views(commit_id: Optional[str] = None) List[ViewEntry]
Returns list of views stored in this Dataset.
- property has_head_changes
Returns True if currently at head node and uncommitted changes are present.
- property info
Returns the information about the dataset.
- property is_view: bool
Returns True if this dataset is a view and False otherwise.
- load_view(id: str, optimize: Optional[bool] = False, num_workers: int = 0, scheduler: str = 'threaded', progressbar: Optional[bool] = True)
Loads the view and returns the Dataset by id. Equivalent to ds.get_view(id).load().
- Parameters
id (str) – id of the view to be loaded.
optimize (bool) – If True, the dataset view is optimized by copying and rechunking the required data before loading. This is necessary to achieve fast streaming speeds when training models using the dataset view. The optimization process will take some time, depending on the size of the data.
num_workers (int) – Number of workers to be used for the optimization process. Only applicable if optimize=True. Defaults to 0.
scheduler (str) – The scheduler to be used for optimization. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Only applicable if optimize=True. Defaults to ‘threaded’.
progressbar (bool) – Whether to use progressbar for optimization. Only applicable if optimize=True. Defaults to True.
- Returns
The loaded view.
- Return type
Dataset
- Raises
KeyError – if view with given id does not exist.
- log()
Displays the details of all the past commits.
- property max_len
Returns the length of the longest tensor in the dataset.
- merge(target_id: str, conflict_resolution: Optional[str] = None, delete_removed_tensors: bool = False, force: bool = False)
Merges the target_id into the current dataset.
- Parameters
target_id (str) – The commit_id or branch to merge.
conflict_resolution (str, Optional) –
The strategy to use to resolve merge conflicts.
Conflicts are scenarios where both the current dataset and the target id have made changes to the same sample/s since their common ancestor.
- Must be one of the following
None - this is the default value, will raise an exception if there are conflicts.
“ours” - during conflicts, values from the current dataset will be used.
“theirs” - during conflicts, values from target id will be used.
delete_removed_tensors (bool) – If True, deleted tensors will be deleted from the dataset.
force (bool) –
Forces merge.
force=True will have these effects in the following cases of merge conflicts:
If tensor is renamed on target but is missing from HEAD, renamed tensor will be registered as a new tensor on current branch.
If tensor is renamed on both target and current branch, tensor on target will be registered as a new tensor on current branch.
If tensor is renamed on target and a new tensor of the new name was created on the current branch, they will be merged.
- Raises
Exception – if dataset is a filtered view.
ValueError – If the conflict resolution strategy is not one of None, “ours”, or “theirs”.
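The three conflict_resolution values can be illustrated on per-sample changes held in plain dictionaries. This is a minimal sketch under stated assumptions: `merge_samples` is a hypothetical helper, `RuntimeError` stands in for whatever exception Hub raises on an unresolved conflict, and tensor renames, force, and delete_removed_tensors are not modelled.

```python
from typing import Any, Dict, Optional


def merge_samples(
    ours: Dict[int, Any],
    theirs: Dict[int, Any],
    conflict_resolution: Optional[str] = None,
) -> Dict[int, Any]:
    """Merge per-index changes made on two branches since their common ancestor."""
    if conflict_resolution not in (None, "ours", "theirs"):
        raise ValueError("conflict_resolution must be None, 'ours' or 'theirs'")

    # A conflict is an index changed on both sides.
    conflicts = sorted(set(ours) & set(theirs))
    if conflicts and conflict_resolution is None:
        # Stand-in exception type; chosen for this sketch only.
        raise RuntimeError(f"merge conflict at sample indexes {conflicts}")

    merged = dict(theirs)
    merged.update(ours)  # at this point "ours" wins every conflict
    if conflict_resolution == "theirs":
        for index in conflicts:
            merged[index] = theirs[index]
    return merged
```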
- property meta: DatasetMeta
Returns the metadata of the dataset.
- property min_len
Returns the length of the shortest tensor in the dataset.
- property num_samples: int
Returns the length of the smallest tensor. Ignores any applied indexing and returns the total length.
- property parent
Returns the parent of this group. Returns None if this is the root dataset.
- property pending_commit_id: str
The commit_id of the next commit that will be made to the dataset. If you’re not at the head of the current branch, this will be the same as the commit_id.
- pop(index: Optional[int] = None)
Removes a sample from all the tensors of the dataset. For any tensor if the index >= len(tensor), the sample won’t be popped from it.
- Parameters
index (int, Optional) – The index of the sample to be removed. If it is None, the index becomes the length of the longest tensor - 1.
- Raises
IndexError – If the index is out of range.
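The default index and the rule that shorter tensors are left alone can be sketched with lists of unequal length. `pop_row` is a hypothetical helper for illustration only; treating negative indexes as out of range is an assumption of this sketch, not documented Hub behaviour.

```python
from typing import Any, Dict, List, Optional


def pop_row(tensors: Dict[str, List[Any]], index: Optional[int] = None) -> None:
    """Remove the sample at `index` from every list-backed tensor that has it."""
    longest = max((len(samples) for samples in tensors.values()), default=0)
    if index is None:
        index = longest - 1  # default: last sample of the longest tensor
    if index < 0 or index >= longest:
        raise IndexError(f"index {index} is out of range")
    for samples in tensors.values():
        # Tensors with index >= len(tensor) are skipped, as described above.
        if index < len(samples):
            samples.pop(index)
```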
- populate_creds(creds_key: str, creds: dict)
Populates the creds key added in add_creds_key with the given creds. These creds are used to fetch the external data. This needs to be done every time the dataset is reloaded for datasets that contain links to external data.
Examples
>>> # create/load a dataset
>>> ds = hub.dataset("path/to/dataset")
>>> # add a new creds key
>>> ds.add_creds_key("my_s3_key")
>>> # populate the creds
>>> ds.populate_creds("my_s3_key", {"aws_access_key_id": "my_access_key", "aws_secret_access_key": "my_secret_key"})
- pytorch(transform: Optional[Callable] = None, tensors: Optional[Sequence[str]] = None, tobytes: Union[bool, Sequence[str]] = False, num_workers: int = 1, batch_size: int = 1, drop_last: bool = False, collate_fn: Optional[Callable] = None, pin_memory: bool = False, shuffle: bool = False, buffer_size: int = 2048, use_local_cache: bool = False, use_progress_bar: bool = False, return_index: bool = True, pad_tensors: bool = False)
Converts the dataset into a pytorch Dataloader.
- Parameters
transform (Callable, Optional) – Transformation function to be applied to each sample.
tensors (List, Optional) – Optionally provide a list of tensor names in the ordering that your training script expects. For example, if you have a dataset that has “image” and “label” tensors, if tensors=[“image”, “label”], your training script should expect each batch will be provided as a tuple of (image, label).
tobytes (bool) – If True, samples will not be decompressed and their raw bytes will be returned instead of numpy arrays. Can also be a list of tensors, in which case those tensors alone will not be decompressed.
num_workers (int) – The number of workers to use for fetching data in parallel.
batch_size (int) – Number of samples per batch to load. Default value is 1.
drop_last (bool) – Set to True to drop the last incomplete batch, if the dataset size is not divisible by the batch size. If False and the size of the dataset is not divisible by the batch size, then the last batch will be smaller. Default value is False. Read torch.utils.data.DataLoader docs for more details.
collate_fn (Callable, Optional) – Merges a list of samples to form a mini-batch of Tensor(s). Used when using batched loading from a map-style dataset. Read torch.utils.data.DataLoader docs for more details.
pin_memory (bool) – If True, the data loader will copy Tensors into CUDA pinned memory before returning them. Default value is False. Read torch.utils.data.DataLoader docs for more details.
shuffle (bool) – If True, the data loader will shuffle the data indices. Default value is False. Details about how hub shuffles data can be found at https://docs.activeloop.ai/how-hub-works/shuffling-in-ds.pytorch
buffer_size (int) – The size of the buffer used to shuffle the data in MBs. Defaults to 2048 MB. Increasing the buffer_size will increase the extent of shuffling.
use_local_cache (bool) – If True, the data loader will use a local cache to store data. This is useful when the dataset can fit on the machine and we don’t want to fetch the data multiple times for each iteration. Default value is False.
use_progress_bar (bool) – If True, tqdm will be wrapped around the returned dataloader. Default value is False.
return_index (bool) – If True, the returned dataloader will have a key “index” that contains the index of the sample(s) in the original dataset. Default value is True.
pad_tensors (bool) – If True, shorter tensors will be padded to the length of the longest tensor. Default value is False.
- Returns
A torch.utils.data.DataLoader object.
- Raises
EmptyTensorError – If one or more tensors being passed to pytorch are empty.
Note
PyTorch does not support the uint16, uint32 and uint64 dtypes. These are implicitly cast to int32, int64 and int64 respectively. This spins up its own workers to fetch data.
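Two of the points above lend themselves to a quick arithmetic check: how drop_last changes the number of batches, and which signed dtype each unsupported unsigned dtype is cast to. The sketch below uses only the standard library; `num_batches` and `UNSIGNED_CASTS` are illustrative names and not part of Hub or PyTorch.

```python
import math

# Casts stated in the note above for dtypes PyTorch does not support.
UNSIGNED_CASTS = {"uint16": "int32", "uint32": "int64", "uint64": "int64"}


def num_batches(num_samples: int, batch_size: int, drop_last: bool = False) -> int:
    """Number of batches a loader yields for the given sizes."""
    if drop_last:
        return num_samples // batch_size  # the incomplete final batch is dropped
    return math.ceil(num_samples / batch_size)  # the final batch may be smaller


print(num_batches(10, 4, drop_last=False))  # 3
print(num_batches(10, 4, drop_last=True))   # 2
```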
- rechunk(tensors: Optional[Union[str, List[str]]] = None, num_workers: int = 0, scheduler: str = 'threaded', progressbar: bool = True)
Rewrites the underlying chunks to make their sizes optimal. This is usually needed in cases where a lot of updates have been made to the data.
- Parameters
tensors (str, List[str], Optional) – Name/names of the tensors to rechunk. If None, all tensors in the dataset are rechunked.
num_workers (int) – The number of workers to use for rechunking. Defaults to 0. When set to 0, it will always use serial processing, irrespective of the scheduler.
scheduler (str) – The scheduler to be used for rechunking. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Defaults to ‘threaded’.
progressbar (bool) – Displays a progress bar if True (default).
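The interaction between num_workers and scheduler described above (zero workers always means serial processing) can be written as a one-line rule. The sketch below is an assumption-laden illustration: `effective_scheduler` is a hypothetical helper, and raising `ValueError` for an unknown scheduler is a choice made here, not documented behaviour.

```python
SUPPORTED_SCHEDULERS = ("serial", "threaded", "processed", "ray")


def effective_scheduler(num_workers: int, scheduler: str = "threaded") -> str:
    """Return the scheduler that would actually be used for the given settings."""
    if scheduler not in SUPPORTED_SCHEDULERS:
        # Assumption for this sketch: unknown names are rejected up front.
        raise ValueError(f"unsupported scheduler: {scheduler!r}")
    # With zero workers, processing is serial irrespective of the scheduler.
    return "serial" if num_workers == 0 else scheduler
```

The same rule is stated for the copy method's num_workers parameter, so the sketch applies there as well.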
- rename(path: Union[str, Path])
Renames the dataset to path.
Example
>>> ds = hub.load("hub://username/dataset")
>>> ds.rename("hub://username/renamed_dataset")
- Parameters
path (str, pathlib.Path) – New path to the dataset.
- Raises
RenameError – If path points to a different directory.
- rename_group(name: str, new_name: str) None
Renames the group called name to new_name.
- Parameters
name (str) – Name of group to be renamed.
new_name (str) – New name of group.
- Raises
TensorGroupDoesNotExistError – If tensor group of name name does not exist in the dataset.
TensorAlreadyExistsError – Duplicate tensors are not allowed.
TensorGroupAlreadyExistsError – Duplicate tensor groups are not allowed.
InvalidTensorGroupNameError – If name is in dataset attributes.
RenameError – If new_name points to a group different from name.
- rename_tensor(name: str, new_name: str) Tensor
Renames the tensor called name to new_name.
- Parameters
name (str) – Name of tensor to be renamed.
new_name (str) – New name of tensor.
- Returns
Renamed tensor.
- Return type
Tensor
- Raises
TensorDoesNotExistError – If tensor of name name does not exist in the dataset.
TensorAlreadyExistsError – Duplicate tensors are not allowed.
TensorGroupAlreadyExistsError – Duplicate tensor groups are not allowed.
InvalidTensorNameError – If new_name is in dataset attributes.
RenameError – If new_name points to a group different from name.
- reset()
Resets the uncommitted changes present in the branch.
Note
The uncommitted data is deleted from underlying storage, this is not a reversible operation.
- property root
Returns the root dataset of a group.
- property sample_indices
Returns all the indices pointed to by this dataset view.
- save_view(message: Optional[str] = None, path: Optional[Union[str, Path]] = None, id: Optional[str] = None, optimize: bool = False, num_workers: int = 0, scheduler: str = 'threaded', verbose: bool = True, **ds_args) str
Saves a dataset view as a virtual dataset (VDS)
Examples
>>> # Save to specified path
>>> vds_path = ds[:10].save_view(path="views/first_10", id="first_10")
>>> vds_path
views/first_10
>>> # Path unspecified
>>> vds_path = ds[:100].save_view(id="first_100", message="first 100 samples")
>>> # vds_path = path/to/dataset
>>> # Random id
>>> vds_path = ds[:100].save_view()
>>> # vds_path = path/to/dataset/.queries/92f41922ed0471ec2d27690b7351fc96bea060e6c5ee22b14f7ffa5f291aa068
See Dataset.get_view() to learn how to load views by id. These virtual datasets can also be loaded from their path like normal datasets.
- Parameters
message (Optional, str) – Custom user message.
path (Optional, str, pathlib.Path) –
The VDS will be saved as a standalone dataset at the specified path.
If not specified, the VDS is saved under the .queries subdirectory of the source dataset’s storage.
If the user doesn’t have write access to the source dataset and the source dataset is a hub cloud dataset, then the VDS is saved under the user’s hub account and can be accessed using hub.load(f"hub://{username}/queries/{query_hash}").
id (Optional, str) – Unique id for this view. Random id will be generated if not specified.
optimize (bool) –
If True, the dataset view will be optimized by copying and rechunking the required data. This is necessary to achieve fast streaming speeds when training models using the dataset view. The optimization process will take some time, depending on the size of the data.
You can also choose to optimize the saved view later by calling its ViewEntry.optimize() method.
num_workers (int) – Number of workers to be used for optimization process. Applicable only if optimize=True. Defaults to 0.
scheduler (str) – The scheduler to be used for optimization. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Only applicable if optimize=True. Defaults to ‘threaded’.
verbose (bool) – If True, logs will be printed. Defaults to True.
ds_args (dict) – Additional args for creating VDS when path is specified. (See documentation for hub.dataset())
- Returns
Path to the saved VDS.
- Return type
str
- Raises
ReadOnlyModeError – When attempting to save a view inplace and the user doesn’t have write access.
DatasetViewSavingError – If HEAD node has uncommitted changes.
Note
Specifying path makes the view external. External views cannot be accessed using the parent dataset’s Dataset.get_view(), Dataset.load_view(), Dataset.delete_view() methods. They have to be loaded using hub.load().
- size_approx()
Estimates the size in bytes of the dataset. Includes only content, so will generally return an under-estimate.
- summary()
Prints a summary of the dataset.
- tensorflow(tensors: Optional[Sequence[str]] = None, tobytes: Union[bool, Sequence[str]] = False)
Converts the dataset into a tensorflow compatible format.
See https://www.tensorflow.org/api_docs/python/tf/data/Dataset
- Parameters
tensors (List, Optional) – Optionally provide a list of tensor names in the ordering that your training script expects. For example, if you have a dataset that has “image” and “label” tensors, if tensors=["image", "label"], your training script should expect each batch will be provided as a tuple of (image, label).
tobytes (bool) – If True, samples will not be decompressed and their raw bytes will be returned instead of numpy arrays. Can also be a list of tensors, in which case those tensors alone will not be decompressed.
- Returns
tf.data.Dataset object that can be used for tensorflow training.
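The ordering behaviour of the tensors parameter can be sketched as follows. The helper to_tensorflow is hypothetical and not part of hub; the tensor names "image" and "label" are the placeholders used in the parameter description, and ds is assumed to be a loaded dataset exposing the tensors property and the tensorflow() method documented here.

```python
def to_tensorflow(ds, order=("image", "label")):
    """Return ds.tensorflow() with tensors in the requested order.

    Illustrative helper only; checks the names against ds.tensors first
    so a misspelled tensor name fails early with a clear message.
    """
    missing = [name for name in order if name not in ds.tensors]
    if missing:
        raise KeyError(f"Tensors not found in dataset: {missing}")
    # Per the parameter description, batches are expected to follow
    # the order given here, e.g. (image, label).
    return ds.tensorflow(tensors=list(order))
```

Passing order=("label", "image") instead would be expected to swap the positions within each tuple.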
- property tensors: Dict[str, Tensor]
All tensors belonging to this group, including those within sub groups. Always returns the sliced tensors.
- property token
Get attached token of the dataset
- update_creds_key(old_creds_key: str, new_creds_key: str)
Replaces the old creds key with the new creds key. This is used to replace the creds key used for external data.
- visualize(width: Optional[Union[int, str]] = None, height: Optional[Union[int, str]] = None)
Visualizes the dataset in the Jupyter notebook.
- Parameters
width (Union[int, str, None]) – Optional width of the visualizer canvas.
height (Union[int, str, None]) – Optional height of the visualizer canvas.
- Raises
Exception – If the dataset is not a hub cloud dataset and the visualization is attempted in colab.
HubCloudDataset
- class hub.core.dataset.HubCloudDataset
Bases: Dataset
Subclass of Dataset. Hub cloud datasets are datasets stored on Activeloop servers; their paths look like hub://username/dataset_name.
- add_creds_key(creds_key: str, managed: bool = False)
Adds a new creds key to the dataset. These keys are used for tensors that are linked to external data.
Examples
>>> # create/load a dataset
>>> ds = hub.dataset("hub://username/dataset")
>>> # add a new creds key
>>> ds.add_creds_key("my_s3_key")
- Parameters
creds_key (str) – The key to be added.
managed (bool) – If True, the creds corresponding to the key will be fetched from Activeloop platform. Note, this is only applicable for datasets that are connected to Activeloop platform. Defaults to False.
- change_creds_management(creds_key: str, managed: bool)
Changes the management status of the creds key.
- Parameters
creds_key (str) – The key whose management status is to be changed.
managed (bool) – The target management status. If True, the creds corresponding to the key will be fetched from Activeloop platform.
- Raises
ValueError – If the dataset is not connected to activeloop platform.
KeyError – If the creds key is not present in the dataset.
Examples
>>> # create/load a dataset
>>> ds = hub.dataset("hub://username/dataset")
>>> # add a new creds key
>>> ds.add_creds_key("my_s3_key")
>>> # Populate the name added with creds dictionary
>>> # These creds are only present temporarily and will have to be repopulated on every reload
>>> ds.populate_creds("my_s3_key", {})
>>> # Change the management status of the key to True. Before doing this, ensure that the creds have been created on activeloop platform
>>> # Now, this key will no longer use the credentials populated in the previous step but will instead fetch them from activeloop platform
>>> # These creds don't have to be populated again on every reload and will be fetched every time the dataset is loaded
>>> ds.change_creds_management("my_s3_key", True)
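The documented KeyError can be used to handle the case where the key has not been added yet. The following is a minimal sketch; make_key_managed is a hypothetical helper, not part of hub, and it assumes ds exposes add_creds_key and change_creds_management with the signatures shown above. The documented ValueError (dataset not connected to Activeloop platform) is deliberately left to propagate.

```python
def make_key_managed(ds, creds_key):
    """Ensure `creds_key` exists on `ds` and is managed by the platform.

    Returns True if the key already existed, False if it had to be added.
    Illustrative helper only; it is not part of hub.
    """
    try:
        ds.change_creds_management(creds_key, True)
        return True
    except KeyError:
        # Documented behaviour: KeyError means the key is not present in
        # the dataset, so register it directly as a managed key.
        ds.add_creds_key(creds_key, managed=True)
        return False
```

As the example above notes, the creds should already exist on the platform before a key is switched to managed.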
- property client
Returns the client of the dataset.
- delete(large_ok=False)
Deletes the entire dataset from the cache layers (if any) and the underlying storage. This is an IRREVERSIBLE operation. Data once deleted can not be recovered.
- Parameters
large_ok (bool) – Delete datasets larger than 1 GB. Defaults to False.
- property is_actually_cloud: bool
Datasets that are connected to hub cloud can still technically be stored anywhere. If a dataset is hub cloud but stored without hub:// prefix, it should only be used for testing.
- rename(path)
Renames the dataset to path.
Example
>>> ds = hub.load("hub://username/dataset")
>>> ds.rename("hub://username/renamed_dataset")
- Parameters
path (str, pathlib.Path) – New path to the dataset.
- Raises
RenameError – If path points to a different directory.
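Because rename raises RenameError when the new path points to a different directory, a caller may want to compare the parent locations beforehand. The sketch below is a hypothetical client-side check, not part of hub; it assumes forward-slash separated paths, and the library's own validation may differ.

```python
import posixpath


def same_parent(old_path, new_path):
    """Return True if both paths share the same parent location.

    Trailing slashes are ignored so "a/b/" and "a/b" compare equal.
    """
    old_parent = posixpath.dirname(str(old_path).rstrip("/"))
    new_parent = posixpath.dirname(str(new_path).rstrip("/"))
    return old_parent == new_parent
```

For the example above, same_parent("hub://username/dataset", "hub://username/renamed_dataset") holds, so the rename stays within the same directory.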
- property token
Get attached token of the dataset
- update_creds_key(old_creds_key: str, new_creds_key: str)
Replaces the old creds key with the new creds key. This is used to replace the creds key used for external data.
- visualize(width: Optional[Union[int, str]] = None, height: Optional[Union[int, str]] = None)
Visualizes the dataset in the Jupyter notebook.
- Parameters
width (Union[int, str, None]) – Optional width of the visualizer canvas.
height (Union[int, str, None]) – Optional height of the visualizer canvas.
- Raises
Exception – If the dataset is not a hub cloud dataset and the visualization is attempted in colab.
ViewEntry
- class hub.core.dataset.ViewEntry
Represents a view saved inside a dataset.
- delete()
Deletes the view.
- property id: str
Returns id of the view.
- load(verbose=True)
Loads the view and returns the Dataset.
- Parameters
verbose (bool) – If True, logs will be printed. Defaults to True.
- Returns
Loaded dataset view.
- Return type
Dataset
- property message: str
Returns the message with which the view was saved.
- optimize(unlink=True, num_workers=0, scheduler='threaded', progressbar=True)
Optimizes the dataset view by copying and rechunking the required data. This is necessary to achieve fast streaming speeds when training models using the dataset view. The optimization process will take some time, depending on the size of the data.
Example
>>> # save view
>>> ds[:10].save_view(id="first_10")
>>> # optimize view
>>> ds.get_view("first_10").optimize()
>>> # load optimized view
>>> ds.load_view("first_10")
- Parameters
unlink (bool) – If True, this unlinks linked tensors (if any) by copying data from the links to the view. This does not apply to linked videos. Set hub.constants._UNLINK_VIDEOS to True to change this behavior.
num_workers (int) – Number of workers to be used for the optimization process. Defaults to 0.
scheduler (str) – The scheduler to be used for optimization. Supported values include: ‘serial’, ‘threaded’, ‘processed’ and ‘ray’. Only applicable if optimize=True. Defaults to ‘threaded’.
progressbar (bool) – Whether to display a progressbar.
- Returns