Datasets

Creating Datasets

`deeplake.dataset`	Returns a `Dataset` object referencing either a new or existing dataset.
`deeplake.empty`	Creates an empty dataset.
`deeplake.like`	Creates a new dataset by copying the `source` dataset's structure to a new location.
`deeplake.ingest_classification`	Ingest a dataset of images from a local folder to a Deep Lake Dataset.
`deeplake.ingest_coco`	Ingest images and annotations in COCO format to a Deep Lake Dataset.
`deeplake.ingest_yolo`	Ingest images and annotations (bounding boxes or polygons) in YOLO format to a Deep Lake Dataset.
`deeplake.ingest_kaggle`	Downloads a Kaggle dataset, ingests it, and stores it as a structured dataset at the destination.
`deeplake.ingest_dataframe`	Convert pandas dataframe to a Deep Lake Dataset.
`deeplake.ingest_huggingface`	Converts Hugging Face datasets to Deep Lake format.

Loading Datasets

`deeplake.load`	Loads an existing dataset.

Deleting and Renaming Datasets

`deeplake.delete`	Deletes a dataset at a given path.
`deeplake.rename`	Renames dataset at `old_path` to `new_path`.

Copying Datasets

`deeplake.copy`	Copies dataset at `src` to `dest`.
`deeplake.deepcopy`	Copies dataset at `src` to `dest` including version control history.

Dataset Operations

`Dataset.summary`	Prints a summary of the dataset.
`Dataset.append`	Appends samples to multiple tensors at once.
`Dataset.extend`	Appends multiple rows of samples to multiple tensors at once.
`Dataset.update`	Update existing samples in the dataset with new values.
`Dataset.query`	Returns a sliced `Dataset` with given query results.
`Dataset.copy`	Copies this dataset or dataset view to `dest`.
`Dataset.delete`	Deletes the entire dataset from the cache layers (if any) and the underlying storage.
`Dataset.rename`	Renames the dataset to path.
`Dataset.connect`	Connect a Deep Lake cloud dataset through a deeplake path.
`Dataset.visualize`	Visualizes the dataset in the Jupyter notebook.
`Dataset.pop`	Removes a sample from all the tensors of the dataset.
`Dataset.rechunk`	Rewrites the underlying chunks to make their sizes optimal.
`Dataset.flush`	Necessary operation after writes if caches are being used.
`Dataset.clear_cache`	Flushes (see `Dataset.flush()`) the contents of the cache layers (if any) and then deletes contents of all the layers of it.
`Dataset.size_approx`	Estimates the size in bytes of the dataset.
`Dataset.random_split`	Splits the dataset into non-overlapping `Dataset` objects of given lengths.
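
The idea behind splitting a dataset into non-overlapping parts of given lengths can be sketched in plain Python. This is only an illustration of the concept, not Deep Lake's implementation: `random_split_indices` and its `seed` parameter are hypothetical, and the sketch assumes integer lengths that sum to the number of samples.

```python
import random
from typing import List, Sequence


def random_split_indices(num_samples: int, lengths: Sequence[int], seed: int = 0) -> List[List[int]]:
    """Hypothetical helper: partition range(num_samples) into disjoint index lists."""
    if sum(lengths) != num_samples:
        raise ValueError("lengths must sum to the number of samples")

    # Shuffle all sample indices once, then cut the shuffled list into
    # consecutive chunks, so no index can land in two splits.
    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)

    splits, start = [], 0
    for length in lengths:
        splits.append(indices[start:start + length])
        start += length
    return splits


train_idx, val_idx = random_split_indices(10, [8, 2])
```

Each returned list could then be used to index the dataset, giving one view per split.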

Dataset Visualization

`Dataset.visualize`	Visualizes the dataset in the Jupyter notebook.

Dataset Credentials

`Dataset.add_creds_key`	Adds a new creds key to the dataset.
`Dataset.populate_creds`	Populates the creds key added in `Dataset.add_creds_key` with the given creds.
`Dataset.update_creds_key`	Updates the name and/or management status of a creds key.
`Dataset.get_creds_keys`	Returns the set of creds keys added to the dataset.

Dataset Properties

`Dataset.tensors`	All tensors belonging to this group, including those within sub groups.
`Dataset.groups`	All sub groups in this group.
`Dataset.num_samples`	Returns the length of the smallest tensor.
`Dataset.read_only`	Returns True if dataset is in read-only mode and False otherwise.
`Dataset.info`	Returns the information about the dataset.
`Dataset.max_len`	Returns the length of the longest tensor.
`Dataset.min_len`	Returns the length of the shortest tensor.

Dataset Version Control

`Dataset.commit`	Stores a snapshot of the current state of the dataset.
`Dataset.diff`	Returns/displays the differences between commits/branches.
`Dataset.checkout`	Checks out to a specific commit_id or branch.
`Dataset.merge`	Merges the target_id into the current dataset.
`Dataset.log`	Displays the details of all the past commits.
`Dataset.reset`	Resets the uncommitted changes present in the branch.
`Dataset.get_commit_details`	Get details of a particular commit.
`Dataset.commit_id`	The last committed commit id of the dataset.
`Dataset.branch`	The current branch of the dataset.
`Dataset.pending_commit_id`	The commit_id of the next commit that will be made to the dataset.
`Dataset.has_head_changes`	Returns True if currently at head node and uncommitted changes are present.
`Dataset.commits`	Lists all the commits leading to the current dataset state.
`Dataset.branches`	Lists all the branches of the dataset.

Dataset Views

A dataset view is a subset of a dataset that points to specific samples (indices) in an existing dataset. Dataset views can be created by indexing a dataset, filtering it with `Dataset.filter()`, querying it with `Dataset.query()`, or sampling it with `Dataset.sample_by()`. Filtering uses user-defined functions or simplified expressions, whereas querying runs SQL-like queries written in the Tensor Query Language (TQL). See the TQL specification for the full syntax.

Dataset views can only be saved when a dataset has been committed and has no changes on the HEAD node, in order to preserve data lineage and prevent the underlying data from changing after the query or filter conditions have been evaluated.

Example

>>> import deeplake
>>> # load dataset
>>> ds = deeplake.load("hub://activeloop/mnist-train")
>>> # filter dataset
>>> zeros = ds.filter("labels == 0")
>>> # save view
>>> zeros.save_view(id="zeros")
>>> # load_view
>>> zeros = ds.load_view(id="zeros")
>>> len(zeros)
5923
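
To make the notion of a view as "indices into an existing dataset" more concrete, here is a small conceptual model in plain Python. It is not how Deep Lake is implemented: `ToyDataset` and `ToyView` are hypothetical classes, and the check in `save_view` is only an assumption about how the commit requirement described above might be enforced.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ToyDataset:
    """Hypothetical stand-in for a dataset: a list of samples plus a dirty flag."""
    samples: List[dict]
    has_head_changes: bool = False  # True while there are uncommitted writes
    saved_views: Dict[str, List[int]] = field(default_factory=dict)

    def append(self, sample: dict) -> None:
        self.samples.append(sample)
        self.has_head_changes = True

    def commit(self) -> None:
        self.has_head_changes = False

    def filter(self, predicate: Callable[[dict], bool]) -> "ToyView":
        # A view stores indices into the parent, not copies of the samples.
        indices = [i for i, s in enumerate(self.samples) if predicate(s)]
        return ToyView(self, indices)

    def load_view(self, view_id: str) -> "ToyView":
        return ToyView(self, list(self.saved_views[view_id]))


@dataclass
class ToyView:
    """Hypothetical view: a parent dataset plus the indices it points to."""
    parent: ToyDataset
    indices: List[int]

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, i: int) -> dict:
        return self.parent.samples[self.indices[i]]

    def save_view(self, view_id: str) -> None:
        # Refuse to save while the parent has uncommitted changes, so the
        # stored indices always refer to a fixed, committed state.
        if self.parent.has_head_changes:
            raise RuntimeError("commit the dataset before saving a view")
        self.parent.saved_views[view_id] = list(self.indices)


ds = ToyDataset([{"label": 0}, {"label": 1}, {"label": 0}])
zeros = ds.filter(lambda s: s["label"] == 0)
zeros.save_view("zeros")
reloaded = ds.load_view("zeros")
```

Because only indices are stored, a saved view stays meaningful only as long as the samples it points to do not change, which is the motivation for requiring a commit first.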

`Dataset.query`	Returns a sliced `Dataset` with given query results.
`Dataset.sample_by`	Returns a sliced `Dataset` with given weighted sampler applied.
`Dataset.filter`	Filters the dataset in accordance with the filter function `f(x: sample) -> bool`.
`Dataset.save_view`	Saves a dataset view as a virtual dataset (VDS).
`Dataset.get_view`	Returns the dataset view corresponding to `id`.
`Dataset.load_view`	Loads the view and returns the `Dataset` by id.
`Dataset.delete_view`	Deletes the view with given view id.
`Dataset.get_views`	Returns list of views stored in this Dataset.
`Dataset.is_view`	Returns `True` if this dataset is a view and `False` otherwise.
`Dataset.min_view`	Returns a view of the dataset in which all tensors are sliced to have the same length as the shortest tensor.
`Dataset.max_view`	Returns a view of the dataset in which shorter tensors are padded with `None` values to have the same length as the longest tensor.
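
The difference between `Dataset.min_view` and `Dataset.max_view` can be illustrated with tensors modeled as plain Python lists. This is a sketch of the described behavior only: the standalone `min_view` and `max_view` functions below are hypothetical helpers, not the library's methods.

```python
from typing import Any, Dict, List


def min_view(tensors: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """Slice every tensor down to the length of the shortest one."""
    n = min(len(values) for values in tensors.values())
    return {name: values[:n] for name, values in tensors.items()}


def max_view(tensors: Dict[str, List[Any]]) -> Dict[str, List[Any]]:
    """Pad shorter tensors with None up to the length of the longest one."""
    n = max(len(values) for values in tensors.values())
    return {name: values + [None] * (n - len(values)) for name, values in tensors.items()}


# Tensors of unequal length, e.g. one image has not been labeled yet.
tensors = {"images": ["img0", "img1", "img2"], "labels": [0, 1]}

shortest = min_view(tensors)  # both tensors have 2 samples
longest = max_view(tensors)   # "labels" gains a trailing None
```

In this model, `min_view` drops the unlabeled image, while `max_view` keeps it and marks the missing label with `None`.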