Datasets

Creating Datasets

deeplake.empty

Creates an empty Deep Lake dataset.

deeplake.like

Creates a new dataset by copying the source dataset's structure to a new location.

deeplake.ingest_classification

Ingest a dataset of images from a local folder to a Deep Lake Dataset.

deeplake.ingest_coco

Ingest images and annotations in COCO format to a Deep Lake Dataset.

deeplake.ingest_yolo

Ingest images and annotations (bounding boxes or polygons) in YOLO format to a Deep Lake Dataset.

deeplake.ingest_kaggle

Downloads a Kaggle dataset and stores it as a structured dataset at the destination.

deeplake.ingest_dataframe

Converts a pandas DataFrame to a Deep Lake Dataset.

deeplake.ingest_huggingface

Converts Hugging Face datasets to Deep Lake format.

deeplake.dataset

Returns a Dataset object referencing either a new or existing dataset.

Loading Datasets

deeplake.load

Loads an existing Deep Lake dataset.

Deleting and Renaming Datasets

deeplake.delete

Deletes a dataset at a given path.

deeplake.rename

Renames the managed dataset at path to new_name.

Copying Datasets

deeplake.copy

Copies dataset at src to dest.

deeplake.deepcopy

Copies dataset at src to dest including version control history.

Dataset Operations

Dataset.summary

Prints a summary of the dataset, including the tensor names and their lengths, shapes, htypes, dtypes, compressions, and other relevant information.

Dataset.append

Append a single sample (row) to multiple tensors at once.

Dataset.extend

Appends multiple samples (rows) to multiple tensors at once.

Dataset.update

Update existing samples in the dataset with new values.

Dataset.query

Returns a sliced Dataset with given query results.

Dataset.copy

Copies this dataset or dataset view to dest.

Dataset.delete

Deletes the entire dataset from the underlying storage and cache layers (if any).

Dataset.rename

Renames the dataset to new_name.

Dataset.connect

Connect a Deep Lake dataset stored in your cloud to the Deep Lake App.

Dataset.visualize

Visualizes the dataset in the Jupyter notebook.

Dataset.pop

Removes a sample from all the tensors of the dataset.

Dataset.rechunk

Rewrites the underlying chunks to make their sizes optimal.

Dataset.flush

Writes all the data that has been changed/assigned from the cache layers (if any) to the underlying storage.

Dataset.clear_cache

Flushes (see Dataset.flush()) the contents of the cache layers (if any) and then deletes the contents of all of those layers.

Dataset.size_approx

Estimates the size in bytes of the dataset.

Dataset.random_split

Splits the dataset into non-overlapping Dataset objects of given lengths.
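The idea of splitting into non-overlapping pieces of given lengths can be sketched in plain Python. This is a conceptual model only, not Deep Lake's implementation: the function name split_indices, the seed parameter, and the requirement that the lengths sum exactly to the number of samples are all assumptions made for illustration.

```python
import random


def split_indices(num_samples, lengths, seed=0):
    """Partition range(num_samples) into non-overlapping index lists.

    `lengths` gives the size of each split; in this sketch the sizes
    must sum to `num_samples` (an assumption, not a documented rule).
    """
    if sum(lengths) != num_samples:
        raise ValueError("lengths must sum to num_samples")

    indices = list(range(num_samples))
    random.Random(seed).shuffle(indices)  # shuffle once, then cut into pieces

    splits, start = [], 0
    for length in lengths:
        splits.append(indices[start:start + length])
        start += length
    return splits


train_idx, val_idx = split_indices(10, [8, 2])
print(len(train_idx), len(val_idx))       # -> 8 2
print(set(train_idx).isdisjoint(val_idx))  # -> True
```

Because every index lands in exactly one piece, no sample can appear in two splits, which is the property the description above promises.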

Dataset Visualization

Dataset.visualize

Visualizes the dataset in the Jupyter notebook.

Dataset Credentials

Dataset.add_creds_key

Adds a new creds key to the dataset.

Dataset.populate_creds

Populates the creds key added in add_creds_key with the given creds.

Dataset.update_creds_key

Updates the name and/or management status of a creds key.

Dataset.get_creds_keys

Returns the set of creds keys added to the dataset.

Dataset Properties

Dataset.tensors

All tensors belonging to this group, including those within sub groups.

Dataset.groups

All sub groups in this group.

Dataset.num_samples

Returns the length of the smallest tensor.

Dataset.read_only

Returns True if dataset is in read-only mode and False otherwise.

Dataset.info

Returns the information about the dataset.

Dataset.max_len

Returns the length (number of rows) of the longest tensor in the dataset.

Dataset.min_len

Returns the length (number of rows) of the shortest tensor in the dataset.

Dataset Version Control

Dataset.commit

Stores a snapshot of the current state of the dataset.

Dataset.diff

Returns/displays the differences between commits/branches.

Dataset.checkout

Checks out a specific commit_id or branch.

Dataset.merge

Merges the target_id into the current dataset.

Dataset.log

Displays the details of all the past commits.

Dataset.reset

Resets the uncommitted changes present in the branch.

Dataset.get_commit_details

Get details of a particular commit.

Dataset.commit_id

The last committed commit id of the dataset.

Dataset.branch

The current branch of the dataset.

Dataset.pending_commit_id

The commit_id of the next commit that will be made to the dataset.

Dataset.has_head_changes

Returns True if currently at head node and uncommitted changes are present.

Dataset.commits

Lists all the commits leading to the current dataset state.

Dataset.branches

Lists all the branches of the dataset.
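How commit, reset, and has_head_changes relate to each other can be illustrated with a toy model. The sketch below is plain Python and deliberately simplified: ToyVersionedData is a hypothetical class, it has a single branch, and it uses list positions as commit ids, whereas the real library manages its own commit ids and branches.

```python
import copy


class ToyVersionedData:
    """Hypothetical single-branch model of snapshot-style version control."""

    def __init__(self):
        self.rows = []        # working state, may hold uncommitted changes
        self._committed = []  # state as of the last commit
        self.history = []     # list of (message, snapshot) pairs

    @property
    def has_head_changes(self):
        # True when the working state differs from the last snapshot.
        return self.rows != self._committed

    def commit(self, message):
        # Store an independent copy so later edits cannot alter the snapshot.
        snapshot = copy.deepcopy(self.rows)
        self.history.append((message, snapshot))
        self._committed = snapshot
        return len(self.history) - 1  # toy commit id: position in history

    def reset(self):
        # Discard uncommitted changes by restoring the last snapshot.
        self.rows = copy.deepcopy(self._committed)


data = ToyVersionedData()
data.rows.append("sample-0")
first = data.commit("add first sample")

data.rows.append("sample-1")   # uncommitted change
print(data.has_head_changes)   # -> True

data.reset()
print(data.rows)               # -> ['sample-0']
print(data.has_head_changes)   # -> False
```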

Dataset Views

A dataset view is a subset of a dataset that points to specific samples (indices) in an existing dataset. Dataset views can be created by indexing a dataset, filtering a dataset with Dataset.filter(), querying a dataset with Dataset.query(), or sampling a dataset with Dataset.sample_by(). Filtering uses user-defined functions or simplified expressions, whereas querying runs SQL-like queries written in the Tensor Query Language (TQL). See the TQL spec for the full syntax.

Dataset views can only be saved when a dataset has been committed and has no changes on the HEAD node, in order to preserve data lineage and prevent the underlying data from changing after the query or filter conditions have been evaluated.

Example

>>> import deeplake
>>> # load dataset
>>> ds = deeplake.load("hub://activeloop/mnist-train")
>>> # filter dataset
>>> zeros = ds.filter("labels == 0")
>>> # save view
>>> zeros.save_view(id="zeros")
>>> # load_view
>>> zeros = ds.load_view(id="zeros")
>>> len(zeros)
5923
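To make the "points to specific samples (indices)" idea concrete, the following plain-Python sketch models a view as a list of row indices into a parent table. It is an illustration under stated assumptions, not Deep Lake code: ToyView and toy_filter are hypothetical names, and the parent "dataset" is just a dict of equal-length lists.

```python
class ToyView:
    """Hypothetical stand-in for a dataset view: stores indices, not data."""

    def __init__(self, source, indices):
        self.source = source          # parent "dataset": dict of lists
        self.indices = list(indices)  # rows of the parent this view exposes

    def __len__(self):
        return len(self.indices)

    def row(self, i):
        # Translate the view position into a parent row, then read it.
        j = self.indices[i]
        return {name: column[j] for name, column in self.source.items()}


def toy_filter(source, predicate):
    """Keep the indices of the rows for which predicate(sample) is true."""
    num_rows = min(len(column) for column in source.values())
    kept = [
        i for i in range(num_rows)
        if predicate({name: column[i] for name, column in source.items()})
    ]
    return ToyView(source, kept)


data = {"labels": [0, 3, 0, 7, 0], "ids": ["a", "b", "c", "d", "e"]}
zeros = toy_filter(data, lambda sample: sample["labels"] == 0)

print(len(zeros))     # -> 3
print(zeros.indices)  # -> [0, 2, 4]
print(zeros.row(1))   # -> {'labels': 0, 'ids': 'c'}
```

Since the view holds only indices, it stays meaningful only while the parent rows do not change, which is consistent with the requirement that views be saved from a committed dataset.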

Dataset.query

Returns a sliced Dataset with given query results.

Dataset.sample_by

Returns a sliced Dataset with given weighted sampler applied.

Dataset.filter

Filters the dataset in accordance with the filter function f(x: sample) -> bool.

Dataset.save_view

Saves a dataset view as a virtual dataset (VDS).

Dataset.get_view

Returns the dataset view corresponding to id.

Dataset.load_view

Loads the view and returns the Dataset by id.

Dataset.delete_view

Deletes the view with given view id.

Dataset.get_views

Returns list of views stored in this Dataset.

Dataset.is_view

Returns True if this dataset is a view and False otherwise.

Dataset.min_view

Returns a view of the dataset in which all tensors are sliced to have the same length as the shortest tensor.

Dataset.max_view

Returns a view of the dataset in which shorter tensors are padded with None values to have the same length as the longest tensor.
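The difference between the two views is easiest to see when tensors have unequal lengths. The sketch below models tensors as plain Python lists; toy_min_view and toy_max_view are hypothetical helpers that mimic the described behavior (truncate to the shortest, pad to the longest) and are not the library's implementation.

```python
def toy_min_view(tensors):
    """Slice every tensor to the length of the shortest one."""
    shortest = min(len(values) for values in tensors.values())
    return {name: values[:shortest] for name, values in tensors.items()}


def toy_max_view(tensors):
    """Pad shorter tensors with None up to the length of the longest one."""
    longest = max(len(values) for values in tensors.values())
    return {
        name: values + [None] * (longest - len(values))
        for name, values in tensors.items()
    }


# "images" has three samples, "labels" only two.
tensors = {"images": ["img0", "img1", "img2"], "labels": [4, 9]}

print(toy_min_view(tensors))
# -> {'images': ['img0', 'img1'], 'labels': [4, 9]}
print(toy_max_view(tensors))
# -> {'images': ['img0', 'img1', 'img2'], 'labels': [4, 9, None]}
```

In this model the row count of the min view corresponds to what min_len reports and the row count of the max view to what max_len reports.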