Datasets

Creating Datasets

deeplake.dataset

Returns a Dataset object referencing either a new or existing dataset.

deeplake.empty

Creates an empty dataset.

deeplake.like

Creates a new dataset by copying the source dataset's structure to a new location.

deeplake.ingest

Ingests a dataset from a source and stores it as a structured dataset to destination.

deeplake.ingest_coco

Ingest images and annotations in COCO format to a Deep Lake Dataset.

deeplake.ingest_yolo

Ingest images and annotations (bounding boxes or polygons) in YOLO format to a Deep Lake Dataset.

deeplake.ingest_kaggle

Downloads and ingests a Kaggle dataset and stores it as a structured dataset at the destination.

deeplake.ingest_dataframe

Converts a pandas DataFrame to a Deep Lake Dataset.

deeplake.ingest_huggingface

Converts Hugging Face datasets to Deep Lake format.

Loading Datasets

deeplake.load

Loads an existing dataset.

Deleting and Renaming Datasets

deeplake.delete

Deletes a dataset at a given path.

deeplake.rename

Renames dataset at old_path to new_path.

Copying Datasets

deeplake.copy

Copies dataset at src to dest.

deeplake.deepcopy

Copies dataset at src to dest including version control history.

Dataset Operations

Dataset.summary

Prints a summary of the dataset.

Dataset.append

Appends samples to multiple tensors at once.

Dataset.extend

Appends multiple rows of samples to multiple tensors at once.

Dataset.query

Returns a sliced Dataset with given query results.

Dataset.copy

Copies this dataset or dataset view to dest.

Dataset.delete

Deletes the entire dataset from the cache layers (if any) and the underlying storage.

Dataset.rename

Renames the dataset to path.

Dataset.connect

Connect a Deep Lake cloud dataset through a deeplake path.

Dataset.visualize

Visualizes the dataset in the Jupyter notebook.

Dataset.pop

Removes a sample from all the tensors of the dataset.

Dataset.rechunk

Rewrites the underlying chunks to make their sizes optimal.

Dataset.flush

Necessary operation after writes if caches are being used.

Dataset.clear_cache

Flushes (see Dataset.flush()) the contents of the cache layers (if any) and then deletes the contents of all those layers.

Dataset.size_approx

Estimates the size in bytes of the dataset.

Dataset Visualization

Dataset.visualize

Visualizes the dataset in the Jupyter notebook.

Dataset Credentials

Dataset.add_creds_key

Adds a new creds key to the dataset.

Dataset.populate_creds

Populates the creds key added in add_creds_key with the given creds.

Dataset.update_creds_key

Replaces the old creds key with the new creds key.

Dataset.change_creds_management

Changes the management status of the creds key.

Dataset.get_creds_keys

Returns the list of creds keys added to the dataset.

Dataset Properties

Dataset.tensors

All tensors belonging to this group, including those within sub groups.

Dataset.groups

All sub groups in this group.

Dataset.num_samples

Returns the length of the smallest tensor.

Dataset.read_only

Returns True if dataset is in read-only mode and False otherwise.

Dataset.info

Returns the information about the dataset.

Dataset.max_len

Returns the length of the longest tensor.

Dataset.min_len

Returns the length of the shortest tensor.

Dataset Version Control

Dataset.commit

Stores a snapshot of the current state of the dataset.

Dataset.diff

Returns/displays the differences between commits/branches.

Dataset.checkout

Checks out to a specific commit_id or branch.

Dataset.merge

Merges the target_id into the current dataset.

Dataset.log

Displays the details of all the past commits.

Dataset.reset

Resets the uncommitted changes present in the branch.

Dataset.get_commit_details

Get details of a particular commit.

Dataset.commit_id

The last committed commit id of the dataset.

Dataset.branch

The current branch of the dataset.

Dataset.pending_commit_id

The commit_id of the next commit that will be made to the dataset.

Dataset.has_head_changes

Returns True if currently at head node and uncommitted changes are present.

Dataset.commits

Lists all the commits leading to the current dataset state.

Dataset.branches

Lists all the branches of the dataset.

Dataset Views

A dataset view is a subset of a dataset that points to specific samples (indices) in an existing dataset. Dataset views can be created by indexing a dataset, filtering a dataset with Dataset.filter(), querying a dataset with Dataset.query(), or sampling a dataset with Dataset.sample_by(). Filtering uses user-defined functions or simplified expressions, whereas querying runs SQL-like queries written in the Tensor Query Language (TQL). See the full TQL specification for details.

Dataset views can only be saved when a dataset has been committed and has no changes on the HEAD node, in order to preserve data lineage and prevent the underlying data from changing after the query or filter conditions have been evaluated.

Example

>>> import deeplake
>>> # load dataset
>>> ds = deeplake.load("hub://activeloop/mnist-train")
>>> # filter dataset
>>> zeros = ds.filter("labels == 0")
>>> # save view
>>> zeros.save_view(id="zeros")
>>> # load_view
>>> zeros = ds.load_view(id="zeros")
>>> len(zeros)
5923
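
Conceptually, a view holds indices into its parent dataset rather than a copy of the samples. The following is a minimal pure-Python sketch of that idea only. The ToyView class and the filter_view helper are hypothetical names introduced for illustration; they are not part of Deep Lake and do not reflect how it stores views.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence


@dataclass
class ToyView:
    """Hypothetical stand-in for a dataset view: indices into a parent."""

    parent: Sequence[Any]  # underlying samples, referenced but never copied
    indices: List[int]     # positions in `parent` that belong to this view

    def __len__(self) -> int:
        return len(self.indices)

    def __getitem__(self, i: int) -> Any:
        # Resolve the view-local position to a position in the parent.
        return self.parent[self.indices[i]]


def filter_view(parent: Sequence[Any], predicate: Callable[[Any], bool]) -> ToyView:
    """Keep the indices of the samples for which `predicate` is true."""
    kept = [i for i, sample in enumerate(parent) if predicate(sample)]
    return ToyView(parent, kept)


labels = [0, 3, 0, 7, 0]
zeros = filter_view(labels, lambda label: label == 0)
print(len(zeros), zeros.indices)  # 3 [0, 2, 4]
```

Because the toy view only records positions, it would silently go stale if the parent list changed afterwards, which is the kind of drift the commit requirement for saved views is meant to rule out.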

Dataset.query

Returns a sliced Dataset with given query results.

Dataset.sample_by

Returns a sliced Dataset with given weighted sampler applied.

Dataset.filter

Filters the dataset in accordance with the filter function f(x: sample) -> bool.

Dataset.save_view

Saves a dataset view as a virtual dataset (VDS).

Dataset.get_view

Returns the dataset view corresponding to id.

Dataset.load_view

Loads the view and returns the Dataset by id.

Dataset.delete_view

Deletes the view with given view id.

Dataset.get_views

Returns list of views stored in this Dataset.

Dataset.is_view

Returns True if this dataset is a view and False otherwise.

Dataset.min_view

Returns a view of the dataset in which all tensors are sliced to have the same length as the shortest tensor.

Dataset.max_view

Returns a view of the dataset in which shorter tensors are padded with None values to have the same length as the longest tensor.
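
The difference between the two views only shows up when tensors have unequal lengths. As a rough illustration, the sketch below models a dataset as a plain dictionary of lists and mirrors the two descriptions above. The function names are reused for readability, but these are hypothetical toy helpers, not the Deep Lake implementation.

```python
# Toy model: a "dataset" is a dict mapping tensor names to lists of samples.
# This only mirrors the documented behaviour; it does not call Deep Lake.

def min_view(tensors):
    """Slice every tensor down to the length of the shortest one."""
    shortest = min(len(samples) for samples in tensors.values())
    return {name: samples[:shortest] for name, samples in tensors.items()}


def max_view(tensors):
    """Pad shorter tensors with None up to the length of the longest one."""
    longest = max(len(samples) for samples in tensors.values())
    return {
        name: list(samples) + [None] * (longest - len(samples))
        for name, samples in tensors.items()
    }


tensors = {
    "images": ["img0", "img1", "img2"],
    "labels": [0, 1],  # one label is missing
}

print(min_view(tensors))  # both tensors have 2 samples
print(max_view(tensors))  # both tensors have 3 samples, labels ends with None
```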