Dataset APIs

Dataset Management

Datasets can be created, loaded, and managed through static factory methods in the deeplake module.

deeplake.create

create(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
    schema: SchemaTemplate | None = None,
) -> Dataset

Creates a new dataset at the given URL.

To open an existing dataset, use deeplake.open

Parameters:

Name Type Description Default
url str

The URL of the dataset. URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

The string ENV or a dictionary containing credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently this only works with s3 paths.
  • It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This is also the case when creds is not passed for cloud datasets.
None
token str

Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

None
schema dict

The initial schema to use for the dataset. See deeplake.schema such as deeplake.schemas.TextEmbeddings for common starting schemas.

None

Examples:

>>> import deeplake
>>> from deeplake import types
>>>
>>> # Create a dataset in your local filesystem:
>>> ds = deeplake.create("directory_path")
>>> ds.add_column("id", types.Int32())
>>> ds.add_column("url", types.Text())
>>> ds.add_column("embedding", types.Embedding(768))
>>> ds.commit()
>>> ds.summary()
Dataset(columns=(id,url,embedding), length=0)
+---------+-------------------------------------------------------+
| column  |                         type                          |
+---------+-------------------------------------------------------+
|   id    |               kind=generic, dtype=int32               |
+---------+-------------------------------------------------------+
|   url   |                         text                          |
+---------+-------------------------------------------------------+
|embedding|kind=embedding, dtype=array(dtype=float32, shape=[768])|
+---------+-------------------------------------------------------+
>>> # Create dataset in your app.activeloop.ai organization:
>>> ds = deeplake.create("al://organization_id/dataset_name")
>>> # Create a dataset stored in your cloud using specified credentials:
>>> ds = deeplake.create("s3://mybucket/my_dataset",
>>>     creds = {"aws_access_key_id": ..., ...})
>>> # Create dataset stored in your cloud using app.activeloop.ai managed credentials.
>>> ds = deeplake.create("s3://mybucket/my_dataset",
>>>     creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id")
>>> # Create datasets on other storage backends or in memory:
>>> ds = deeplake.create("azure://bucket/path/to/dataset")
>>> ds = deeplake.create("gcs://bucket/path/to/dataset")
>>> ds = deeplake.create("mem://in-memory")

Raises:

Type Description
ValueError

if a dataset already exists at the given URL

deeplake.create_async

create_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
    schema: SchemaTemplate | None = None,
) -> Future

Asynchronously creates a new dataset at the given URL.

See deeplake.create for more information.

To open an existing dataset, use deeplake.open_async.

Examples:

>>> import deeplake
>>> from deeplake import types
>>>
>>> # Asynchronously create a dataset in your local filesystem:
>>> ds = await deeplake.create_async("directory_path")
>>> await ds.add_column("id", types.Int32())
>>> await ds.add_column("url", types.Text())
>>> await ds.add_column("embedding", types.Embedding(768))
>>> await ds.commit()
>>> await ds.summary()  # Example of usage in an async context
>>> # Alternatively, create a dataset using .result().
>>> future_ds = deeplake.create_async("directory_path")
>>> ds = future_ds.result()  # Blocks until the dataset is created
>>> # Create a dataset in your app.activeloop.ai organization:
>>> ds = await deeplake.create_async("al://organization_id/dataset_name")
>>> # Create a dataset stored in your cloud using specified credentials:
>>> ds = await deeplake.create_async("s3://mybucket/my_dataset",
>>>     creds={"aws_access_key_id": ..., ...})
>>> # Create dataset stored in your cloud using app.activeloop.ai managed credentials.
>>> ds = await deeplake.create_async("s3://mybucket/my_dataset",
>>>     creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")
>>> # Create datasets on other storage backends or in memory:
>>> ds = await deeplake.create_async("azure://bucket/path/to/dataset")
>>> ds = await deeplake.create_async("gcs://bucket/path/to/dataset")
>>> ds = await deeplake.create_async("mem://in-memory")

Raises:

Type Description
ValueError

if a dataset already exists at the given URL (will be raised when the future is awaited)

deeplake.open

open(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Dataset

Opens an existing dataset, potentially for modifying its content.

See deeplake.open_read_only for opening the dataset in read-only mode.

To create a new dataset, see deeplake.create.

Parameters:

Name Type Description Default
url str

The URL of the dataset. URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

The string ENV or a dictionary containing credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently this only works with s3 paths.
  • It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This is also the case when creds is not passed for cloud datasets.
None
token str

Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

None

Examples:

>>> # Load dataset managed by Deep Lake.
>>> ds = deeplake.open("al://organization_id/dataset_name")
>>> # Load dataset stored in your cloud using your own credentials.
>>> ds = deeplake.open("s3://bucket/my_dataset",
>>>     creds = {"aws_access_key_id": ..., ...})
>>> # Load dataset stored in your cloud using Deep Lake managed credentials.
>>> ds = deeplake.open("s3://bucket/my_dataset",
>>>     creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id")
>>> ds = deeplake.open("s3://bucket/path/to/dataset")
>>> ds = deeplake.open("azure://bucket/path/to/dataset")
>>> ds = deeplake.open("gcs://bucket/path/to/dataset")

deeplake.open_async

open_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Future

Asynchronously opens an existing dataset, potentially for modifying its content.

See deeplake.open for opening the dataset synchronously.

Examples:

>>> # Asynchronously load dataset managed by Deep Lake using await.
>>> ds = await deeplake.open_async("al://organization_id/dataset_name")
>>> # Asynchronously load dataset stored in your cloud using your own credentials.
>>> ds = await deeplake.open_async("s3://bucket/my_dataset",
>>>     creds={"aws_access_key_id": ..., ...})
>>> # Asynchronously load dataset stored in your cloud using Deep Lake managed credentials.
>>> ds = await deeplake.open_async("s3://bucket/my_dataset",
>>>     creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")
>>> ds = await deeplake.open_async("s3://bucket/path/to/dataset")
>>> ds = await deeplake.open_async("azure://bucket/path/to/dataset")
>>> ds = await deeplake.open_async("gcs://bucket/path/to/dataset")
>>> # Alternatively, load the dataset using .result().
>>> future_ds = deeplake.open_async("al://organization_id/dataset_name")
>>> ds = future_ds.result()  # Blocks until the dataset is loaded

deeplake.open_read_only

open_read_only(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> ReadOnlyDataset

Opens an existing dataset in read-only mode.

See deeplake.open for opening datasets for modification.

Parameters:

Name Type Description Default
url str

The URL of the dataset.

URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

The string ENV or a dictionary containing credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently this only works with s3 paths.
  • It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This is also the case when creds is not passed for cloud datasets.
None
token str

Activeloop token to authenticate user.

None

Examples:

>>> ds = deeplake.open_read_only("directory_path")
>>> ds.summary()
Dataset(columns=(id,url,embedding), length=0)
+---------+-------------------------------------------------------+
| column  |                         type                          |
+---------+-------------------------------------------------------+
|   id    |               kind=generic, dtype=int32               |
+---------+-------------------------------------------------------+
|   url   |                         text                          |
+---------+-------------------------------------------------------+
|embedding|kind=embedding, dtype=array(dtype=float32, shape=[768])|
+---------+-------------------------------------------------------+
>>> ds = deeplake.open_read_only("file:///path/to/dataset")
>>> ds = deeplake.open_read_only("s3://bucket/path/to/dataset")
>>> ds = deeplake.open_read_only("azure://bucket/path/to/dataset")
>>> ds = deeplake.open_read_only("gcs://bucket/path/to/dataset")
>>> ds = deeplake.open_read_only("mem://in-memory")

deeplake.open_read_only_async

open_read_only_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Future

Asynchronously opens an existing dataset in read-only mode.

See deeplake.open_async for opening datasets for modification and deeplake.open_read_only for sync open.

Examples:

>>> # Asynchronously open a dataset in read-only mode:
>>> ds = await deeplake.open_read_only_async("directory_path")
>>> # Alternatively, open the dataset using .result().
>>> future_ds = deeplake.open_read_only_async("directory_path")
>>> ds = future_ds.result()  # Blocks until the dataset is loaded
>>> ds = await deeplake.open_read_only_async("file:///path/to/dataset")
>>> ds = await deeplake.open_read_only_async("s3://bucket/path/to/dataset")
>>> ds = await deeplake.open_read_only_async("azure://bucket/path/to/dataset")
>>> ds = await deeplake.open_read_only_async("gcs://bucket/path/to/dataset")
>>> ds = await deeplake.open_read_only_async("mem://in-memory")

deeplake.delete

delete(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Deletes an existing dataset.

Warning

This operation is irreversible. All data will be lost.

If concurrent processes are attempting to write to the dataset while it's being deleted, it may lead to data inconsistency. It's recommended to use this operation with caution.
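
Example (illustrative; the URL and creds follow the same conventions as deeplake.create):

>>> import deeplake
>>> # Permanently remove a local dataset:
>>> deeplake.delete("directory_path")
>>> # Permanently remove a dataset stored in your cloud:
>>> deeplake.delete("s3://mybucket/my_dataset",
>>>     creds = {"aws_access_key_id": ..., ...})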

deeplake.copy

copy(
    src: str,
    dst: str,
    src_creds: dict[str, str] | None = None,
    dst_creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Copies the dataset at the source URL to the destination URL.

NOTE: Currently private due to potential issues in file timestamp handling

Parameters:

Name Type Description Default
src str

The URL of the source dataset.

required
dst str

The URL of the destination dataset.

required
src_creds (dict, str)

The string ENV or a dictionary containing credentials used to access the source dataset at the path.

None
dst_creds (dict, str)

The string ENV or a dictionary containing credentials used to access the destination dataset at the path.

None
token str

Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

None

Examples:

>>> deeplake.copy("al://organization_id/source_dataset", "al://organization_id/destination_dataset")

deeplake.like

like(
    src: DatasetView,
    dest: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Dataset

Creates a new dataset by copying the source dataset's structure to a new location.

Note

No data is copied.

Parameters:

Name Type Description Default
src DatasetView

The dataset to copy the structure from.

required
dest str

The URL to create the new dataset at.

required
creds (dict, str)

The string ENV or a dictionary containing credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, these take precedence over credentials present in the environment or in the credentials file. Currently this only works with s3 paths.
  • It supports 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name' as keys.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This is also the case when creds is not passed for cloud datasets.
None
token str

Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

None

Examples:

>>> ds = deeplake.like(src="az://bucket/existing/to/dataset",
>>>     dest="s3://bucket/new/dataset")

deeplake.from_parquet

from_parquet(url: str) -> ReadOnlyDataset

Opens a Parquet dataset and returns it as a read-only Deep Lake dataset.

Parameters:

Name Type Description Default
url str

The URL of the Parquet dataset. If no protocol is specified, it assumes file://

required
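
Example (a minimal sketch; the file path is illustrative, and writable_ds stands for any existing writable dataset):

>>> ds = deeplake.from_parquet("./file.parquet")
>>> ds.summary()
>>> # The result is read-only; its rows can be copied into a writable dataset via append:
>>> writable_ds.append(ds)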

deeplake.connect

connect(
    src: str,
    dest: str | None = None,
    org_id: str | None = None,
    creds_key: str | None = None,
    token: str | None = None,
) -> Dataset

Connects an existing dataset to your app.activeloop.ai account.

Either dest or org_id is required but not both.

See deeplake.disconnect

Parameters:

Name Type Description Default
src str

The URL to the existing dataset.

required
dest str

Desired Activeloop url for the dataset entry. Example: al://my_org/dataset

None
org_id str

The id of the organization to store the dataset under. The dataset name will be based on the source dataset's name.

None
creds_key str

The creds_key of the managed credentials that will be used to access the source path. If not set, use the organization's default credentials.

None
token str

Activeloop token used to fetch the managed credentials.

None

Examples:

>>> ds = deeplake.connect("s3://bucket/path/to/dataset",
>>>     "al://my_org/dataset")
>>> ds = deeplake.connect("s3://bucket/path/to/dataset",
>>>     "al://my_org/dataset", creds_key="my_key")
>>> # Connect the dataset as al://my_org/dataset
>>> ds = deeplake.connect("s3://bucket/path/to/dataset",
>>>     org_id="my_org")
>>> ds = deeplake.connect("az://bucket/path/to/dataset",
>>>     "al://my_org/dataset", creds_key="my_key")
>>> ds = deeplake.connect("gcs://bucket/path/to/dataset",
>>>     "al://my_org/dataset", creds_key="my_key")

deeplake.disconnect

disconnect(url: str, token: str | None = None) -> None

Disconnects the dataset from your Activeloop account.

See deeplake.connect

Note

Does not delete the stored data; it only removes the connection from the Activeloop organization.

Parameters:

Name Type Description Default
url str

The URL of the dataset.

required
token str

Activeloop token to authenticate user.

None

Examples:

>>> deeplake.disconnect("al://my_org/dataset_name")

deeplake.convert

convert(
    src: str, dst: str, dst_creds: Dict[str, str] = None
)

Copies the v3 dataset at src into a new dataset at dst in the v4 format.
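
Example (illustrative URLs; dst_creds follows the same conventions as creds in deeplake.create):

>>> deeplake.convert("s3://bucket/v3_dataset", "s3://bucket/v4_dataset",
>>>     dst_creds = {"aws_access_key_id": ..., ...})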

deeplake.Dataset

Bases: DatasetView

Datasets are the primary data structure used in Deep Lake. They are used to store and manage data for searching, training, and evaluation.

Unlike deeplake.ReadOnlyDataset, instances of Dataset can be modified.

__getitem__

__getitem__(offset: int) -> Row
__getitem__(range: slice) -> RowRange
__getitem__(column: str) -> Column
__getitem__(
    input: int | slice | str,
) -> Row | RowRange | Column

Returns a subset of data from the Dataset

The result will depend on the type of value passed to the [] operator.

  • int: The zero-based offset of the single row to return. Returns a deeplake.Row
  • slice: A slice specifying the range of rows to return. Returns a deeplake.RowRange
  • str: A string specifying column to return all values from. Returns a deeplake.Column

Examples:

>>> row = ds[318]
>>> rows = ds[931:1038]
>>> column_data = ds["id"]

__getstate__

__getstate__() -> tuple

Returns a tuple that can be pickled and used to restore this dataset.

Note

Pickling a dataset does not copy the dataset, it only saves attributes that can be used to restore the dataset.

__iter__

__iter__() -> Iterator[Row]

Row based iteration over the dataset.

Examples:

>>> for row in ds:
>>>     # process row
>>>     pass

__len__

__len__() -> int

The number of rows in the dataset

__repr__

__repr__() -> str

__setstate__

__setstate__(arg0: tuple) -> None

Restores dataset from a pickled state.

Parameters:

Name Type Description Default
arg0 tuple

The pickled state used to restore the dataset.

required
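
Example (a minimal sketch; the standard-library pickle module calls __getstate__ and __setstate__ for you, and the path is illustrative):

>>> import pickle
>>> ds = deeplake.open("directory_path")
>>> blob = pickle.dumps(ds)        # saves the attributes needed to restore the dataset
>>> restored = pickle.loads(blob)  # re-opens the dataset; no data is copied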

add_column

add_column(
    name: str,
    dtype: DataType | str | Type | type | Callable,
    format: DataFormat | None = None,
) -> None

Add a new column to the dataset.

Any existing rows in the dataset will have a None value for the new column

Parameters:

Name Type Description Default
name str

The name of the column

required
dtype DataType | str | Type | type | Callable

The type of the column. Possible values include:

  • Values from deeplake.types such as "deeplake.types.Int32()"
  • Python types: str, int, float
  • Numpy types: such as np.int32
  • A function reference that returns one of the above types
required
format DataFormat

The format of the column, if applicable. Only required when the dtype is deeplake.types.DataType.

None

Examples:

>>> ds.add_column("labels", deeplake.types.Int32)
>>> ds.add_column("labels", "int32")
>>> ds.add_column("name", deeplake.types.Text())
>>> ds.add_column("json_data", deeplake.types.Dict())
>>> ds.add_column("images", deeplake.types.Image(dtype=deeplake.types.UInt8(), sample_compression="jpeg"))
>>> ds.add_column("embedding", deeplake.types.Embedding(dtype=deeplake.types.Float32(), dimensions=768))

Raises:

Type Description
ColumnAlreadyExistsError

If a column with the same name already exists.

append

append(data: list[dict[str, Any]]) -> None
append(data: dict[str, Any]) -> None
append(data: DatasetView) -> None
append(
    data: (
        list[dict[str, Any]] | dict[str, Any] | DatasetView
    )
) -> None

Adds data to the dataset.

The data can be in a variety of formats:

  • A list of dictionaries, each value in the list is a row, with the dicts containing the column name and its value for the row.
  • A dictionary, the keys are the column names and the values are array-like (list or numpy.array) objects corresponding to the column values.
  • A DatasetView that was generated through any mechanism

Parameters:

Name Type Description Default
data list[dict[str, Any]] | dict[str, Any] | DatasetView

The data to insert into the dataset.

required

Examples:

>>> ds.append({"name": ["Alice", "Bob"], "age": [25, 30]})
>>> ds.append([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}])
>>> ds.append({
>>>     "embedding": np.random.rand(4, 768),
>>>     "text": ["Hello World"] * 4})
>>> ds.append([{"embedding": np.random.rand(768), "text": "Hello World"}] * 4)
>>> ds.append(deeplake.from_parquet("./file.parquet"))

Raises:

Type Description
ColumnMissingAppendValueError

If any column is missing from the input data.

UnevenColumnsError

If the input data columns are not the same length.

InvalidTypeDimensions

If the input data does not match the column's dimensions.

batches

batches(
    batch_size: int, drop_last: bool = False
) -> Prefetcher

Return a deeplake.Prefetcher for this DatasetView

Parameters:

Name Type Description Default
batch_size int

Number of rows in each batch

required
drop_last bool

Whether to drop the final batch if it is incomplete

False
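
Example (a minimal sketch; assumes the returned deeplake.Prefetcher can be iterated over to yield one batch of rows at a time):

>>> prefetcher = ds.batches(batch_size=128, drop_last=True)
>>> for batch in prefetcher:
>>>     # process batch
>>>     pass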

commit

commit(message: str | None = None) -> None

Atomically commits changes you have made to the dataset. After commit, other users will see your changes to the dataset.

Parameters:

Name Type Description Default
message str

A message to store in history describing the changes made in the version

None

Examples:

>>> ds.commit()
>>> ds.commit("Added data from updated documents")

commit_async

commit_async(message: str | None = None) -> FutureVoid

Asynchronously commits changes you have made to the dataset.

See deeplake.Dataset.commit for more information.

Parameters:

Name Type Description Default
message str

A message to store in history describing the changes made in the commit

None

Examples:

>>> ds.commit_async().wait()
>>> ds.commit_async("Added data from updated documents").wait()
>>> await ds.commit_async()
>>> await ds.commit_async("Added data from updated documents")
>>> future = ds.commit_async() # then you can check if the future is completed using future.is_completed()

created_time property

created_time: datetime

When the dataset was created. The value is auto-generated at creation time.

delete

delete(offset: int) -> None

Delete a row from the dataset.

Parameters:

Name Type Description Default
offset int

The offset of the row within the dataset to delete

required
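
Example (the offset is illustrative):

>>> ds.delete(518)  # removes the row at offset 518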

description instance-attribute

description: str

The description of the dataset. Setting the value will immediately persist the change without requiring a commit().

history property

history: History

This dataset's version history

id property

id: str

The unique identifier of the dataset. Value is auto-generated at creation time.

metadata property

metadata: Metadata

The metadata of the dataset.

name instance-attribute

name: str

The name of the dataset. Setting the value will immediately persist the change without requiring a commit().

pull

pull(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Pulls any new history from the dataset at the passed url into this dataset.

Similar to deeplake.Dataset.push but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the dataset to pull new history from

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None
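
Example (a minimal sketch; assumes the two datasets share history, e.g. one was copied from the other):

>>> ds = deeplake.open("directory_path")
>>> # Bring in commits made to the copy stored in your organization:
>>> ds.pull("al://my_org/dataset_name")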

pull_async

pull_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> FutureVoid

Asynchronously pulls any new history from the dataset at the passed url into this dataset.

Similar to deeplake.Dataset.push_async but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the dataset to pull new history from

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None

push

push(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Pushes any new history from this dataset to the dataset at the given url

Similar to deeplake.Dataset.pull but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the destination dataset

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None

push_async

push_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> FutureVoid

Asynchronously pushes any new history from this dataset to the dataset at the given url

Similar to deeplake.Dataset.pull_async but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the destination dataset

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None

pytorch

pytorch(transform: Callable[[Any], Any] = None)

Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.

By default, no transformations are applied and each row is returned as a dict with keys of column names.

Parameters:

Name Type Description Default
transform Callable[[Any], Any]

A custom function to apply to each sample before returning it

None

Raises:

Type Description
ImportError

If pytorch is not installed

Examples:

>>> from torch.utils.data import DataLoader
>>>
>>> ds = deeplake.open("path/to/dataset")
>>> dataloader = DataLoader(ds.pytorch(), batch_size=60,
>>>                             shuffle=True, num_workers=10)
>>> for i_batch, sample_batched in enumerate(dataloader):
>>>      process_batch(sample_batched)

remove_column

remove_column(name: str) -> None

Removes an existing column from the dataset.

Parameters:

Name Type Description Default
name str

The name of the column to remove

required

Examples:

>>> ds.remove_column("name")

Raises:

Type Description
ColumnDoesNotExistsError

If a column with the specified name does not exist.

rename_column

rename_column(name: str, new_name: str) -> None

Renames an existing column in the dataset.

Parameters:

Name Type Description Default
name str

The name of the column to rename

required
new_name str

The new name to set to column

required

Examples:

>>> ds.rename_column("old_name", "new_name")

Raises:

Type Description
ColumnDoesNotExistsError

If a column with the specified name does not exist.

ColumnAlreadyExistsError

If a column with the specified new name already exists.

rollback

rollback() -> None

Reverts any in-progress changes to the dataset you have made. Does not revert any changes that have been committed.
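
Example (a minimal sketch; column names are illustrative):

>>> ds.append({"name": ["Alice"], "age": [25]})
>>> ds.rollback()  # discards the uncommitted append above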

rollback_async

rollback_async() -> FutureVoid

Asynchronously reverts any in-progress changes to the dataset you have made. Does not revert any changes that have been committed.

schema property

schema: Schema

The schema of the dataset.

summary

summary() -> None

Prints a summary of the dataset.

Examples:

>>> ds.summary()
Dataset(columns=(id,title,embedding), length=51611356)
+---------+-------------------------------------------------------+
| column  |                         type                          |
+---------+-------------------------------------------------------+
|   id    |               kind=generic, dtype=int32               |
+---------+-------------------------------------------------------+
|  title  |                         text                          |
+---------+-------------------------------------------------------+
|embedding|kind=embedding, dtype=array(dtype=float32, shape=[768])|
+---------+-------------------------------------------------------+

tag

tag(name: str, version: str | None = None) -> Tag

Tags a version of the dataset. If no version is given, the current version is tagged.

Parameters:

Name Type Description Default
name str

The name of the tag

required
version str | None

The version of the dataset to tag

None
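
Example (the tag name is illustrative):

>>> ds.commit("Added initial data")
>>> ds.tag("v1.0")  # tags the current version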

tags property

tags: Tags

The collection of deeplake.Tags within the dataset

tensorflow

tensorflow() -> Any

Returns a TensorFlow tensorflow.data.Dataset wrapper around this DatasetView.

Raises:

Type Description
ImportError

If TensorFlow is not installed

Examples:

>>> ds = deeplake.open("path/to/dataset")
>>> dl = ds.tensorflow().shuffle(500).batch(32)
>>> for i_batch, sample_batched in enumerate(dl):
>>>     process_batch(sample_batched)

version property

version: str

The currently checked out version of the dataset

deeplake.ReadOnlyDataset

Bases: DatasetView

__getitem__

__getitem__(offset: int) -> RowView
__getitem__(range: slice) -> RowRangeView
__getitem__(column: str) -> ColumnView
__getitem__(
    input: int | slice | str,
) -> RowView | RowRangeView | ColumnView

Returns a subset of data from the dataset.

The result will depend on the type of value passed to the [] operator.

Examples:

>>> row = ds[318]
>>> rows = ds[931:1038]
>>> column_data = ds["id"]

__iter__

__iter__() -> Iterator[RowView]

Row based iteration over the dataset.

Examples:

>>> for row in ds:
>>>     # process row
>>>     pass

__len__

__len__() -> int

The number of rows in the dataset

batches

batches(
    batch_size: int, drop_last: bool = False
) -> Prefetcher

Return a deeplake.Prefetcher for this DatasetView

Parameters:

Name Type Description Default
batch_size int

Number of rows in each batch

required
drop_last bool

Whether to drop the final batch if it is incomplete

False

created_time property

created_time: datetime

When the dataset was created. The value is auto-generated at creation time.

description property

description: str

The description of the dataset

history property

history: History

The history of the overall dataset configuration.

id property

id: str

The unique identifier of the dataset. Value is auto-generated at creation time.

metadata property

metadata: ReadOnlyMetadata

The metadata of the dataset.

name property

name: str

The name of the dataset.

push

push(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Pushes any history from this dataset to the dataset at the given url

Parameters:

Name Type Description Default
url str

The URL of the destination dataset

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None

push_async

push_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> FutureVoid

Asynchronously pushes any history from this dataset to the dataset at the given url

Parameters:

Name Type Description Default
url str

The URL of the destination dataset

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None

pytorch

pytorch(transform: Callable[[Any], Any] = None)

Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.

By default, no transformations are applied and each row is returned as a dict with keys of column names.

Parameters:

Name Type Description Default
transform Callable[[Any], Any]

A custom function to apply to each sample before returning it

None

Raises:

Type Description
ImportError

If pytorch is not installed

Examples:

>>> from torch.utils.data import DataLoader
>>>
>>> ds = deeplake.open("path/to/dataset")
>>> dataloader = DataLoader(ds.pytorch(), batch_size=60,
>>>                             shuffle=True, num_workers=10)
>>> for i_batch, sample_batched in enumerate(dataloader):
>>>      process_batch(sample_batched)

query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

>>> result = ds.query("select * where category == 'active'")
>>> for row in result:
>>>     print("Id is: ", row["id"])

query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.

Examples:

>>> future = ds.query_async("select * where category == 'active'")
>>> result = future.result()
>>> for row in result:
>>>     print("Id is: ", row["id"])
>>> # or use the Future in an await expression
>>> future = ds.query_async("select * where category == 'active'")
>>> result = await future
>>> for row in result:
>>>     print("Id is: ", row["id"])

schema property

schema: SchemaView

The schema of the dataset.

summary

summary() -> None

Prints a summary of the dataset.

Examples:

>>> ds.summary()
Dataset(columns=(id,title,embedding), length=51611356)
+---------+-------------------------------------------------------+
| column  |                         type                          |
+---------+-------------------------------------------------------+
|   id    |               kind=generic, dtype=int32               |
+---------+-------------------------------------------------------+
|  title  |                         text                          |
+---------+-------------------------------------------------------+
|embedding|kind=embedding, dtype=array(dtype=float32, shape=[768])|
+---------+-------------------------------------------------------+

tags property

tags: TagsView

The collection of deeplake.TagViews within the dataset

tensorflow

tensorflow() -> Any

Returns a TensorFlow tensorflow.data.Dataset wrapper around this DatasetView.

Raises:

Type Description
ImportError

If TensorFlow is not installed

Examples:

>>> ds = deeplake.open("path/to/dataset")
>>> dl = ds.tensorflow().shuffle(500).batch(32)
>>> for i_batch, sample_batched in enumerate(dl):
>>>     process_batch(sample_batched)

version property

version: str

The currently checked out version of the dataset

deeplake.DatasetView

A DatasetView is a dataset-like structure. It has a defined schema and contains data which can be queried.

__getitem__

__getitem__(offset: int) -> RowView
__getitem__(range: slice) -> RowRangeView
__getitem__(column: str) -> ColumnView
__getitem__(
    input: int | slice | str,
) -> RowView | RowRangeView | ColumnView

Returns a subset of data from the DatasetView.

The result will depend on the type of value passed to the [] operator.

Examples:

>>> ds = deeplake.create("mem://")
>>> ds.add_column("id", int)
>>> ds.add_column("name", str)
>>> ds.append({"id": [1,2,3], "name": ["Mary", "Joe", "Bill"]})
>>>
>>> row = ds[1]
>>> print("Id:", row["id"], "Name:", row["name"])
Id: 2 Name: Joe
>>> rows = ds[1:2]
>>> print(rows["id"])
>>> column_data = ds["id"]

__iter__

__iter__() -> Iterator[RowView]

Row based iteration over the dataset.

Examples:

>>> for row in ds:
>>>     # process row
>>>     pass

__len__

__len__() -> int

The number of rows in the dataset

batches

batches(
    batch_size: int, drop_last: bool = False
) -> Prefetcher

Return a deeplake.Prefetcher for this DatasetView

Parameters:

Name Type Description Default
batch_size int

Number of rows in each batch

required
drop_last bool

Whether to drop the final batch if it is incomplete

False

pytorch

pytorch(transform: Callable[[Any], Any] = None)

Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.

By default, no transformations are applied and each row is returned as a dict with keys of column names.

Parameters:

Name Type Description Default
transform Callable[[Any], Any]

A custom function to apply to each sample before returning it

None

Raises:

Type Description
ImportError

If pytorch is not installed

Examples:

>>> from torch.utils.data import DataLoader
>>>
>>> ds = deeplake.open("path/to/dataset")
>>> dataloader = DataLoader(ds.pytorch(), batch_size=60,
>>>                             shuffle=True, num_workers=10)
>>> for i_batch, sample_batched in enumerate(dataloader):
>>>      process_batch(sample_batched)

query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

>>> result = ds.query("select * where category == 'active'")
>>> for row in result:
>>>     print("Id is: ", row["id"])

query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.

Examples:

>>> future = ds.query_async("select * where category == 'active'")
>>> result = future.result()
>>> for row in result:
>>>     print("Id is: ", row["id"])
>>> # or use the Future in an await expression
>>> future = ds.query_async("select * where category == 'active'")
>>> result = await future
>>> for row in result:
>>>     print("Id is: ", row["id"])

schema property

schema: SchemaView

The schema of the dataset.

summary

summary() -> None

Prints a summary of the dataset.

Examples:

>>> ds.summary()
Dataset(columns=(id,title,embedding), length=51611356)
+---------+-------------------------------------------------------+
| column  |                         type                          |
+---------+-------------------------------------------------------+
|   id    |               kind=generic, dtype=int32               |
+---------+-------------------------------------------------------+
|  title  |                         text                          |
+---------+-------------------------------------------------------+
|embedding|kind=embedding, dtype=array(dtype=float32, shape=[768])|
+---------+-------------------------------------------------------+

tensorflow

tensorflow() -> Any

Returns a TensorFlow tensorflow.data.Dataset wrapper around this DatasetView.

Raises:

Type Description
ImportError

If TensorFlow is not installed

Examples:

>>> ds = deeplake.open("path/to/dataset")
>>> dl = ds.tensorflow().shuffle(500).batch(32)
>>> for i_batch, sample_batched in enumerate(dl):
>>>     process_batch(sample_batched)

deeplake.Column

Bases: ColumnView

__getitem__

__getitem__(index: int | slice) -> Any

__len__

__len__() -> int

__setitem__

__setitem__(index: int | slice, value: Any) -> None

get_async

get_async(index: int | slice) -> Future

metadata property

metadata: Metadata

name property

name: str

set_async

set_async(index: int | slice, value: Any) -> FutureVoid
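
Example (an illustrative sketch; the column name and values are assumptions, and the column is obtained via ds["column_name"] as shown in Dataset.__getitem__):

>>> column = ds["id"]
>>> print(len(column), column.name)
>>> first = column[0]                  # read a single value
>>> head = column[0:100]               # read a slice of values
>>> column[0] = 42                     # update a single value
>>> column.set_async(1, 43).wait()     # asynchronous update
>>> head = column.get_async(slice(0, 100)).result()  # asynchronous read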

deeplake.ColumnView

Provides access to a column in a dataset.

__getitem__

__getitem__(index: int | slice) -> Any

__len__

__len__() -> int

get_async

get_async(index: int | slice) -> Future

metadata property

metadata: ReadOnlyMetadata

name property

name: str

deeplake.Row

Provides mutable access to a particular row in a dataset.

__getitem__

__getitem__(column: str) -> Any

The value for the given column

__setitem__

__setitem__(column: str, value: Any) -> None

Change the value for the given column
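
Example (column names are illustrative; the row is obtained via ds[offset] as shown in Dataset.__getitem__):

>>> row = ds[318]
>>> title = row["title"]        # read a value
>>> row["title"] = "New title"  # update a value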

get_async

get_async(column: str) -> Future

Asynchronously retrieves data for the specified column and returns a Future object.

Parameters:

Name Type Description Default
column str

The name of the column to retrieve data for.

required

Returns:

Name Type Description
Future Future

A Future object that will resolve to the value containing the column data.

Examples:

>>> future = row.get_async("column_name")
>>> column = future.result()  # Blocking call to get the result when it's ready.
Notes
  • The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
  • You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.

row_id property

row_id: int

The row_id of the row

set_async

set_async(column: str, value: Any) -> FutureVoid

Asynchronously sets a value for the specified column and returns a FutureVoid object.

Parameters:

Name Type Description Default
column str

The name of the column to update.

required
value Any

The value to set for the column.

required

Returns:

Name Type Description
FutureVoid FutureVoid

A FutureVoid object that will resolve when the operation is complete.

Examples:

>>> future_void = row.set_async("column_name", new_value)
>>> future_void.wait()  # Blocks until the operation is complete.
Notes
  • The method sets the value asynchronously and immediately returns a FutureVoid.
  • You can either block and wait for the operation to complete using wait() or await the FutureVoid object in an asynchronous context.

deeplake.RowView

Provides access to a particular row in a dataset.

__getitem__

__getitem__(column: str) -> Any

The value for the given column

get_async

get_async(column: str) -> Future

Asynchronously retrieves data for the specified column and returns a Future object.

Parameters:

Name Type Description Default
column str

The name of the column to retrieve data for.

required

Returns:

Name Type Description
Future Future

A Future object that will resolve to the value containing the column data.

Examples:

>>> future = row_view.get_async("column_name")
>>> column = future.result()  # Blocking call to get the result when it's ready.
Notes
  • The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
  • You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.

row_id property

row_id: int

The row_id of the row

deeplake.RowRange

Provides mutable access to a row range in a dataset.

__getitem__

__getitem__(column: str) -> Any

The value for the given column

__iter__

__iter__() -> Iterator[Row]

Iterate over the row range

__len__

__len__() -> int

The number of rows in the row range

__setitem__

__setitem__(column: str, value: Any) -> None

Change the value for the given column

get_async

get_async(column: str) -> Future

Asynchronously retrieves data for the specified column and returns a Future object.

Parameters:

Name Type Description Default
column str

The name of the column to retrieve data for.

required

Returns:

Name Type Description
Future Future

A Future object that will resolve to the value containing the column data.

Examples:

>>> future = row_range.get_async("column_name")
>>> column = future.result()  # Blocking call to get the result when it's ready.
Notes
  • The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
  • You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.

set_async

set_async(column: str, value: Any) -> FutureVoid

Asynchronously sets a value for the specified column and returns a FutureVoid object.

Parameters:

Name Type Description Default
column str

The name of the column to update.

required
value Any

The value to set for the column.

required

Returns:

Name Type Description
FutureVoid FutureVoid

A FutureVoid object that will resolve when the operation is complete.

Examples:

>>> future_void = row_range.set_async("column_name", new_value)
>>> future_void.wait()  # Blocks until the operation is complete.
Notes
  • The method sets the value asynchronously and immediately returns a FutureVoid.
  • You can either block and wait for the operation to complete using wait() or await the FutureVoid object in an asynchronous context.

deeplake.RowRangeView

Provides access to a row range in a dataset.

__getitem__

__getitem__(column: str) -> Any

The value for the given column

__iter__

__iter__() -> Iterator[RowView]

Iterate over the row range

__len__

__len__() -> int

The number of rows in the row range

get_async

get_async(column: str) -> Future

Asynchronously retrieves data for the specified column and returns a Future object.

Parameters:

Name Type Description Default
column str

The name of the column to retrieve data for.

required

Returns:

Name Type Description
Future Future

A Future object that will resolve to the value containing the column data.

Examples:

>>> future = row_range_view.get_async("column_name")
>>> column = future.result()  # Blocking call to get the result when it's ready.
Notes
  • The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
  • You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.

deeplake.Future

A future that represents a value that will be resolved in the future.

Once the Future is resolved, it will hold the result, and you can retrieve it using either a blocking call (result()) or via asynchronous mechanisms (await).

The future will resolve automatically even if you do not explicitly wait for it.

Methods:

Name Description
result

Blocks until the Future is resolved and returns the object.

__await__

Awaits the future asynchronously and returns the object once it's ready.

is_completed

Returns True if the Future is already resolved, False otherwise.

__await__

__await__() -> Any

Awaits the resolution of the Future asynchronously.

Examples:

>>> result = await future

Returns:

Type Description
Any

typing.Any: The result when the Future is resolved.

is_completed

is_completed() -> bool

Checks if the Future has been resolved.

Returns:

Name Type Description
bool bool

True if the Future is resolved, False otherwise.
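
Example (the query is borrowed from the query_async examples above):

>>> future = ds.query_async("select * where category == 'active'")
>>> # ... do other work while the query runs ...
>>> if not future.is_completed():
>>>     print("query still running")
>>> result = future.result()  # blocks only if the Future has not resolved yet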

result

result() -> Any

Blocks until the Future is resolved, then returns the result.

Returns:

Type Description
Any

typing.Any: The result when the Future is resolved.

deeplake.FutureVoid

A future that represents the completion of an operation that returns no result.

The future will resolve automatically to None, even if you do not explicitly wait for it.

Methods:

Name Description
wait

Blocks until the FutureVoid is resolved and then returns None.

__await__

Awaits the FutureVoid asynchronously and returns None once the operation is complete.

is_completed

Returns True if the FutureVoid is already resolved, False otherwise.

__await__

__await__() -> None

Awaits the resolution of the FutureVoid asynchronously.

Examples:

>>> await future_void  # Waits for the completion of the async operation.

Returns:

Name Type Description
None None

Indicates the operation has completed.

is_completed

is_completed() -> bool

Checks if the FutureVoid has been resolved.

Returns:

Name Type Description
bool bool

True if the FutureVoid is resolved, False otherwise.

wait

wait() -> None

Blocks until the FutureVoid is resolved, then returns None.

Examples:

>>> future_void.wait()  # Blocks until the operation completes.

Returns:

Name Type Description
None None

Indicates the operation has completed.