
Dataset APIs

Dataset Management

Datasets can be created, loaded, and managed through static factory methods in the deeplake module.

deeplake.create

create(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
    schema: SchemaTemplate | None = None,
) -> Dataset

Creates a new dataset at the given URL.

To open an existing dataset, use deeplake.open

Parameters:

Name Type Description Default
url str

The URL of the dataset. URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

The string ENV or a dictionary containing credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', and 'aws_session_token' are present, these take precedence over credentials present in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', and 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets.
None
token str

Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

None
schema dict

The initial schema to use for the dataset. See deeplake.schemas, such as deeplake.schemas.TextEmbeddings, for common starting schemas.

None

Examples:

# Create a dataset in your local filesystem:
ds = deeplake.create("directory_path")
ds.add_column("id", types.Int32())
ds.add_column("url", types.Text())
ds.add_column("embedding", types.Embedding(768))
ds.commit()
ds.summary()
# Create dataset in your app.activeloop.ai organization:
ds = deeplake.create("al://organization_id/dataset_name")

# Create a dataset stored in your cloud using specified credentials:
ds = deeplake.create("s3://mybucket/my_dataset",
    creds = {"aws_access_key_id": id, "aws_secret_access_key": key})

# Create dataset stored in your cloud using app.activeloop.ai managed credentials.
ds = deeplake.create("s3://mybucket/my_dataset",
    creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id")

ds = deeplake.create("azure://bucket/path/to/dataset")

ds = deeplake.create("gcs://bucket/path/to/dataset")

ds = deeplake.create("mem://in-memory")

Raises:

Type Description
ValueError

if a dataset already exists at the given URL

deeplake.create_async

create_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
    schema: SchemaTemplate | None = None,
) -> Future

Asynchronously creates a new dataset at the given URL.

See deeplake.create for more information.

To open an existing dataset, use deeplake.open_async.

Examples:

async def create_dataset():
    # Asynchronously create a dataset in your local filesystem:
    ds = await deeplake.create_async("directory_path")
    await ds.add_column("id", types.Int32())
    await ds.add_column("url", types.Text())
    await ds.add_column("embedding", types.Embedding(768))
    await ds.commit()
    await ds.summary()  # Example of usage in an async context

    # Alternatively, create a dataset using .result().
    future_ds = deeplake.create_async("directory_path")
    ds = future_ds.result()  # Blocks until the dataset is created

    # Create a dataset in your app.activeloop.ai organization:
    ds = await deeplake.create_async("al://organization_id/dataset_name")

    # Create a dataset stored in your cloud using specified credentials:
    ds = await deeplake.create_async("s3://mybucket/my_dataset",
        creds={"aws_access_key_id": id, "aws_secret_access_key": key})

    # Create dataset stored in your cloud using app.activeloop.ai managed credentials.
    ds = await deeplake.create_async("s3://mybucket/my_dataset",
        creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

    ds = await deeplake.create_async("azure://bucket/path/to/dataset")

    ds = await deeplake.create_async("gcs://bucket/path/to/dataset")

    ds = await deeplake.create_async("mem://in-memory")

Raises:

Type Description
ValueError

if a dataset already exists at the given URL (will be raised when the future is awaited)

deeplake.open

open(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Dataset

Opens an existing dataset, potentially for modifying its content.

See deeplake.open_read_only for opening the dataset in read-only mode.

To create a new dataset, see deeplake.create.

Parameters:

Name Type Description Default
url str

The URL of the dataset. URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

The string ENV or a dictionary containing credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', and 'aws_session_token' are present, these take precedence over credentials present in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', and 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets.
None
token str

Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

None

Examples:

# Load dataset managed by Deep Lake.
ds = deeplake.open("al://organization_id/dataset_name")

# Load dataset stored in your cloud using your own credentials.
ds = deeplake.open("s3://bucket/my_dataset",
    creds = {"aws_access_key_id": id, "aws_secret_access_key": key})

# Load dataset stored in your cloud using Deep Lake managed credentials.
ds = deeplake.open("s3://bucket/my_dataset",
    creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id")

ds = deeplake.open("s3://bucket/path/to/dataset")

ds = deeplake.open("azure://bucket/path/to/dataset")

ds = deeplake.open("gcs://bucket/path/to/dataset")

deeplake.open_async

open_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Future

Asynchronously opens an existing dataset, potentially for modifying its content.

See deeplake.open for opening the dataset synchronously.

Examples:

async def async_open():
    # Asynchronously load dataset managed by Deep Lake using await.
    ds = await deeplake.open_async("al://organization_id/dataset_name")

    # Asynchronously load dataset stored in your cloud using your own credentials.
    ds = await deeplake.open_async("s3://bucket/my_dataset",
        creds={"aws_access_key_id": id, "aws_secret_access_key": key})

    # Asynchronously load dataset stored in your cloud using Deep Lake managed credentials.
    ds = await deeplake.open_async("s3://bucket/my_dataset",
        creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

    ds = await deeplake.open_async("s3://bucket/path/to/dataset")

    ds = await deeplake.open_async("azure://bucket/path/to/dataset")

    ds = await deeplake.open_async("gcs://bucket/path/to/dataset")

    # Alternatively, load the dataset using .result().
    future_ds = deeplake.open_async("al://organization_id/dataset_name")
    ds = future_ds.result()  # Blocks until the dataset is loaded

deeplake.open_read_only

open_read_only(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> ReadOnlyDataset

Opens an existing dataset in read-only mode.

See deeplake.open for opening datasets for modification.

Parameters:

Name Type Description Default
url str

The URL of the dataset.

URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

The string ENV or a dictionary containing credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', and 'aws_session_token' are present, these take precedence over credentials present in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', and 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets.
None
token str

Activeloop token to authenticate user.

None

Examples:

ds = deeplake.open_read_only("directory_path")
ds.summary()

Example Output:
Dataset length: 5
Columns:
  id       : int32
  url      : text
  embedding: embedding(768)

ds = deeplake.open_read_only("file:///path/to/dataset")

ds = deeplake.open_read_only("s3://bucket/path/to/dataset")

ds = deeplake.open_read_only("azure://bucket/path/to/dataset")

ds = deeplake.open_read_only("gcs://bucket/path/to/dataset")

ds = deeplake.open_read_only("mem://in-memory")

deeplake.open_read_only_async

open_read_only_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Future

Asynchronously opens an existing dataset in read-only mode.

See deeplake.open_async for opening datasets for modification and deeplake.open_read_only for sync open.

Examples:

# Asynchronously open a dataset in read-only mode:
ds = await deeplake.open_read_only_async("directory_path")

# Alternatively, open the dataset using .result().
future_ds = deeplake.open_read_only_async("directory_path")
ds = future_ds.result()  # Blocks until the dataset is loaded

ds = await deeplake.open_read_only_async("file:///path/to/dataset")

ds = await deeplake.open_read_only_async("s3://bucket/path/to/dataset")

ds = await deeplake.open_read_only_async("azure://bucket/path/to/dataset")

ds = await deeplake.open_read_only_async("gcs://bucket/path/to/dataset")

ds = await deeplake.open_read_only_async("mem://in-memory")

deeplake.delete

delete(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Deletes an existing dataset.

Warning

This operation is irreversible. All data will be lost.

If concurrent processes are attempting to write to the dataset while it's being deleted, it may lead to data inconsistency. It's recommended to use this operation with caution.
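
A minimal usage sketch; the URLs and credential values below are placeholders, and the creds/token arguments follow the same conventions as deeplake.open:

# Permanently delete a local dataset. This cannot be undone.
deeplake.delete("directory_path")

# Delete a dataset stored in your cloud using specified credentials.
deeplake.delete("s3://mybucket/my_dataset",
    creds={"aws_access_key_id": id, "aws_secret_access_key": key})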

deeplake.copy

copy(
    src: str,
    dst: str,
    src_creds: dict[str, str] | None = None,
    dst_creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Copies the dataset at the source URL to the destination URL.

NOTE: Currently private due to potential issues in file timestamp handling

Parameters:

Name Type Description Default
src str

The URL of the source dataset.

required
dst str

The URL of the destination dataset.

required
src_creds (dict, str)

The string ENV or a dictionary containing credentials used to access the source dataset at the path.

None
dst_creds (dict, str)

The string ENV or a dictionary containing credentials used to access the destination dataset at the path.

None
token str

Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

None

Examples:

deeplake.copy("al://organization_id/source_dataset", "al://organization_id/destination_dataset")

deeplake.like

like(
    src: DatasetView,
    dest: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Dataset

Creates a new dataset by copying the source dataset's structure to a new location.

Note

No data is copied.

Parameters:

Name Type Description Default
src DatasetView

The dataset to copy the structure from.

required
dest str

The URL to create the new dataset at.

required
creds (dict, str)

The string ENV or a dictionary containing credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', and 'aws_session_token' are present, these take precedence over credentials present in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', and 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from the environment variables. This is also the case when creds is not passed for cloud datasets.
None
token str

Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated.

None

Examples:

ds = deeplake.like(src="az://bucket/existing/to/dataset",
   dest="s3://bucket/new/dataset")

deeplake.from_parquet

from_parquet(url: str) -> ReadOnlyDataset

Opens a Parquet dataset in the deeplake format.

Parameters:

Name Type Description Default
url str

The URL of the Parquet dataset. If no protocol is specified, it assumes file://

required
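
As an illustrative sketch (the file path is a placeholder), the returned ReadOnlyDataset can be inspected like any other dataset:

ds = deeplake.from_parquet("./file.parquet")  # no protocol given, so file:// is assumed
ds.summary()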

deeplake.connect

connect(
    src: str,
    dest: str | None = None,
    org_id: str | None = None,
    creds_key: str | None = None,
    token: str | None = None,
) -> Dataset

Connects an existing dataset to your app.activeloop.ai account.

Either dest or org_id is required but not both.

See deeplake.disconnect

Parameters:

Name Type Description Default
src str

The URL to the existing dataset.

required
dest str

Desired Activeloop url for the dataset entry. Example: al://my_org/dataset

None
org_id str

The id of the organization to store the dataset under. The dataset name will be based on the source dataset's name.

None
creds_key str

The creds_key of the managed credentials that will be used to access the source path. If not set, use the organization's default credentials.

None
token str

Activeloop token used to fetch the managed credentials.

None

Examples:

ds = deeplake.connect("s3://bucket/path/to/dataset", "al://my_org/dataset")

ds = deeplake.connect("s3://bucket/path/to/dataset", "al://my_org/dataset", creds_key="my_key")

# Connect the dataset as al://my_org/dataset
ds = deeplake.connect("s3://bucket/path/to/dataset", org_id="my_org")

ds = deeplake.connect("az://bucket/path/to/dataset", "al://my_org/dataset", creds_key="my_key")

ds = deeplake.connect("gcs://bucket/path/to/dataset", "al://my_org/dataset", creds_key="my_key")

deeplake.disconnect

disconnect(url: str, token: str | None = None) -> None

Disconnects the dataset from your Activeloop account.

See deeplake.connect

Note

Does not delete the stored data; it only removes the connection from the Activeloop organization.

Parameters:

Name Type Description Default
url str

The URL of the dataset.

required
token str

Activeloop token to authenticate user.

None

Examples:

deeplake.disconnect("al://my_org/dataset_name")

deeplake.convert

convert(
    src: str,
    dst: str,
    dst_creds: Optional[Dict[str, str]] = None,
    token: Optional[str] = None,
) -> None

Copies the v3 dataset at src into a new dataset in the v4 format.
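
A brief sketch, assuming src is an existing v3 dataset and dst is a new location; the URLs and credentials are placeholders:

deeplake.convert("s3://bucket/v3_dataset", "s3://bucket/v4_dataset",
    dst_creds={"aws_access_key_id": id, "aws_secret_access_key": key})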

deeplake.Dataset

Bases: DatasetView

Datasets are the primary data structure used in Deep Lake. They are used to store and manage data for searching, training, and evaluation.

Unlike deeplake.ReadOnlyDataset, instances of Dataset can be modified.

__getitem__

__getitem__(offset: int) -> Row
__getitem__(range: slice) -> RowRange
__getitem__(indices: list) -> RowRange
__getitem__(indices: tuple) -> RowRange
__getitem__(column: str) -> Column
__getitem__(
    input: int | slice | list | tuple | str,
) -> Row | RowRange | Column

Returns a subset of data from the Dataset

The result will depend on the type of value passed to the [] operator.

  • int: The zero-based offset of the single row to return. Returns a deeplake.Row
  • slice: A slice specifying the range of rows to return. Returns a deeplake.RowRange
  • list: A list of indices specifying the rows to return. Returns a deeplake.RowRange
  • tuple: A tuple of indices specifying the rows to return. Returns a deeplake.RowRange
  • str: A string specifying column to return all values from. Returns a deeplake.Column

Examples:

row = ds[318]

rows = ds[931:1038]

rows = ds[931:1038:3]

rows = ds[[1, 3, 5, 7]]

rows = ds[(1, 3, 5, 7)]

column_data = ds["id"]

__getstate__

__getstate__() -> tuple

Returns a tuple that can be pickled and used to restore this dataset.

Note

Pickling a dataset does not copy the dataset; it only saves attributes that can be used to restore the dataset.

__iter__

__iter__() -> Iterator[Row]

Row based iteration over the dataset.

Examples:

for row in ds:
    # process row
    pass

__len__

__len__() -> int

The number of rows in the dataset

__setstate__

__setstate__(arg0: tuple) -> None

Restores dataset from a pickled state.

Parameters:

Name Type Description Default
arg0 tuple

The pickled state used to restore the dataset.

required
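
As a sketch, the state round-trips through the standard pickle module; the restored object re-opens the dataset rather than copying its data:

import pickle

serialized = pickle.dumps(ds)        # stores only what is needed to restore the dataset
restored_ds = pickle.loads(serialized)
restored_ds.summary()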

__str__

__str__() -> str

add_column

add_column(
    name: str,
    dtype: DataType | str | Type | type | Callable,
    format: DataFormat | None = None,
) -> None

Add a new column to the dataset.

Any existing rows in the dataset will have a None value for the new column

Parameters:

Name Type Description Default
name str

The name of the column

required
dtype DataType | str | Type | type | Callable

The type of the column. Possible values include:

  • Values from deeplake.types such as "deeplake.types.Int32()"
  • Python types: str, int, float
  • Numpy types: such as np.int32
  • A function reference that returns one of the above types
required
format DataFormat

The format of the column, if applicable. Only required when the dtype is deeplake.types.DataType.

None

Examples:

ds.add_column("labels", deeplake.types.Int32)

ds.add_column("categories", "int32")

ds.add_column("name", deeplake.types.Text())

ds.add_column("json_data", deeplake.types.Dict())

ds.add_column("images", deeplake.types.Image(dtype=deeplake.types.UInt8(), sample_compression="jpeg"))

ds.add_column("embedding", deeplake.types.Embedding(size=768))

Raises:

Type Description
ColumnAlreadyExistsError

If a column with the same name already exists.

append

append(data: list[dict[str, Any]]) -> None
append(data: dict[str, Any]) -> None
append(data: DatasetView) -> None
append(
    data: (
        list[dict[str, Any]] | dict[str, Any] | DatasetView
    )
) -> None

Adds data to the dataset.

The data can be in a variety of formats:

  • A list of dictionaries, each value in the list is a row, with the dicts containing the column name and its value for the row.
  • A dictionary, the keys are the column names and the values are array-like (list or numpy.array) objects corresponding to the column values.
  • A DatasetView that was generated through any mechanism

Parameters:

Name Type Description Default
data list[dict[str, Any]] | dict[str, Any] | DatasetView

The data to insert into the dataset.

required

Examples:

ds.append({"name": ["Alice", "Bob"], "age": [25, 30]})

ds.append([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}])
ds2.append({
    "embedding": np.random.rand(4, 768),
    "text": ["Hello World"] * 4})

ds2.append([{"embedding": np.random.rand(768), "text": "Hello World"}] * 4)
ds2.append(deeplake.from_parquet("./file.parquet"))

Raises:

Type Description
ColumnMissingAppendValueError

If any column is missing from the input data.

UnevenColumnsError

If the input data columns are not the same length.

InvalidTypeDimensions

If the input data does not match the column's dimensions.

batches

batches(
    batch_size: int, drop_last: bool = False
) -> Iterable

Returns batches that can be used to more efficiently stream large amounts of data from a Deep Lake dataset, for example into a DataLoader and then into the training framework.

Parameters:

Name Type Description Default
batch_size int

Number of rows in each batch

required
drop_last bool

Whether to drop the final batch if it is incomplete

False

Examples:

ds = deeplake.open("al://my_org/dataset")
batches = ds.batches(batch_size=2000, drop_last=True)
for batch in batches:
    process_batch(batch["images"])

commit

commit(message: str | None = None) -> None

Atomically commits changes you have made to the dataset. After commit, other users will see your changes to the dataset.

Parameters:

Name Type Description Default
message str

A message to store in history describing the changes made in the version

None

Examples:

ds.commit()

ds.commit("Added data from updated documents")

commit_async

commit_async(message: str | None = None) -> FutureVoid

Asynchronously commits changes you have made to the dataset.

See deeplake.Dataset.commit for more information.

Parameters:

Name Type Description Default
message str

A message to store in history describing the changes made in the commit

None

Examples:

ds.commit_async().wait()

ds.commit_async("Added data from updated documents").wait()

async def do_commit():
    await ds.commit_async()

future = ds.commit_async()  # you can check if the future is completed using future.is_completed()

created_time property

created_time: datetime

When the dataset was created. The value is auto-generated at creation time.

creds_key property

creds_key: str | None

The key used to store the credentials for the dataset.

delete

delete(offset: int) -> None

Delete a row from the dataset.

Parameters:

Name Type Description Default
offset int

The offset of the row within the dataset to delete

required
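
A minimal sketch; as with other modifications, commit() persists the change:

ds.delete(0)  # delete the first row
ds.commit("Removed the first row")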

description instance-attribute

description: str

The description of the dataset. Setting the value will immediately persist the change without requiring a commit().
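
For example (the description text is illustrative):

ds.description = "Product images with 768-d embeddings"  # persisted immediately, no commit() needed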

history property

history: History

This dataset's version history

id property

id: str

The unique identifier of the dataset. Value is auto-generated at creation time.

metadata property

metadata: Metadata

The metadata of the dataset.

name instance-attribute

name: str

The name of the dataset. Setting the value will immediately persist the change without requiring a commit().

pull

pull(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Pulls any new history from the dataset at the passed url into this dataset.

Similar to deeplake.Dataset.push but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the dataset to pull new history from

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None
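
A hedged sketch of pulling new history from another copy of the dataset; the URLs and credentials are placeholders:

ds = deeplake.open("al://my_org/dataset")
ds.pull("s3://bucket/replica_of_dataset",
    creds={"aws_access_key_id": id, "aws_secret_access_key": key})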

pull_async

pull_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> FutureVoid

Asynchronously pulls any new history from the dataset at the passed url into this dataset.

Similar to deeplake.Dataset.push_async but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the dataset to pull new history from

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None

push

push(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Pushes any new history from this dataset to the dataset at the given url

Similar to deeplake.Dataset.pull but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the destination dataset

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None
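
A corresponding sketch for pushing local history to another copy; the URLs and credentials are placeholders:

ds = deeplake.open("al://my_org/dataset")
ds.push("s3://bucket/replica_of_dataset",
    creds={"aws_access_key_id": id, "aws_secret_access_key": key})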

push_async

push_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> FutureVoid

Asynchronously pushes any new history from this dataset to the dataset at the given url.

Similar to deeplake.Dataset.pull_async but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the destination dataset

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None

pytorch

pytorch(transform: Callable[[Any], Any] = None)

Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.

By default, no transformations are applied and each row is returned as a dict with keys of column names.

Parameters:

Name Type Description Default
transform Callable[[Any], Any]

A custom function to apply to each sample before returning it

None

Raises:

Type Description
ImportError

If pytorch is not installed

Examples:

from torch.utils.data import DataLoader

dl = DataLoader(ds.pytorch(), batch_size=60,
                            shuffle=True, num_workers=8)
for i_batch, sample_batched in enumerate(dl):
    process_batch(sample_batched)

remove_column

remove_column(name: str) -> None

Remove the existing column from the dataset.

Parameters:

Name Type Description Default
name str

The name of the column to remove

required

Examples:

ds.remove_column("name")

Raises:

Type Description
ColumnDoesNotExistsError

If a column with the specified name does not exist.

rename_column

rename_column(name: str, new_name: str) -> None

Renames the existing column in the dataset.

Parameters:

Name Type Description Default
name str

The name of the column to rename

required
new_name str

The new name to set to column

required

Examples:

ds.rename_column("old_name", "new_name")

Raises:

Type Description
ColumnDoesNotExistsError

If a column with the specified name does not exist.

ColumnAlreadyExistsError

If a column with the specified new name already exists.

rollback

rollback() -> None

Reverts any in-progress changes to the dataset you have made. Does not revert any changes that have been committed.
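
A minimal sketch, assuming the id/url/embedding schema from the deeplake.create example above and numpy imported as np:

ds.append({"id": [100], "url": ["https://example.com"], "embedding": [np.random.rand(768)]})
ds.rollback()  # discards the uncommitted append; committed versions are unaffected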

rollback_async

rollback_async() -> FutureVoid

Asynchronously reverts any in-progress changes to the dataset you have made. Does not revert any changes that have been committed.

schema property

schema: Schema

The schema of the dataset.

set_creds_key

set_creds_key(key: str, token: str | None = None) -> None

Sets the key used to store the credentials for the dataset.
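
A brief sketch; "managed_creds_key" stands in for the name of managed credentials in your Activeloop organization:

ds.set_creds_key("managed_creds_key")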

summary

summary() -> None

Prints a summary of the dataset.

Examples:

ds.summary()

tag

tag(name: str, version: str | None = None) -> Tag

Tags a version of the dataset. If no version is given, the current version is tagged.

Parameters:

Name Type Description Default
name str

The name of the tag

required
version str | None

The version of the dataset to tag

None
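
A minimal sketch; the commit message and tag name are placeholders:

ds.commit("Initial data load")
tag = ds.tag("v1.0")  # tags the current version; pass version= to tag an earlier one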

tags property

tags: Tags

The collection of deeplake.Tags within the dataset

tensorflow

tensorflow() -> Any

Returns a TensorFlow tensorflow.data.Dataset wrapper around this DatasetView.

Raises:

Type Description
ImportError

If TensorFlow is not installed

Examples:

dl = ds.tensorflow().shuffle(500).batch(32)
for i_batch, sample_batched in enumerate(dl):
     process_batch(sample_batched)

version property

version: str

The currently checked out version of the dataset

deeplake.ReadOnlyDataset

Bases: DatasetView

__getitem__

__getitem__(offset: int) -> RowView
__getitem__(range: slice) -> RowRangeView
__getitem__(indices: list) -> RowRangeView
__getitem__(indices: tuple) -> RowRangeView
__getitem__(column: str) -> ColumnView
__getitem__(
    input: int | slice | list | tuple | str,
) -> RowView | RowRangeView | ColumnView

Returns a subset of data from the DatasetView.

The result will depend on the type of value passed to the [] operator.

  • int: The zero-based offset of the single row to return. Returns a deeplake.RowView
  • slice: A slice specifying the range of rows to return. Returns a deeplake.RowRangeView
  • list: A list of indices specifying the rows to return. Returns a deeplake.RowRangeView
  • tuple: A tuple of indices specifying the rows to return. Returns a deeplake.RowRangeView
  • str: A string specifying column to return all values from. Returns a deeplake.ColumnView

Examples:

ds = deeplake.create("mem://")
ds.add_column("id", int)
ds.add_column("name", str)
ds.append({"id": [1,2,3], "name": ["Mary", "Joe", "Bill"]})

row = ds[1]
print("Id:", row["id"], "Name:", row["name"]) # Output: 2 Name: Joe
rows = ds[1:2]
print(rows["id"])

column_data = ds["id"]

__iter__

__iter__() -> Iterator[RowView]

Row based iteration over the dataset.

Examples:

for row in ds:
    # process row
    pass

__len__

__len__() -> int

The number of rows in the dataset

batches

batches(
    batch_size: int, drop_last: bool = False
) -> Iterable

Returns batches that can be used to more efficiently stream large amounts of data from a Deep Lake dataset, for example into a DataLoader and then into the training framework.

Parameters:

Name Type Description Default
batch_size int

Number of rows in each batch

required
drop_last bool

Whether to drop the final batch if it is incomplete

False

Examples:

ds = deeplake.open("al://my_org/dataset")
batches = ds.batches(batch_size=2000, drop_last=True)
for batch in batches:
    process_batch(batch["images"])

created_time property

created_time: datetime

When the dataset was created. The value is auto-generated at creation time.

description property

description: str

The description of the dataset

history property

history: History

The history of the overall dataset configuration.

id property

id: str

The unique identifier of the dataset. Value is auto-generated at creation time.

metadata property

metadata: ReadOnlyMetadata

The metadata of the dataset.

name property

name: str

The name of the dataset.

push

push(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Pushes any history from this dataset to the dataset at the given url

Parameters:

Name Type Description Default
url str

The URL of the destination dataset

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None

push_async

push_async(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> FutureVoid

Asynchronously pushes any history from this dataset to the dataset at the given url

Parameters:

Name Type Description Default
url str

The URL of the destination dataset

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None

pytorch

pytorch(transform: Callable[[Any], Any] = None)

Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.

By default, no transformations are applied and each row is returned as a dict with keys of column names.

Parameters:

Name Type Description Default
transform Callable[[Any], Any]

A custom function to apply to each sample before returning it

None

Raises:

Type Description
ImportError

If pytorch is not installed

Examples:

from torch.utils.data import DataLoader

dl = DataLoader(ds.pytorch(), batch_size=60,
                            shuffle=True, num_workers=8)
for i_batch, sample_batched in enumerate(dl):
    process_batch(sample_batched)

query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

result = ds.query("select * where category == 'active'")
for row in result:
    print("Id is: ", row["id"])

query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a Future that will resolve into a deeplake.DatasetView.

Examples:

future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])

schema property

schema: SchemaView

The schema of the dataset.

summary

summary() -> None

Prints a summary of the dataset.

Examples:

ds.summary()

tag

tag(name: str | None = None) -> Tag

Saves the current view as a tag to its source dataset and returns the tag.

tags property

tags: TagsView

The collection of deeplake.TagViews within the dataset

tensorflow

tensorflow() -> Any

Returns a TensorFlow tensorflow.data.Dataset wrapper around this DatasetView.

Raises:

Type Description
ImportError

If TensorFlow is not installed

Examples:

dl = ds.tensorflow().shuffle(500).batch(32)
for i_batch, sample_batched in enumerate(dl):
     process_batch(sample_batched)

version property

version: str

The currently checked out version of the dataset

deeplake.DatasetView

A DatasetView is a dataset-like structure. It has a defined schema and contains data which can be queried.

__getitem__

__getitem__(offset: int) -> RowView
__getitem__(range: slice) -> RowRangeView
__getitem__(indices: list) -> RowRangeView
__getitem__(indices: tuple) -> RowRangeView
__getitem__(column: str) -> ColumnView
__getitem__(
    input: int | slice | list | tuple | str,
) -> RowView | RowRangeView | ColumnView

Returns a subset of data from the DatasetView.

The result will depend on the type of value passed to the [] operator.

  • int: The zero-based offset of the single row to return. Returns a deeplake.RowView
  • slice: A slice specifying the range of rows to return. Returns a deeplake.RowRangeView
  • list: A list of indices specifying the rows to return. Returns a deeplake.RowRangeView
  • tuple: A tuple of indices specifying the rows to return. Returns a deeplake.RowRangeView
  • str: A string specifying column to return all values from. Returns a deeplake.ColumnView

Examples:

ds = deeplake.create("mem://")
ds.add_column("id", int)
ds.add_column("name", str)
ds.append({"id": [1,2,3], "name": ["Mary", "Joe", "Bill"]})

row = ds[1]
print("Id:", row["id"], "Name:", row["name"]) # Output: 2 Name: Joe
rows = ds[1:2]
print(rows["id"])

column_data = ds["id"]

__iter__

__iter__() -> Iterator[RowView]

Row based iteration over the dataset.

Examples:

for row in ds:
    # process row
    pass

__len__

__len__() -> int

The number of rows in the dataset

batches

batches(
    batch_size: int, drop_last: bool = False
) -> Iterable

Returns batches that can be used to more efficiently stream large amounts of data from a Deep Lake dataset, for example into a DataLoader and then into the training framework.

Parameters:

Name Type Description Default
batch_size int

Number of rows in each batch

required
drop_last bool

Whether to drop the final batch if it is incomplete

False

Examples:

ds = deeplake.open("al://my_org/dataset")
batches = ds.batches(batch_size=2000, drop_last=True)
for batch in batches:
    process_batch(batch["images"])

pytorch

pytorch(transform: Callable[[Any], Any] = None)

Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.

By default, no transformations are applied and each row is returned as a dict with keys of column names.

Parameters:

Name Type Description Default
transform Callable[[Any], Any]

A custom function to apply to each sample before returning it

None

Raises:

Type Description
ImportError

If pytorch is not installed

Examples:

from torch.utils.data import DataLoader

dl = DataLoader(ds.pytorch(), batch_size=60,
                            shuffle=True, num_workers=8)
for i_batch, sample_batched in enumerate(dl):
    process_batch(sample_batched)

query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

result = ds.query("select * where category == 'active'")
for row in result:
    print("Id is: ", row["id"])

query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a Future that will resolve into a deeplake.DatasetView.

Examples:

future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])

schema property

schema: SchemaView

The schema of the dataset.

summary

summary() -> None

Prints a summary of the dataset.

Examples:

ds.summary()

tag

tag(name: str | None = None) -> Tag

Saves the current view as a tag to its source dataset and returns the tag.
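
A hedged sketch: tagging a query-result view so it can be recalled from the source dataset later (the query and tag name are illustrative):

view = ds.query("select * where category == 'active'")
tag = view.tag("active_rows")  # saved back to the view's source dataset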

tensorflow

tensorflow() -> Any

Returns a TensorFlow tensorflow.data.Dataset wrapper around this DatasetView.

Raises:

Type Description
ImportError

If TensorFlow is not installed

Examples:

dl = ds.tensorflow().shuffle(500).batch(32)
for i_batch, sample_batched in enumerate(dl):
     process_batch(sample_batched)

deeplake.Column

Bases: ColumnView

Provides read-write access to a column in a dataset. Column extends ColumnView with methods for modifying data, making it suitable for dataset creation and updates in ML workflows.

The Column class allows you to:

  • Read and write data using integer indices, slices, or lists of indices
  • Modify data asynchronously for better performance
  • Access and modify column metadata
  • Handle various data types common in ML: images, embeddings, labels, etc.

Examples:

Update training labels:

# Update single label
ds["labels"][0] = 1

# Update batch of labels
ds["labels"][0:32] = new_labels

# Async update for better performance
future = ds["labels"].set_async(slice(0, 32), new_labels)
future.wait()

Store image embeddings:

# Generate and store embeddings
embeddings = model.encode(images)
ds["embeddings"][0:len(embeddings)] = embeddings

Manage column metadata:

# Store preprocessing parameters
ds["images"].metadata["mean"] = [0.485, 0.456, 0.406]
ds["images"].metadata["std"] = [0.229, 0.224, 0.225]

__getitem__

__getitem__(index: int | slice | list | tuple) -> Any

Retrieve data from the column at the specified index or range.

Parameters:

Name Type Description Default
index int | slice | list | tuple

Can be:

  • int: Single item index
  • slice: Range of indices (e.g., 0:10)
  • list/tuple: Multiple specific indices

required

Returns:

Type Description
Any

The data at the specified index/indices. Type depends on the column's data type.

Examples:

# Get single item
image = column[0]

# Get range
batch = column[0:32]

# Get specific indices
items = column[[1, 5, 10]]

__len__

__len__() -> int

Get the number of items in the column.

Returns:

Name Type Description
int int

Number of items in the column.

__setitem__

__setitem__(index: int | slice, value: Any) -> None

Set data in the column at the specified index or range.

Parameters:

Name Type Description Default
index int | slice

Can be:

  • int: Single item index
  • slice: Range of indices (e.g., 0:10)

required
value Any

The data to store. Must match the column's data type.

required

Examples:

# Update single item
column[0] = new_image

# Update range
column[0:32] = new_batch

get_async

get_async(index: int | slice | list | tuple) -> Future

Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.

Parameters:

Name Type Description Default
index int | slice | list | tuple

Can be:

  • int: Single item index
  • slice: Range of indices
  • list/tuple: Multiple specific indices

required

Returns:

Name Type Description
Future Future

A Future object that resolves to the requested data.

Examples:

# Async batch load
future = column.get_async(slice(0, 32))
batch = future.result()

# Using with async/await
async def load_batch():
    batch = await column.get_async(slice(0, 32))
    return batch

metadata property

metadata: Metadata

name property

name: str

Get the name of the column.

Returns:

Name Type Description
str str

The column name.

set_async

set_async(index: int | slice, value: Any) -> FutureVoid

Asynchronously set data in the column. Useful for large updates or when modifying multiple items in ML pipelines.

Parameters:

Name Type Description Default
index int | slice

Can be:

  • int: Single item index
  • slice: Range of indices

required
value Any

The data to store. Must match the column's data type.

required

Returns:

Name Type Description
FutureVoid FutureVoid

A FutureVoid that completes when the update is finished.

Examples:

# Async batch update
future = column.set_async(slice(0, 32), new_batch)
future.wait()

# Using with async/await
async def update_batch():
    await column.set_async(slice(0, 32), new_batch)

deeplake.ColumnView

Provides read-only access to a column in a dataset. ColumnView is designed for efficient data access in ML workflows, supporting both synchronous and asynchronous operations.

The ColumnView class allows you to:

  • Access column data using integer indices, slices, or lists of indices
  • Retrieve data asynchronously for better performance in ML pipelines
  • Access column metadata and properties
  • Get information about linked data if the column contains references

Examples:

Load image data from a column for training:

# Access a single image
image = ds["images"][0]

# Load a batch of images
batch = ds["images"][0:32]

# Async load for better performance
images_future = ds["images"].get_async(slice(0, 32))
images = images_future.result()

Access embeddings for similarity search:

# Get all embeddings
embeddings = ds["embeddings"][:]

# Get specific embeddings by indices
selected = ds["embeddings"][[1, 5, 10]]

Check column properties:

# Get column name
name = ds["images"].name

# Access metadata
if "mean" in ds["images"].metadata.keys():
    mean = ds["images"].metadata["mean"]

__getitem__

__getitem__(index: int | slice | list | tuple) -> Any

Retrieve data from the column at the specified index or range.

Parameters:

Name Type Description Default
index int | slice | list | tuple

Can be:

  • int: Single item index
  • slice: Range of indices (e.g., 0:10)
  • list/tuple: Multiple specific indices

required

Returns:

Type Description
Any

The data at the specified index/indices. Type depends on the column's data type.

Examples:

# Get single item
image = column[0]

# Get range
batch = column[0:32]

# Get specific indices
items = column[[1, 5, 10]]

__len__

__len__() -> int

Get the number of items in the column.

Returns:

Name Type Description
int int

Number of items in the column.

get_async

get_async(index: int | slice | list | tuple) -> Future

Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.

Parameters:

Name Type Description Default
index int | slice | list | tuple

Can be:

  • int: Single item index
  • slice: Range of indices
  • list/tuple: Multiple specific indices

required

Returns:

Name Type Description
Future Future

A Future object that resolves to the requested data.

Examples:

# Async batch load
future = column.get_async(slice(0, 32))
batch = future.result()

# Using with async/await
async def load_batch():
    batch = await column.get_async(slice(0, 32))
    return batch

metadata property

metadata: ReadOnlyMetadata

Access the column's metadata. Useful for storing statistics, preprocessing parameters, or other information about the column data.

Returns:

Name Type Description
ReadOnlyMetadata ReadOnlyMetadata

A ReadOnlyMetadata object for reading metadata.

Examples:

# Access preprocessing parameters
mean = column.metadata["mean"]
std = column.metadata["std"]

# Check available metadata
for key in column.metadata.keys():
    print(f"{key}: {column.metadata[key]}")

name property

name: str

Get the name of the column.

Returns:

Name Type Description
str str

The column name.

deeplake.Row

Provides mutable access to a particular row in a dataset.

__getitem__

__getitem__(column: str) -> Any

The value for the given column

__setitem__

__setitem__(column: str, value: Any) -> None

Change the value for the given column

get_async

get_async(column: str) -> Future

Asynchronously retrieves data for the specified column and returns a Future object.

Parameters:

Name Type Description Default
column str

The name of the column to retrieve data for.

required

Returns:

Name Type Description
Future Future

A Future object that will resolve to the value containing the column data.

Examples:

future = row.get_async("column_name")
column = future.result()  # Blocking call to get the result when it's ready.

Notes

  • The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
  • You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.

row_id property

row_id: int

The row_id of the row

set_async

set_async(column: str, value: Any) -> FutureVoid

Asynchronously sets a value for the specified column and returns a FutureVoid object.

Parameters:

Name Type Description Default
column str

The name of the column to update.

required
value Any

The value to set for the column.

required

Returns:

Name Type Description
FutureVoid FutureVoid

A FutureVoid object that will resolve when the operation is complete.

Examples:

future_void = row.set_async("column_name", new_value)
future_void.wait()  # Blocks until the operation is complete.

Notes

  • The method sets the value asynchronously and immediately returns a FutureVoid.
  • You can either block and wait for the operation to complete using wait() or await the FutureVoid object in an asynchronous context.

deeplake.RowView

Provides access to a particular row in a dataset.

__getitem__

__getitem__(column: str) -> Any

The value for the given column

get_async

get_async(column: str) -> Future

Asynchronously retrieves data for the specified column and returns a Future object.

Parameters:

Name Type Description Default
column str

The name of the column to retrieve data for.

required

Returns:

Name Type Description
Future Future

A Future object that will resolve to the value containing the column data.

Examples:

future = row_view.get_async("column_name")
column = future.result()  # Blocking call to get the result when it's ready.

Notes

  • The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
  • You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.

row_id property

row_id: int

The row_id of the row

deeplake.RowRange

Provides mutable access to a row range in a dataset.

__getitem__

__getitem__(column: str) -> Any

The value for the given column

__iter__

__iter__() -> Iterator[Row]

Iterate over the row range

__len__

__len__() -> int

The number of rows in the row range

__setitem__

__setitem__(column: str, value: Any) -> None

Change the value for the given column

get_async

get_async(column: str) -> Future

Asynchronously retrieves data for the specified column and returns a Future object.

Parameters:

Name Type Description Default
column str

The name of the column to retrieve data for.

required

Returns:

Name Type Description
Future Future

A Future object that will resolve to the value containing the column data.

Examples:

future = row_range.get_async("column_name")
column = future.result()  # Blocking call to get the result when it's ready.

Notes

  • The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
  • You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.

set_async

set_async(column: str, value: Any) -> FutureVoid

Asynchronously sets a value for the specified column and returns a FutureVoid object.

Parameters:

Name Type Description Default
column str

The name of the column to update.

required
value Any

The value to set for the column.

required

Returns:

Name Type Description
FutureVoid FutureVoid

A FutureVoid object that will resolve when the operation is complete.

Examples:

future_void = row_range.set_async("column_name", new_value)
future_void.wait()  # Blocks until the operation is complete.

Notes

  • The method sets the value asynchronously and immediately returns a FutureVoid.
  • You can either block and wait for the operation to complete using wait() or await the FutureVoid object in an asynchronous context.

summary

summary() -> None

Prints a summary of the RowRange.

deeplake.RowRangeView

Provides access to a row range in a dataset.

__getitem__

__getitem__(column: str) -> Any

The value for the given column

__iter__

__iter__() -> Iterator[RowView]

Iterate over the row range

__len__

__len__() -> int

The number of rows in the row range

get_async

get_async(column: str) -> Future

Asynchronously retrieves data for the specified column and returns a Future object.

Parameters:

Name Type Description Default
column str

The name of the column to retrieve data for.

required

Returns:

Name Type Description
Future Future

A Future object that will resolve to the value containing the column data.

Examples:

future = row_range_view.get_async("column_name")
column = future.result()  # Blocking call to get the result when it's ready.

Notes

  • The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
  • You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.

summary

summary() -> None

Prints a summary of the RowRangeView.

deeplake.Future

A future representing an asynchronous operation result in ML pipelines.

The Future class enables non-blocking operations for data loading and processing, particularly useful when working with large ML datasets or distributed training. Once resolved, the Future holds the operation result which can be accessed either synchronously or asynchronously.

Methods:

Name Description
result

Blocks until the Future resolves and returns the result.

__await__

Enables using the Future in async/await syntax.

is_completed

Checks if the Future has resolved without blocking.

Examples:

Loading ML dataset asynchronously:

future = deeplake.open_async("s3://ml-data/embeddings")

# Check status without blocking
if not future.is_completed():
    print("Still loading...")

# Block until ready
ds = future.result()

Using with async/await:

async def load_data():
    ds = await deeplake.open_async("s3://ml-data/images")
    batch = await ds.images.get_async(slice(0, 32))
    return batch

__await__

__await__() -> Any

Makes the Future compatible with async/await syntax.

Examples:

async def load_batch():
    batch = await ds["images"].get_async(slice(0, 32))

Returns:

Type Description
Any

typing.Any: The operation result once resolved.

is_completed

is_completed() -> bool

Checks if the Future has resolved without blocking.

Returns:

Name Type Description
bool bool

True if resolved, False if still pending.

Examples:

future = ds.query_async("SELECT * WHERE label = 'car'")
if future.is_completed():
    results = future.result()
else:
    print("Query still running...")

result

result() -> Any

Blocks until the Future resolves and returns the result.

Returns:

Type Description
Any

typing.Any: The operation result once resolved.

Examples:

future = ds["images"].get_async(slice(0, 32)) 
batch = future.result()  # Blocks until batch is loaded

deeplake.FutureVoid

A Future representing a void async operation in ML pipelines.

Similar to Future but for operations that don't return values, like saving or committing changes. Useful for non-blocking data management operations.

Methods:

Name Description
wait

Blocks until operation completes.

__await__

Enables using with async/await syntax.

is_completed

Checks completion status without blocking.

Examples:

Asynchronous dataset updates:

# Update embeddings without blocking
future = ds["embeddings"].set_async(slice(0, 32), new_embeddings)

# Do other work while update happens
process_other_data()

# Wait for update to complete
future.wait()

Using with async/await:

async def update_dataset():
    await ds.commit_async()
    print("Changes saved")

__await__

__await__() -> None

Makes the FutureVoid compatible with async/await syntax.

Examples:

async def save_changes():
    await ds.commit_async()

is_completed

is_completed() -> bool

Checks if the operation has completed without blocking.

Returns:

Name Type Description
bool bool

True if completed, False if still running.

Examples:

future = ds.commit_async()
if future.is_completed():
    print("Commit finished")
else:
    print("Commit still running...")

wait

wait() -> None

Blocks until the operation completes.

Examples:

future = ds.commit_async()
future.wait()  # Blocks until commit finishes