Dataset Classes

Deep Lake provides three dataset classes with different access levels:

Class            Description
Dataset          Full read-write access with all operations
ReadOnlyDataset  Read-only access to prevent modifications
DatasetView      Read-only view of query results

Creation Methods

deeplake.create

create(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
    schema: SchemaTemplate | None = None,
) -> Dataset

Creates a new dataset at the given URL.

To open an existing dataset, use deeplake.open

Parameters:

Name Type Description Default
url str

The URL of the dataset. URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

Either the string "ENV" or a dictionary of credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This also applies when creds is not passed for cloud datasets.
None
token str

Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

None
schema dict

The initial schema to use for the dataset. See deeplake.schemas for common starting schemas, such as deeplake.schemas.TextEmbeddings.

None

Examples:

# Create a dataset in your local filesystem:
ds = deeplake.create("directory_path")
ds.add_column("id", deeplake.types.Int32())
ds.add_column("url", deeplake.types.Text())
ds.add_column("embedding", deeplake.types.Embedding(768))
ds.commit()
ds.summary()

# Create a dataset in your app.activeloop.ai organization:
ds = deeplake.create("al://organization_id/dataset_name")

# Create a dataset stored in your cloud using specified credentials:
ds = deeplake.create("s3://mybucket/my_dataset",
    creds = {"aws_access_key_id": id, "aws_secret_access_key": key})

# Create dataset stored in your cloud using app.activeloop.ai managed credentials.
ds = deeplake.create("s3://mybucket/my_dataset",
    creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id")

ds = deeplake.create("azure://bucket/path/to/dataset")

ds = deeplake.create("gcs://bucket/path/to/dataset")

ds = deeplake.create("mem://in-memory")

Raises:

Type Description
ValueError

if a dataset already exists at the given URL

deeplake.open

open(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Dataset

Opens an existing dataset, potentially for modifying its content.

See deeplake.open_read_only for opening the dataset in read only mode

To create a new dataset, see deeplake.create

Parameters:

Name Type Description Default
url str

The URL of the dataset. URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

Either the string "ENV" or a dictionary of credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This also applies when creds is not passed for cloud datasets.
None
token str

Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

None

Examples:

# Load dataset managed by Deep Lake.
ds = deeplake.open("al://organization_id/dataset_name")

# Load dataset stored in your cloud using your own credentials.
ds = deeplake.open("s3://bucket/my_dataset",
    creds = {"aws_access_key_id": id, "aws_secret_access_key": key})

# Load dataset stored in your cloud using Deep Lake managed credentials.
ds = deeplake.open("s3://bucket/my_dataset",
    creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id")

ds = deeplake.open("s3://bucket/path/to/dataset")

ds = deeplake.open("azure://bucket/path/to/dataset")

ds = deeplake.open("gcs://bucket/path/to/dataset")

deeplake.open_read_only

open_read_only(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> ReadOnlyDataset

Opens an existing dataset in read-only mode.

See deeplake.open for opening datasets for modification.

Parameters:

Name Type Description Default
url str

The URL of the dataset.

URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

Either the string "ENV" or a dictionary of credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This also applies when creds is not passed for cloud datasets.
None
token str

Activeloop token used to authenticate the user.

None

Examples:

ds = deeplake.open_read_only("directory_path")
ds.summary()

Example Output:
Dataset length: 5
Columns:
  id       : int32
  url      : text
  embedding: embedding(768)

ds = deeplake.open_read_only("file:///path/to/dataset")

ds = deeplake.open_read_only("s3://bucket/path/to/dataset")

ds = deeplake.open_read_only("azure://bucket/path/to/dataset")

ds = deeplake.open_read_only("gcs://bucket/path/to/dataset")

ds = deeplake.open_read_only("mem://in-memory")

deeplake.like

like(
    src: DatasetView,
    dest: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Dataset

Creates a new dataset by copying the source dataset's structure to a new location.

Note

No data is copied.

Parameters:

Name Type Description Default
src DatasetView

The dataset to copy the structure from.

required
dest str

The URL to create the new dataset at.

required
creds (dict, str)

Either the string "ENV" or a dictionary of credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This also applies when creds is not passed for cloud datasets.

None
token str

Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

None

Examples:

ds = deeplake.like(src="az://bucket/existing/to/dataset",
   dest="s3://bucket/new/dataset")

Dataset Class

The main class providing full read-write access.

deeplake.Dataset

Bases: DatasetView

Datasets are the primary data structure used in Deep Lake. They are used to store and manage data for searching, training, and evaluation.

Unlike deeplake.ReadOnlyDataset, instances of Dataset can be modified.

add_column

add_column(
    name: str,
    dtype: DataType | str | Type | type | Callable,
    format: DataFormat | None = None,
) -> None

Add a new column to the dataset.

Any existing rows in the dataset will have a None value for the new column

Parameters:

Name Type Description Default
name str

The name of the column

required
dtype DataType | str | Type | type | Callable

The type of the column. Possible values include:

  • Values from deeplake.types, such as deeplake.types.Int32
  • Python types: str, int, float
  • Numpy types: such as np.int32
  • A function reference that returns one of the above types
required
format DataFormat

The format of the column, if applicable. Only required when the dtype is deeplake.types.DataType.

None

Examples:

ds.add_column("labels", deeplake.types.Int32)

ds.add_column("categories", "int32")

ds.add_column("name", deeplake.types.Text())

ds.add_column("json_data", deeplake.types.Dict())

ds.add_column("images", deeplake.types.Image(dtype=deeplake.types.UInt8(), sample_compression="jpeg"))

ds.add_column("embedding", deeplake.types.Embedding(size=768))

Raises:

Type Description
ColumnAlreadyExistsError

If a column with the same name already exists.

append

append(data: list[dict[str, Any]]) -> None
append(data: dict[str, Any]) -> None
append(data: DatasetView) -> None
append(
    data: (
        list[dict[str, Any]] | dict[str, Any] | DatasetView
    )
) -> None

Adds data to the dataset.

The data can be in a variety of formats:

  • A list of dictionaries, each value in the list is a row, with the dicts containing the column name and its value for the row.
  • A dictionary, the keys are the column names and the values are array-like (list or numpy.array) objects corresponding to the column values.
  • A DatasetView that was generated through any mechanism

Parameters:

Name Type Description Default
data list[dict[str, Any]] | dict[str, Any] | DatasetView

The data to insert into the dataset.

required

Examples:

ds.append({"name": ["Alice", "Bob"], "age": [25, 30]})

ds.append([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}])
ds2.append({
    "embedding": np.random.rand(4, 768),
    "text": ["Hello World"] * 4})

ds2.append([{"embedding": np.random.rand(768), "text": "Hello World"}] * 4)
ds2.append(deeplake.from_parquet("./file.parquet"))

Raises:

Type Description
ColumnMissingAppendValueError

If any column is missing from the input data.

UnevenColumnsError

If the input data columns are not the same length.

InvalidTypeDimensions

If the input data does not match the column's dimensions.

commit

commit(message: str | None = None) -> None

Atomically commits changes you have made to the dataset. After commit, other users will see your changes to the dataset.

Parameters:

Name Type Description Default
message str

A message to store in history describing the changes made in the version

None

Examples:

ds.commit()

ds.commit("Added data from updated documents")

commit_async

commit_async(message: str | None = None) -> FutureVoid

Asynchronously commits changes you have made to the dataset.

See deeplake.Dataset.commit for more information.

Parameters:

Name Type Description Default
message str

A message to store in history describing the changes made in the commit

None

Examples:

ds.commit_async().wait()

ds.commit_async("Added data from updated documents").wait()

async def do_commit():
    await ds.commit_async()

future = ds.commit_async() # then you can check if the future is completed using future.is_completed()

tag

tag(name: str, version: str | None = None) -> Tag

Tags a version of the dataset. If no version is given, the current version is tagged.

Parameters:

Name Type Description Default
name str

The name of the tag

required
version str | None

The version of the dataset to tag

None
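
For example (the second call assumes the version argument accepts a version id string such as ds.version):

ds.tag("v1.0")                          # tag the current version
ds.tag("baseline", version=ds.version)  # tag an explicit version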

push

push(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Pushes any new history from this dataset to the dataset at the given URL.

Similar to deeplake.Dataset.pull but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the destination dataset

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None
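
A minimal sketch; the destination URL and credential values are placeholders:

ds.push("s3://backup-bucket/my_dataset",
    creds={"aws_access_key_id": id, "aws_secret_access_key": key})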

pull

pull(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Pulls any new history from the dataset at the given URL into this dataset.

Similar to deeplake.Dataset.push but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the source dataset to pull history from

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None
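
A minimal sketch; the source URL is a placeholder:

ds.pull("al://organization_id/dataset_name")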

query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

result = ds.query("select * where category == 'active'")
for row in result:
    print("Id is: ", row["id"])

query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.

Examples:

future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])

metadata property

metadata: Metadata

The metadata of the dataset.
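
A short sketch, assuming Metadata supports dict-style reads and writes; the key name is illustrative:

ds.metadata["source"] = "ingestion-pipeline-v2"
print(ds.metadata["source"])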

description instance-attribute

description: str

The description of the dataset. Setting the value will immediately persist the change without requiring a commit().
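
For example, the following takes effect immediately, without a commit():

ds.description = "Product images with 768-dimensional embeddings"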

version property

version: str

The currently checked out version of the dataset

history property

history: History

This dataset's version history
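
A small sketch; printing the version is direct, and the loop assumes History is iterable with entries that expose an id:

print(ds.version)      # current version id
for v in ds.history:   # assumed iteration API
    print(v.id)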

schema property

schema: Schema

The schema of the dataset.
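
A quick way to inspect the schema; the column iteration assumes Schema exposes a columns list with name and dtype attributes:

print(ds.schema)
for col in ds.schema.columns:
    print(col.name, col.dtype)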

indexing_mode instance-attribute

indexing_mode: IndexingMode

The indexing mode of the dataset. This property can be set to change the indexing mode of the dataset for the current session, other sessions will not be affected.


Examples:

ds = deeplake.open("tmp://")
ds.indexing_mode = deeplake.IndexingMode.Automatic
ds.commit()

ReadOnlyDataset Class

Read-only version of Dataset. Cannot modify data but provides access to all data and metadata.

deeplake.ReadOnlyDataset

Bases: DatasetView

query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

result = ds.query("select * where category == 'active'")
for row in result:
    print("Id is: ", row["id"])

query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.

Examples:

future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])

metadata property

metadata: ReadOnlyMetadata

The metadata of the dataset.

version property

version: str

The currently checked out version of the dataset

history property

history: History

The history of the overall dataset configuration.

schema property

schema: SchemaView

The schema of the dataset.

DatasetView Class

Lightweight view returned by queries. Provides read-only access to query results.

deeplake.DatasetView

A DatasetView is a dataset-like structure. It has a defined schema and contains data which can be queried.

query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

result = ds.query("select * where category == 'active'")
for row in result:
    print("Id is: ", row["id"])

query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.

Examples:

future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])

tensorflow

tensorflow() -> Any

Returns a TensorFlow tensorflow.data.Dataset wrapper around this DatasetView.

Raises:

Type Description
ImportError

If TensorFlow is not installed

Examples:

dl = ds.tensorflow().shuffle(500).batch(32)
for i_batch, sample_batched in enumerate(dl):
     process_batch(sample_batched)

pytorch

pytorch(transform: Callable[[Any], Any] = None)

Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.

By default, no transformations are applied and each row is returned as a dict keyed by column name.

Parameters:

Name Type Description Default
transform Callable[[Any], Any]

A custom function to apply to each sample before returning it

None

Raises:

Type Description
ImportError

If pytorch is not installed

Examples:

from torch.utils.data import DataLoader

dl = DataLoader(ds.pytorch(), batch_size=60,
                            shuffle=True, num_workers=8)
for i_batch, sample_batched in enumerate(dl):
    process_batch(sample_batched)
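
The transform parameter is not shown above; a hedged sketch that converts an assumed "images" column to a normalized float tensor (column names are placeholders):

import torch
from torch.utils.data import DataLoader

def to_tensor(sample):
    # "images" and "labels" are placeholder column names
    return {
        "image": torch.from_numpy(sample["images"]).float() / 255.0,
        "label": sample["labels"],
    }

dl = DataLoader(ds.pytorch(transform=to_tensor), batch_size=32, shuffle=True)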

summary

summary() -> None

Prints a summary of the dataset.

Examples:

ds.summary()

Class Comparison

Dataset

  • Full read-write access
  • Can create/modify columns
  • Can append/update data
  • Can commit changes
  • Can create version tags
  • Can push/pull changes
ds = deeplake.create("s3://bucket/dataset")
# or
ds = deeplake.open("s3://bucket/dataset")

# Can modify
ds.add_column("images", deeplake.types.Image())
ds.add_column("labels", deeplake.types.ClassLabel("int32"))
ds.add_column("confidence", "float32")
ds["labels"].metadata["class_names"] = ["cat", "dog"]   
ds.append([{"images": image_array, "labels": 0, "confidence": 0.9}])
ds.commit()

ReadOnlyDataset

  • Read-only access
  • Cannot modify data or schema
  • Can view all data and metadata
  • Can execute queries
  • Returned by open_read_only()
ds = deeplake.open_read_only("s3://bucket/dataset")

# Can read
image = ds["images"][0]
metadata = ds.metadata

# Cannot modify
# ds.append([...])  # Would raise error

DatasetView

  • Read-only access
  • Cannot modify data
  • Optimized for query results
  • Direct integration with ML frameworks
  • Returned by query()
# Get view through query
view = ds.query("SELECT *")

# Access data
image = view["images"][0]

# ML framework integration
torch_dataset = view.pytorch()
tf_dataset = view.tensorflow()

Examples

Querying Data

# Using Dataset
ds = deeplake.open("s3://bucket/dataset")
results = ds.query("SELECT * WHERE labels = 'cat'")

# Using ReadOnlyDataset
ds = deeplake.open_read_only("s3://bucket/dataset")
results = ds.query("SELECT * WHERE labels = 'cat'")

# Using DatasetView
view = ds.query("SELECT * WHERE labels = 'cat'")
subset = view.query("SELECT * WHERE confidence > 0.9")

Data Access

# Common access patterns work on all types
for row in ds:  # Works for Dataset, ReadOnlyDataset, and DatasetView
    image = row["images"]
    label = row["labels"]

# Column access works on all types
images = ds["images"][:]
labels = ds["labels"][:]

Async Operations

# Async query works on all types
future = ds.query_async("SELECT * WHERE labels = 'cat'")
results = future.result()

# Async data access
future = ds["images"].get_async(slice(0, 1000))
images = future.result()