Dataset Classes

Deep Lake provides three dataset classes with different access levels:

Class            Description
Dataset          Full read-write access with all operations
ReadOnlyDataset  Read-only access to prevent modifications
DatasetView      Read-only view of query results

Creation Methods

deeplake.create

create(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
    schema: SchemaTemplate | None = None,
) -> Dataset

Creates a new dataset at the given URL.

To open an existing dataset, use deeplake.open

Parameters:

Name Type Description Default
url str

The URL of the dataset. URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

Either the string "ENV" or a dictionary of credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This also applies when creds is not passed for cloud datasets.
None
token str

Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

None
schema dict

The initial schema to use for the dataset. See deeplake.schemas for common starting schemas, such as deeplake.schemas.TextEmbeddings.

None

Examples:

# Create a dataset in your local filesystem:
ds = deeplake.create("directory_path")
ds.add_column("id", deeplake.types.Int32())
ds.add_column("url", deeplake.types.Text())
ds.add_column("embedding", deeplake.types.Embedding(768))
ds.commit()
ds.summary()

# Create a dataset in your app.activeloop.ai organization:
ds = deeplake.create("al://organization_id/dataset_name")

# Create a dataset stored in your cloud using specified credentials:
ds = deeplake.create("s3://mybucket/my_dataset",
    creds = {"aws_access_key_id": id, "aws_secret_access_key": key})

# Create dataset stored in your cloud using app.activeloop.ai managed credentials.
ds = deeplake.create("s3://mybucket/my_dataset",
    creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id")

ds = deeplake.create("azure://bucket/path/to/dataset")

ds = deeplake.create("gcs://bucket/path/to/dataset")

ds = deeplake.create("mem://in-memory")

Raises:

Type Description
ValueError

if a dataset already exists at the given URL

deeplake.open

open(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Dataset

Opens an existing dataset, potentially for modifying its content.

See deeplake.open_read_only for opening the dataset in read only mode

To create a new dataset, see deeplake.create

Parameters:

Name Type Description Default
url str

The URL of the dataset. URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

Either the string "ENV" or a dictionary of credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This also applies when creds is not passed for cloud datasets.
None
token str

Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

None

Examples:

# Load dataset managed by Deep Lake.
ds = deeplake.open("al://organization_id/dataset_name")

# Load dataset stored in your cloud using your own credentials.
ds = deeplake.open("s3://bucket/my_dataset",
    creds = {"aws_access_key_id": id, "aws_secret_access_key": key})

# Load dataset stored in your cloud using Deep Lake managed credentials.
ds = deeplake.open("s3://bucket/my_dataset",
    creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id")

ds = deeplake.open("s3://bucket/path/to/dataset")

ds = deeplake.open("azure://bucket/path/to/dataset")

ds = deeplake.open("gcs://bucket/path/to/dataset")

deeplake.open_read_only

open_read_only(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> ReadOnlyDataset

Opens an existing dataset in read-only mode.

See deeplake.open for opening datasets for modification.

Parameters:

Name Type Description Default
url str

The URL of the dataset.

URLs can be specified using the following protocols:

  • file://path local filesystem storage
  • al://org_id/dataset_name A dataset on app.activeloop.ai
  • azure://bucket/path or az://bucket/path Azure storage
  • gs://bucket/path or gcs://bucket/path or gcp://bucket/path Google Cloud storage
  • s3://bucket/path S3 storage
  • mem://name In-memory storage that lasts the life of the process

A URL without a protocol is assumed to be a file:// URL

required
creds (dict, str)

Either the string "ENV" or a dictionary of credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This also applies when creds is not passed for cloud datasets.
None
token str

Activeloop token used to authenticate the user.

None

Examples:

ds = deeplake.open_read_only("directory_path")
ds.summary()

Example Output:
Dataset length: 5
Columns:
  id       : int32
  url      : text
  embedding: embedding(768)

ds = deeplake.open_read_only("file:///path/to/dataset")

ds = deeplake.open_read_only("s3://bucket/path/to/dataset")

ds = deeplake.open_read_only("azure://bucket/path/to/dataset")

ds = deeplake.open_read_only("gcs://bucket/path/to/dataset")

ds = deeplake.open_read_only("mem://in-memory")

deeplake.like

like(
    src: DatasetView,
    dest: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Dataset

Creates a new dataset by copying the source dataset's structure to a new location.

Note

No data is copied.

Parameters:

Name Type Description Default
src DatasetView

The dataset to copy the structure from.

required
dest str

The URL to create the new dataset at.

required
creds (dict, str)

Either the string "ENV" or a dictionary of credentials used to access the dataset at the path.

  • If 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token' are present, they take precedence over credentials in the environment or in a credentials file. Currently this only works with s3 paths.
  • Supported keys are 'aws_access_key_id', 'aws_secret_access_key', 'aws_session_token', 'endpoint_url', 'aws_region', 'profile_name'.
  • To use credentials managed in your Activeloop organization, use the key 'creds_key': 'managed_key_name'. This requires the org_id dataset argument to be set.
  • If nothing is given, credentials are fetched from environment variables. This also applies when creds is not passed for cloud datasets.

None
token str

Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated.

None

Examples:

ds = deeplake.like(src="az://bucket/existing/to/dataset",
   dest="s3://bucket/new/dataset")

Dataset Class

The main class providing full read-write access.

deeplake.Dataset

Bases: DatasetView

Datasets are the primary data structure used in Deep Lake. They are used to store and manage data for searching, training, and evaluation.

Unlike deeplake.ReadOnlyDataset, instances of Dataset can be modified.

add_column

add_column(
    name: str,
    dtype: DataType | str | Type | type | Callable,
    format: DataFormat | None = None,
) -> None

Add a new column to the dataset.

Any existing rows in the dataset will have a None value for the new column

Parameters:

Name Type Description Default
name str

The name of the column

required
dtype DataType | str | Type | type | Callable

The type of the column. Possible values include:

  • Values from deeplake.types, such as deeplake.types.Int32
  • Python types: str, int, float
  • Numpy types: such as np.int32
  • A function reference that returns one of the above types
required
format DataFormat

The format of the column, if applicable. Only required when the dtype is deeplake.types.DataType.

None

Examples:

ds.add_column("labels", deeplake.types.Int32)

ds.add_column("categories", "int32")

ds.add_column("name", deeplake.types.Text())

ds.add_column("json_data", deeplake.types.Dict())

ds.add_column("images", deeplake.types.Image(dtype=deeplake.types.UInt8(), sample_compression="jpeg"))

ds.add_column("embedding", deeplake.types.Embedding(size=768))

Raises:

Type Description
ColumnAlreadyExistsError

If a column with the same name already exists.

append

append(data: list[dict[str, Any]]) -> None
append(data: dict[str, Any]) -> None
append(data: DatasetView) -> None
append(
    data: (
        list[dict[str, Any]] | dict[str, Any] | DatasetView
    )
) -> None

Adds data to the dataset.

The data can be in a variety of formats:

  • A list of dictionaries, each value in the list is a row, with the dicts containing the column name and its value for the row.
  • A dictionary, the keys are the column names and the values are array-like (list or numpy.array) objects corresponding to the column values.
  • A DatasetView that was generated through any mechanism

Parameters:

Name Type Description Default
data list[dict[str, Any]] | dict[str, Any] | DatasetView

The data to insert into the dataset.

required

Examples:

ds.append({"name": ["Alice", "Bob"], "age": [25, 30]})

ds.append([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}])
ds2.append({
    "embedding": np.random.rand(4, 768),
    "text": ["Hello World"] * 4})

ds2.append([{"embedding": np.random.rand(768), "text": "Hello World"}] * 4)
ds2.append(deeplake.from_parquet("./file.parquet"))

Raises:

Type Description
ColumnMissingAppendValueError

If any column is missing from the input data.

UnevenColumnsError

If the input data columns are not the same length.

InvalidTypeDimensions

If the input data does not match the column's dimensions.

commit

commit(message: str | None = None) -> None

Atomically commits changes you have made to the dataset. After commit, other users will see your changes to the dataset.

Parameters:

Name Type Description Default
message str

A message to store in history describing the changes made in the version

None

Examples:

ds.commit()

ds.commit("Added data from updated documents")

commit_async

commit_async(message: str | None = None) -> FutureVoid

Asynchronously commits changes you have made to the dataset.

See deeplake.Dataset.commit for more information.

Parameters:

Name Type Description Default
message str

A message to store in history describing the changes made in the commit

None

Examples:

ds.commit_async().wait()

ds.commit_async("Added data from updated documents").wait()

async def do_commit():
    await ds.commit_async()

future = ds.commit_async() # then you can check if the future is completed using future.is_completed()

tag

tag(name: str, version: str | None = None) -> Tag

Tags a version of the dataset. If no version is given, the current version is tagged.

Parameters:

Name Type Description Default
name str

The name of the tag

required
version str | None

The version of the dataset to tag

None
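
For example (the second call assumes the version argument accepts a version id string such as ds.version):

ds.tag("v1.0")                          # tag the current version
ds.tag("baseline", version=ds.version)  # tag an explicit version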

push

push(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Pushes any new history from this dataset to the dataset at the given URL.

Similar to deeplake.Dataset.pull but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the destination dataset

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None
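
A minimal sketch; the destination URL and credential values are placeholders:

ds.push("s3://backup-bucket/my_dataset",
    creds={"aws_access_key_id": id, "aws_secret_access_key": key})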

pull

pull(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> None

Pulls any new history from the dataset at the given URL into this dataset.

Similar to deeplake.Dataset.push but the other direction.

Parameters:

Name Type Description Default
url str

The URL of the source dataset to pull history from

required
creds dict[str, str] | None

Optional credentials needed to connect to the dataset

None
token str | None

Optional deeplake token

None
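
A minimal sketch; the source URL is a placeholder:

ds.pull("al://organization_id/dataset_name")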

query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

result = ds.query("select * where category == 'active'")
for row in result:
    print("Id is: ", row["id"])

query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.

Examples:

future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])

metadata property

metadata: Metadata

The metadata of the dataset.
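
A short sketch, assuming Metadata supports dict-style reads and writes; the key name is illustrative:

ds.metadata["source"] = "ingestion-pipeline-v2"
print(ds.metadata["source"])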

description instance-attribute

description: str

The description of the dataset. Setting the value will immediately persist the change without requiring a commit().
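
For example, the following takes effect immediately, without a commit():

ds.description = "Product images with 768-dimensional embeddings"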

version property

version: str

The currently checked out version of the dataset

history property

history: History

This dataset's version history
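
A small sketch; printing the version is direct, and the loop assumes History is iterable with entries that expose an id:

print(ds.version)      # current version id
for v in ds.history:   # assumed iteration API
    print(v.id)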

schema property

schema: Schema

The schema of the dataset.
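
A quick way to inspect the schema; the column iteration assumes Schema exposes a columns list with name and dtype attributes:

print(ds.schema)
for col in ds.schema.columns:
    print(col.name, col.dtype)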

indexing_mode instance-attribute

indexing_mode: IndexingMode

The indexing mode of the dataset. This property can be set to change the indexing mode of the dataset for the current session, other sessions will not be affected.


Examples:

ds = deeplake.open("tmp://")
ds.indexing_mode = deeplake.IndexingMode.Automatic
ds.commit()

ReadOnlyDataset Class

Read-only version of Dataset. Cannot modify data but provides access to all data and metadata.

deeplake.ReadOnlyDataset

Bases: DatasetView

query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

result = ds.query("select * where category == 'active'")
for row in result:
    print("Id is: ", row["id"])

query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.

Examples:

future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])

metadata property

metadata: ReadOnlyMetadata

The metadata of the dataset.

version property

version: str

The currently checked out version of the dataset

history property

history: History

The history of the overall dataset configuration.

schema property

schema: SchemaView

The schema of the dataset.

DatasetView Class

Lightweight view returned by queries. Provides read-only access to query results.

deeplake.DatasetView

A DatasetView is a dataset-like structure. It has a defined schema and contains data which can be queried.

query

query(query: str) -> DatasetView

Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.

Examples:

result = ds.query("select * where category == 'active'")
for row in result:
    print("Id is: ", row["id"])

query_async

query_async(query: str) -> Future

Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.

Examples:

future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])

tensorflow

tensorflow() -> Any

Returns a TensorFlow tensorflow.data.Dataset wrapper around this DatasetView.

Raises:

Type Description
ImportError

If TensorFlow is not installed

Examples:

dl = ds.tensorflow().shuffle(500).batch(32)
for i_batch, sample_batched in enumerate(dl):
     process_batch(sample_batched)

pytorch

pytorch(transform: Callable[[Any], Any] = None)

Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.

By default, no transformations are applied and each row is returned as a dict keyed by column name.

Parameters:

Name Type Description Default
transform Callable[[Any], Any]

A custom function to apply to each sample before returning it

None

Raises:

Type Description
ImportError

If pytorch is not installed

Examples:

from torch.utils.data import DataLoader

dl = DataLoader(ds.pytorch(), batch_size=60,
                            shuffle=True, num_workers=8)
for i_batch, sample_batched in enumerate(dl):
    process_batch(sample_batched)
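
The transform parameter is not shown above; a hedged sketch that converts an assumed "images" column to a normalized float tensor (column names are placeholders):

import torch
from torch.utils.data import DataLoader

def to_tensor(sample):
    # "images" and "labels" are placeholder column names
    return {
        "image": torch.from_numpy(sample["images"]).float() / 255.0,
        "label": sample["labels"],
    }

dl = DataLoader(ds.pytorch(transform=to_tensor), batch_size=32, shuffle=True)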

summary

summary() -> None

Prints a summary of the dataset.

Examples:

ds.summary()

Class Comparison

Dataset

  • Full read-write access
  • Can create/modify columns
  • Can append/update data
  • Can commit changes
  • Can create version tags
  • Can push/pull changes
ds = deeplake.create("s3://bucket/dataset")
# or
ds = deeplake.open("s3://bucket/dataset")

# Can modify
ds.add_column("images", deeplake.types.Image())
ds.add_column("labels", deeplake.types.ClassLabel("int32"))
ds.add_column("confidence", "float32")
ds["labels"].metadata["class_names"] = ["cat", "dog"]   
ds.append([{"images": image_array, "labels": 0, "confidence": 0.9}])
ds.commit()

ReadOnlyDataset

  • Read-only access
  • Cannot modify data or schema
  • Can view all data and metadata
  • Can execute queries
  • Returned by open_read_only()
ds = deeplake.open_read_only("s3://bucket/dataset")

# Can read
image = ds["images"][0]
metadata = ds.metadata

# Cannot modify
# ds.append([...])  # Would raise error

DatasetView

  • Read-only access
  • Cannot modify data
  • Optimized for query results
  • Direct integration with ML frameworks
  • Returned by query()
# Get view through query
view = ds.query("SELECT *")

# Access data
image = view["images"][0]

# ML framework integration
torch_dataset = view.pytorch()
tf_dataset = view.tensorflow()

Examples

Querying Data

# Using Dataset
ds = deeplake.open("s3://bucket/dataset")
results = ds.query("SELECT * WHERE labels = 'cat'")

# Using ReadOnlyDataset
ds = deeplake.open_read_only("s3://bucket/dataset")
results = ds.query("SELECT * WHERE labels = 'cat'")

# Using DatasetView
view = ds.query("SELECT * WHERE labels = 'cat'")
subset = view.query("SELECT * WHERE confidence > 0.9")

Data Access

# Common access patterns work on all types
for row in ds:  # Works for Dataset, ReadOnlyDataset, and DatasetView
    image = row["images"]
    label = row["labels"]

# Column access works on all types
images = ds["images"][:]
labels = ds["labels"][:]

Async Operations

# Async query works on all types
future = ds.query_async("SELECT * WHERE labels = 'cat'")
results = future.result()

# Async data access
future = ds["images"].get_async(slice(0, 1000))
images = future.result()