Dataset APIs
Dataset Management
Datasets can be created, loaded, and managed through static factory methods in the deeplake
module.
deeplake.create
create(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
schema: SchemaTemplate | None = None,
) -> Dataset
Creates a new dataset at the given URL.
To open an existing dataset, use deeplake.open.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. URLs can use the file://, al://, s3://, azure://, gcs://, or mem:// protocols (see the examples below). A URL without a protocol is assumed to be a file:// URL. | required |
creds | dict[str, str] | Credentials used to access the dataset storage, such as cloud provider keys or a managed creds_key (see the examples below). | None |
token | str | Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | None |
schema | SchemaTemplate | The initial schema to use for the dataset. | None |
Examples:
# Create a dataset in your local filesystem:
ds = deeplake.create("directory_path")
ds.add_column("id", types.Int32())
ds.add_column("url", types.Text())
ds.add_column("embedding", types.Embedding(768))
ds.commit()
ds.summary()

# Create a dataset in your app.activeloop.ai organization:
ds = deeplake.create("al://organization_id/dataset_name")

# Create a dataset stored in your cloud using specified credentials:
ds = deeplake.create("s3://mybucket/my_dataset",
                     creds={"aws_access_key_id": id, "aws_secret_access_key": key})

# Create a dataset stored in your cloud using app.activeloop.ai managed credentials:
ds = deeplake.create("s3://mybucket/my_dataset",
                     creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

ds = deeplake.create("azure://bucket/path/to/dataset")
ds = deeplake.create("gcs://bucket/path/to/dataset")
ds = deeplake.create("mem://in-memory")
Raises:
Type | Description |
---|---|
ValueError | If a dataset already exists at the given URL. |
deeplake.create_async
create_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
schema: SchemaTemplate | None = None,
) -> Future
Asynchronously creates a new dataset at the given URL.
See deeplake.create for more information.
To open an existing dataset, use deeplake.open_async.
Examples:
async def create_dataset():
    # Asynchronously create a dataset in your local filesystem:
    ds = await deeplake.create_async("directory_path")
    await ds.add_column("id", types.Int32())
    await ds.add_column("url", types.Text())
    await ds.add_column("embedding", types.Embedding(768))
    await ds.commit()
    await ds.summary()  # Example of usage in an async context

# Alternatively, create a dataset using .result().
future_ds = deeplake.create_async("directory_path")
ds = future_ds.result()  # Blocks until the dataset is created

# Create a dataset in your app.activeloop.ai organization:
ds = await deeplake.create_async("al://organization_id/dataset_name")

# Create a dataset stored in your cloud using specified credentials:
ds = await deeplake.create_async("s3://mybucket/my_dataset",
                                 creds={"aws_access_key_id": id, "aws_secret_access_key": key})

# Create a dataset stored in your cloud using app.activeloop.ai managed credentials:
ds = await deeplake.create_async("s3://mybucket/my_dataset",
                                 creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

ds = await deeplake.create_async("azure://bucket/path/to/dataset")
ds = await deeplake.create_async("gcs://bucket/path/to/dataset")
ds = await deeplake.create_async("mem://in-memory")
Raises:
Type | Description |
---|---|
ValueError | If a dataset already exists at the given URL (raised when the future is awaited). |
deeplake.open
open(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> Dataset
Opens an existing dataset, potentially for modifying its content.
See deeplake.open_read_only for opening the dataset in read-only mode.
To create a new dataset, use deeplake.create.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. URLs can use the file://, al://, s3://, azure://, gcs://, or mem:// protocols (see the examples below). A URL without a protocol is assumed to be a file:// URL. | required |
creds | dict[str, str] | Credentials used to access the dataset storage, such as cloud provider keys or a managed creds_key (see the examples below). | None |
token | str | Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | None |
Examples:
# Load dataset managed by Deep Lake.
ds = deeplake.open("al://organization_id/dataset_name")

# Load dataset stored in your cloud using your own credentials.
ds = deeplake.open("s3://bucket/my_dataset",
                   creds={"aws_access_key_id": id, "aws_secret_access_key": key})

# Load dataset stored in your cloud using Deep Lake managed credentials.
ds = deeplake.open("s3://bucket/my_dataset",
                   creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

ds = deeplake.open("s3://bucket/path/to/dataset")
ds = deeplake.open("azure://bucket/path/to/dataset")
ds = deeplake.open("gcs://bucket/path/to/dataset")
deeplake.open_async
open_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> Future
Asynchronously opens an existing dataset, potentially for modifying its content.
See deeplake.open for opening the dataset synchronously.
Examples:
async def async_open():
    # Asynchronously load dataset managed by Deep Lake using await.
    ds = await deeplake.open_async("al://organization_id/dataset_name")

    # Asynchronously load dataset stored in your cloud using your own credentials.
    ds = await deeplake.open_async("s3://bucket/my_dataset",
                                   creds={"aws_access_key_id": id, "aws_secret_access_key": key})

    # Asynchronously load dataset stored in your cloud using Deep Lake managed credentials.
    ds = await deeplake.open_async("s3://bucket/my_dataset",
                                   creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

    ds = await deeplake.open_async("s3://bucket/path/to/dataset")
    ds = await deeplake.open_async("azure://bucket/path/to/dataset")
    ds = await deeplake.open_async("gcs://bucket/path/to/dataset")

# Alternatively, load the dataset using .result().
future_ds = deeplake.open_async("al://organization_id/dataset_name")
ds = future_ds.result()  # Blocks until the dataset is loaded
deeplake.open_read_only
open_read_only(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> ReadOnlyDataset
Opens an existing dataset in read-only mode.
See deeplake.open for opening datasets for modification.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. URLs can use the file://, al://, s3://, azure://, gcs://, or mem:// protocols (see the examples below). A URL without a protocol is assumed to be a file:// URL. | required |
creds | dict[str, str] | Credentials used to access the dataset storage, such as cloud provider keys or a managed creds_key. | None |
token | str | Activeloop token to authenticate the user. | None |
Examples:
ds = deeplake.open_read_only("directory_path")
ds.summary()
Example Output:
Dataset length: 5
Columns:
id : int32
url : text
embedding: embedding(768)
ds = deeplake.open_read_only("file:///path/to/dataset")
ds = deeplake.open_read_only("s3://bucket/path/to/dataset")
ds = deeplake.open_read_only("azure://bucket/path/to/dataset")
ds = deeplake.open_read_only("gcs://bucket/path/to/dataset")
ds = deeplake.open_read_only("mem://in-memory")
deeplake.open_read_only_async
open_read_only_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> Future
Asynchronously opens an existing dataset in read-only mode.
See deeplake.open_async for opening datasets for modification and deeplake.open_read_only for sync open.
Examples:
# Asynchronously open a dataset in read-only mode:
ds = await deeplake.open_read_only_async("directory_path")
# Alternatively, open the dataset using .result().
future_ds = deeplake.open_read_only_async("directory_path")
ds = future_ds.result() # Blocks until the dataset is loaded
ds = await deeplake.open_read_only_async("file:///path/to/dataset")
ds = await deeplake.open_read_only_async("s3://bucket/path/to/dataset")
ds = await deeplake.open_read_only_async("azure://bucket/path/to/dataset")
ds = await deeplake.open_read_only_async("gcs://bucket/path/to/dataset")
ds = await deeplake.open_read_only_async("mem://in-memory")
deeplake.delete
Deletes an existing dataset.
Warning
This operation is irreversible. All data will be lost.
If concurrent processes are attempting to write to the dataset while it's being deleted, it may lead to data inconsistency. It's recommended to use this operation with caution.
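A minimal sketch of usage (the URL is a placeholder; pass credentials the same way as for deeplake.open if the storage requires them):
```python
# Irreversibly delete the dataset at the given URL.
deeplake.delete("al://my_org/old_dataset")
```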
deeplake.copy
copy(
src: str,
dst: str,
src_creds: dict[str, str] | None = None,
dst_creds: dict[str, str] | None = None,
token: str | None = None,
) -> None
Copies the dataset at the source URL to the destination URL.
NOTE: Currently private due to potential issues in file timestamp handling
Parameters:
Name | Type | Description | Default |
---|---|---|---|
src | str | The URL of the source dataset. | required |
dst | str | The URL of the destination dataset. | required |
src_creds | dict[str, str] | Credentials used to access the source dataset storage. | None |
dst_creds | dict[str, str] | Credentials used to access the destination dataset storage. | None |
token | str | Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | None |
Examples:
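An illustrative sketch; the URLs and credential values are placeholders:
```python
# Copy a local dataset into S3, supplying credentials for the destination only.
deeplake.copy(
    "/path/to/local_dataset",
    "s3://bucket/copied_dataset",
    dst_creds={"aws_access_key_id": id, "aws_secret_access_key": key},
)
```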
deeplake.like
like(
src: DatasetView,
dest: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> Dataset
Creates a new dataset by copying the source dataset's structure to a new location.
Note
No data is copied.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
src | DatasetView | The dataset to copy the structure from. | required |
dest | str | The URL to create the new dataset at. | required |
creds | dict[str, str] | Credentials used to access the destination storage. | None |
token | str | Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | None |
Examples:
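An illustrative sketch; ds is an already opened dataset and the destination URL is a placeholder:
```python
# Create an empty dataset with the same schema as ds at a new location.
new_ds = deeplake.like(ds, "s3://bucket/empty_clone")
new_ds.summary()  # Same columns as ds, but no rows copied
```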
deeplake.from_parquet
from_parquet(url: str) -> ReadOnlyDataset
Opens a Parquet dataset in the deeplake format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the Parquet dataset. If no protocol is specified, it is assumed to be a file:// URL. | required |
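Illustrative usage (the path is a placeholder):
```python
# Open a Parquet file as a read-only dataset in the Deep Lake format.
ds = deeplake.from_parquet("path/to/data.parquet")
ds.summary()
```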
deeplake.connect
connect(
src: str,
dest: str | None = None,
org_id: str | None = None,
creds_key: str | None = None,
token: str | None = None,
) -> Dataset
Connects an existing dataset to your app.activeloop.ai account.
Either dest or org_id is required, but not both.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
src | str | The URL of the existing dataset. | required |
dest | str | Desired Activeloop URL for the dataset entry, e.g. al://my_org/dataset. | None |
org_id | str | The id of the organization to store the dataset under. The dataset name will be based on the source dataset's name. | None |
creds_key | str | The creds_key of the managed credentials that will be used to access the source path. If not set, the organization's default credentials are used. | None |
token | str | Activeloop token used to fetch the managed credentials. | None |
Examples:
```python
ds = deeplake.connect("s3://bucket/path/to/dataset", "al://my_org/dataset")
ds = deeplake.connect("s3://bucket/path/to/dataset", "al://my_org/dataset", creds_key="my_key")

# Connect the dataset as al://my_org/dataset
ds = deeplake.connect("s3://bucket/path/to/dataset", org_id="my_org")

ds = deeplake.connect("az://bucket/path/to/dataset", "al://my_org/dataset", creds_key="my_key")
ds = deeplake.connect("gcs://bucket/path/to/dataset", "al://my_org/dataset", creds_key="my_key")
```
deeplake.disconnect
Disconnects the dataset from your Activeloop account.
See deeplake.connect.
Note
Does not delete the stored data; it only removes the connection from the Activeloop organization.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. | required |
token | str | Activeloop token to authenticate the user. | None |
Examples:
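An illustrative sketch; the URL is a placeholder:
```python
# Remove the Activeloop entry for the dataset; the underlying stored data is untouched.
deeplake.disconnect("al://my_org/dataset")
```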
deeplake.convert
convert(
src: str,
dst: str,
dst_creds: Optional[Dict[str, str]] = None,
token: Optional[str] = None,
) -> None
Copies the v3 dataset at src into a new dataset in the new v4 format.
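A minimal sketch, assuming src points at an existing v3 dataset and dst is a new location; the URLs and credential values are placeholders:
```python
# Convert a Deep Lake v3 dataset into a new v4 dataset.
deeplake.convert("s3://bucket/v3_dataset", "s3://bucket/v4_dataset",
                 dst_creds={"aws_access_key_id": id, "aws_secret_access_key": key})
```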
deeplake.Dataset
Bases: DatasetView
Datasets are the primary data structure used in Deep Lake. They are used to store and manage data for searching, training, and evaluation.
Unlike deeplake.ReadOnlyDataset, instances of Dataset can be modified.
__getitem__
__getitem__(offset: int) -> Row
__getitem__(range: slice) -> RowRange
__getitem__(indices: list) -> RowRange
__getitem__(indices: tuple) -> RowRange
__getitem__(column: str) -> Column
Returns a subset of data from the Dataset.
The result depends on the type of value passed to the [] operator.
- int: The zero-based offset of the single row to return. Returns a deeplake.Row.
- slice: A slice specifying the range of rows to return. Returns a deeplake.RowRange.
- list: A list of indices specifying the rows to return. Returns a deeplake.RowRange.
- tuple: A tuple of indices specifying the rows to return. Returns a deeplake.RowRange.
- str: A string specifying the column to return all values from. Returns a deeplake.Column.
Examples:
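Illustrative usage of the different index types (column names are hypothetical):
```python
row = ds[0]               # deeplake.Row for the first row
rows = ds[0:100]          # deeplake.RowRange for rows 0..99
picked = ds[[1, 5, 10]]   # deeplake.RowRange for specific indices
column = ds["embedding"]  # deeplake.Column with every value in the column
```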
__getstate__
Returns a dict that can be pickled and used to restore this dataset.
Note
Pickling a dataset does not copy the dataset, it only saves attributes that can be used to restore the dataset.
__iter__
__iter__() -> Iterator[Row]
__setstate__
Restores dataset from a pickled state.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
arg0 | dict | The pickled state used to restore the dataset. | required |
add_column
add_column(
name: str,
dtype: DataType | str | Type | type | Callable,
format: DataFormat | None = None,
) -> None
Add a new column to the dataset.
Any existing rows in the dataset will have a None value for the new column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the column. | required |
dtype | DataType, str, Type, type, or Callable | The type of the column. See deeplake.types and the examples below for possible values. | required |
format | DataFormat | The format of the column, if applicable. Only required when the dtype is deeplake.types.DataType. | None |
Examples:
ds.add_column("labels", deeplake.types.Int32)
ds.add_column("categories", "int32")
ds.add_column("name", deeplake.types.Text())
ds.add_column("json_data", deeplake.types.Dict())
ds.add_column("images", deeplake.types.Image(dtype=deeplake.types.UInt8(), sample_compression="jpeg"))
ds.add_column("embedding", deeplake.types.Embedding(size=768))
Raises:
Type | Description |
---|---|
ColumnAlreadyExistsError | If a column with the same name already exists. |
append
append(data: DatasetView) -> None
append(
data: (
list[dict[str, Any]] | dict[str, Any] | DatasetView
)
) -> None
Adds data to the dataset.
The data can be in a variety of formats:
- A list of dictionaries, each value in the list is a row, with the dicts containing the column name and its value for the row.
- A dictionary, the keys are the column names and the values are array-like (list or numpy.array) objects corresponding to the column values.
- A DatasetView that was generated through any mechanism
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | list[dict[str, Any]], dict[str, Any], or DatasetView | The data to insert into the dataset. | required |
Examples:
ds.append({"name": ["Alice", "Bob"], "age": [25, 30]})
ds.append([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}])
ds2.append({
    "embedding": np.random.rand(4, 768),
    "text": ["Hello World"] * 4})
ds2.append([{"embedding": np.random.rand(768), "text": "Hello World"}] * 4)
Raises:
Type | Description |
---|---|
ColumnMissingAppendValueError | If any column is missing from the input data. |
UnevenColumnsError | If the input data columns are not the same length. |
InvalidTypeDimensions | If the input data does not match the column's dimensions. |
batches
Batches can be used to stream large amounts of data from a Deep Lake dataset more efficiently, for example into a DataLoader and then into the training framework.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_size | int | Number of rows in each batch. | required |
drop_last | bool | Whether to drop the final batch if it is incomplete. | False |
Examples:
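A minimal sketch of streaming the dataset in fixed-size batches; process() is a placeholder for your training or ingestion step:
```python
# Iterate over the dataset 64 rows at a time, dropping the final partial batch.
for batch in ds.batches(batch_size=64, drop_last=True):
    process(batch)
```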
commit
Atomically commits changes you have made to the dataset. After commit, other users will see your changes to the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | str | A message to store in history describing the changes made in the version. | None |
Examples:
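Illustrative usage (the column name is hypothetical):
```python
ds["labels"][0] = 1                    # make a change
ds.commit("Fixed the label of row 0")  # make it visible to other users
```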
commit_async
commit_async(message: str | None = None) -> FutureVoid
Asynchronously commits changes you have made to the dataset.
See deeplake.Dataset.commit for more information.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | str | A message to store in history describing the changes made in the commit. | None |
Examples:
```python
ds.commit_async().wait()

ds.commit_async("Added data from updated documents").wait()

async def do_commit():
    await ds.commit_async()

future = ds.commit_async()  # Check completion later with future.is_completed()
```
created_time
property
When the dataset was created. The value is auto-generated at creation time.
delete
Delete a row from the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset | int | The offset of the row within the dataset to delete. | required |
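Illustrative usage:
```python
ds.delete(5)  # remove the row at offset 5
ds.commit()   # persist the change
```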
description
instance-attribute
The description of the dataset. Setting the value will immediately persist the change without requiring a commit().
name
instance-attribute
The name of the dataset. Setting the value will immediately persist the change without requiring a commit().
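Illustrative usage; these assignments persist immediately and do not require a commit():
```python
ds.name = "document-embeddings"
ds.description = "Text chunks with 768-dimensional embeddings for retrieval"
```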
pull
Pulls any new history from the dataset at the passed url into this dataset.
Similar to deeplake.Dataset.push, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset to pull new history from. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
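A minimal sketch, assuming another copy of the dataset lives at an S3 URL; credential values are placeholders:
```python
# Bring new commits from the S3 copy into this dataset.
ds.pull("s3://bucket/other_copy_of_dataset",
        creds={"aws_access_key_id": id, "aws_secret_access_key": key})
```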
pull_async
pull_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> FutureVoid
Asynchronously pulls any new history from the dataset at the passed url into this dataset.
Similar to deeplake.Dataset.push_async, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset to pull new history from. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
push
Pushes any new history from this dataset to the dataset at the given URL.
Similar to deeplake.Dataset.pull, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the destination dataset. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
push_async
push_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> FutureVoid
Asynchronously pushes any new history from this dataset to the dataset at the given URL.
Similar to deeplake.Dataset.pull_async, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the destination dataset. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
pytorch
Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.
By default, no transformations are applied and each row is returned as a dict keyed by column name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Callable[[Any], Any] | A custom function to apply to each sample before returning it. | None |
Raises:
Type | Description |
---|---|
ImportError | If PyTorch is not installed. |
Examples:
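A minimal sketch of feeding the dataset into a DataLoader; the transform and column names are hypothetical, and a custom collate_fn may be needed depending on your column types:
```python
from torch.utils.data import DataLoader

def to_sample(row):
    # row is a dict keyed by column name
    return row["embedding"], row["labels"]

loader = DataLoader(ds.pytorch(transform=to_sample), batch_size=32, shuffle=True)
for embeddings, labels in loader:
    pass  # training step goes here
```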
remove_column
Removes the specified column from the dataset.
rename_column
Renames an existing column in the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the column to rename. | required |
new_name | str | The new name for the column. | required |
Examples:
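Illustrative usage (column names are hypothetical):
```python
ds.rename_column("labels", "category_ids")
```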
Raises:
Type | Description |
---|---|
ColumnDoesNotExistsError | If a column with the specified name does not exist. |
ColumnAlreadyExistsError | If a column with the specified new name already exists. |
rollback
Reverts any in-progress changes to the dataset you have made. Does not revert any changes that have been committed.
rollback_async
rollback_async() -> FutureVoid
Asynchronously reverts any in-progress changes to the dataset you have made. Does not revert any changes that have been committed.
set_creds_key
Sets the key used to store the credentials for the dataset.
tag
tag(name: str, version: str | None = None) -> Tag
Tags a version of the dataset. If no version is given, the current version is tagged.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the tag. | required |
version | str or None | The version of the dataset to tag. | None |
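Illustrative usage:
```python
ds.tag("v1.0")  # tag the current version; pass version= to tag an earlier one
```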
tensorflow
deeplake.ReadOnlyDataset
Bases: DatasetView
__getitem__
__getitem__(offset: int) -> RowView
__getitem__(range: slice) -> RowRangeView
__getitem__(indices: list) -> RowRangeView
__getitem__(indices: tuple) -> RowRangeView
__getitem__(column: str) -> ColumnView
__getitem__(
input: int | slice | list | tuple | str,
) -> RowView | RowRangeView | ColumnView
Returns a subset of data from the DatasetView.
The result depends on the type of value passed to the [] operator.
- int: The zero-based offset of the single row to return. Returns a deeplake.RowView.
- slice: A slice specifying the range of rows to return. Returns a deeplake.RowRangeView.
- list: A list of indices specifying the rows to return. Returns a deeplake.RowRangeView.
- tuple: A tuple of indices specifying the rows to return. Returns a deeplake.RowRangeView.
- str: A string specifying the column to return all values from. Returns a deeplake.ColumnView.
Examples:
__iter__
__iter__() -> Iterator[RowView]
batches
Batches can be used to stream large amounts of data from a Deep Lake dataset more efficiently, for example into a DataLoader and then into the training framework.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_size | int | Number of rows in each batch. | required |
drop_last | bool | Whether to drop the final batch if it is incomplete. | False |
Examples:
created_time
property
When the dataset was created. The value is auto-generated at creation time.
push
Pushes any history from this dataset to the dataset at the given URL.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the destination dataset. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
push_async
push_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> FutureVoid
Asynchronously pushes any history from this dataset to the dataset at the given URL.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the destination dataset. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
pytorch
Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.
By default, no transformations are applied and each row is returned as a dict keyed by column name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Callable[[Any], Any] | A custom function to apply to each sample before returning it. | None |
Raises:
Type | Description |
---|---|
ImportError | If PyTorch is not installed. |
Examples:
query
query(query: str) -> DatasetView
Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.
Examples:
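Illustrative TQL usage (the column names are hypothetical):
```python
view = ds.query("select * where category == 'active'")
for row in view:
    print(row["id"])
```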
query_async
query_async(query: str) -> Future
Asynchronously executes the given TQL query against the dataset and returns a Future that will resolve into a deeplake.DatasetView.
Examples:
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
tag
tag(name: str | None = None) -> Tag
Saves the current view as a tag to its source dataset and returns the tag.
tensorflow
deeplake.DatasetView
A DatasetView is a dataset-like structure. It has a defined schema and contains data which can be queried.
__getitem__
__getitem__(offset: int) -> RowView
__getitem__(range: slice) -> RowRangeView
__getitem__(indices: list) -> RowRangeView
__getitem__(indices: tuple) -> RowRangeView
__getitem__(column: str) -> ColumnView
__getitem__(
input: int | slice | list | tuple | str,
) -> RowView | RowRangeView | ColumnView
Returns a subset of data from the DatasetView.
The result depends on the type of value passed to the [] operator.
- int: The zero-based offset of the single row to return. Returns a deeplake.RowView.
- slice: A slice specifying the range of rows to return. Returns a deeplake.RowRangeView.
- list: A list of indices specifying the rows to return. Returns a deeplake.RowRangeView.
- tuple: A tuple of indices specifying the rows to return. Returns a deeplake.RowRangeView.
- str: A string specifying the column to return all values from. Returns a deeplake.ColumnView.
Examples:
__iter__
__iter__() -> Iterator[RowView]
batches
Batches can be used to stream large amounts of data from a Deep Lake dataset more efficiently, for example into a DataLoader and then into the training framework.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_size | int | Number of rows in each batch. | required |
drop_last | bool | Whether to drop the final batch if it is incomplete. | False |
Examples:
pytorch
Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.
By default, no transformations are applied and each row is returned as a dict keyed by column name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Callable[[Any], Any] | A custom function to apply to each sample before returning it. | None |
Raises:
Type | Description |
---|---|
ImportError | If PyTorch is not installed. |
Examples:
query
query(query: str) -> DatasetView
Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.
Examples:
query_async
query_async(query: str) -> Future
Asynchronously executes the given TQL query against the dataset and returns a Future that will resolve into a deeplake.DatasetView.
Examples:
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
tag
tag(name: str | None = None) -> Tag
Saves the current view as a tag to its source dataset and returns the tag.
tensorflow
deeplake.Column
Bases: ColumnView
Provides read-write access to a column in a dataset. Column extends ColumnView with methods for modifying data, making it suitable for dataset creation and updates in ML workflows.
The Column class allows you to:
- Read and write data using integer indices, slices, or lists of indices
- Modify data asynchronously for better performance
- Access and modify column metadata
- Handle various data types common in ML: images, embeddings, labels, etc.
Examples:
Update training labels:
# Update single label
ds["labels"][0] = 1
# Update batch of labels
ds["labels"][0:32] = new_labels
# Async update for better performance
future = ds["labels"].set_async(slice(0, 32), new_labels)
future.wait()
Store image embeddings:
# Generate and store embeddings
embeddings = model.encode(images)
ds["embeddings"][0:len(embeddings)] = embeddings
Manage column metadata:
# Store preprocessing parameters
ds["images"].metadata["mean"] = [0.485, 0.456, 0.406]
ds["images"].metadata["std"] = [0.229, 0.224, 0.225]
__getitem__
Retrieve data from the column at the specified index or range.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int, slice, list, or tuple | Can be an int (single item index), a slice (a range of indices, e.g. 0:10), or a list/tuple (multiple specific indices). | required |
Returns:
Type | Description |
---|---|
Any | The data at the specified index/indices. The type depends on the column's data type. |
Examples:
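Illustrative usage (the column name is hypothetical):
```python
first = ds["images"][0]           # single item
window = ds["images"][10:20]      # range of items
picked = ds["images"][[0, 2, 4]]  # specific indices
```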
__len__
Get the number of items in the column.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of items in the column. |
__setitem__
Set data in the column at the specified index or range.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int or slice | Can be an int (single item index) or a slice (a range of indices, e.g. 0:10). | required |
value | Any | The data to store. Must match the column's data type. | required |
Examples:
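Illustrative usage (column names and values are hypothetical):
```python
ds["labels"][7] = 2              # overwrite a single value
ds["labels"][0:32] = new_labels  # overwrite a contiguous range
ds.commit()                      # persist the updates
```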
get_async
get_async(index: int | slice | list | tuple) -> Future
Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int, slice, list, or tuple | Can be an int (single item index), a slice (a range of indices), or a list/tuple (multiple specific indices). | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that resolves to the requested data. |
Examples:
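Illustrative usage; do_other_work() is a placeholder for work that overlaps with the fetch:
```python
future = ds["images"].get_async(slice(0, 32))
do_other_work()
images = future.result()  # blocks only if the data is not ready yet
```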
name
property
Get the name of the column.
Returns:
Name | Type | Description |
---|---|---|
str | str | The column name. |
set_async
set_async(index: int | slice, value: Any) -> FutureVoid
Asynchronously set data in the column. Useful for large updates or when modifying multiple items in ML pipelines.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int or slice | Can be an int (single item index) or a slice (a range of indices). | required |
value | Any | The data to store. Must match the column's data type. | required |
Returns:
Name | Type | Description |
---|---|---|
FutureVoid | FutureVoid | A FutureVoid that completes when the update is finished. |
Examples:
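Illustrative usage; do_other_work() is a placeholder for work that overlaps with the write:
```python
future = ds["embeddings"].set_async(slice(0, 32), new_embeddings)
do_other_work()
future.wait()  # block until the update is finished
```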
deeplake.ColumnView
Provides read-only access to a column in a dataset. ColumnView is designed for efficient data access in ML workflows, supporting both synchronous and asynchronous operations.
The ColumnView class allows you to:
- Access column data using integer indices, slices, or lists of indices
- Retrieve data asynchronously for better performance in ML pipelines
- Access column metadata and properties
- Get information about linked data if the column contains references
Examples:
Load image data from a column for training:
# Access a single image
image = ds["images"][0]
# Load a batch of images
batch = ds["images"][0:32]
# Async load for better performance
images_future = ds["images"].get_async(slice(0, 32))
images = images_future.result()
Access embeddings for similarity search:
# Get all embeddings
embeddings = ds["embeddings"][:]
# Get specific embeddings by indices
selected = ds["embeddings"][[1, 5, 10]]
Check column properties:
# Get column name
name = ds["images"].name
# Access metadata
if "mean" in ds["images"].metadata.keys():
mean = dataset["images"].metadata["mean"]
__getitem__
Retrieve data from the column at the specified index or range.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int, slice, list, or tuple | Can be an int (single item index), a slice (a range of indices, e.g. 0:10), or a list/tuple (multiple specific indices). | required |
Returns:
Type | Description |
---|---|
Any | The data at the specified index/indices. The type depends on the column's data type. |
Examples:
__len__
Get the number of items in the column.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of items in the column. |
get_async
get_async(index: int | slice | list | tuple) -> Future
Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int, slice, list, or tuple | Can be an int (single item index), a slice (a range of indices), or a list/tuple (multiple specific indices). | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that resolves to the requested data. |
Examples:
metadata
property
Access the column's metadata. Useful for storing statistics, preprocessing parameters, or other information about the column data.
Returns:
Name | Type | Description |
---|---|---|
ReadOnlyMetadata | ReadOnlyMetadata | A ReadOnlyMetadata object for reading metadata. |
Examples:
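Illustrative usage (the metadata keys are hypothetical):
```python
meta = ds["images"].metadata
if "mean" in meta.keys():
    mean = meta["mean"]
```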
deeplake.Row
Provides mutable access to a particular row in a dataset.
get_async
get_async(column: str) -> Future
Asynchronously retrieves data for the specified column and returns a Future object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to retrieve data for. | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that will resolve to the value containing the column data. |
Examples:
future = row.get_async("column_name")
column = future.result() # Blocking call to get the result when it's ready.
Notes
- The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
- You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.
set_async
set_async(column: str, value: Any) -> FutureVoid
Asynchronously sets a value for the specified column and returns a FutureVoid object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to update. | required |
value | Any | The value to set for the column. | required |
Returns:
Name | Type | Description |
---|---|---|
FutureVoid | FutureVoid | A FutureVoid object that will resolve when the operation is complete. |
Examples:
future_void = row.set_async("column_name", new_value)
future_void.wait() # Blocks until the operation is complete.
Notes
- The method sets the value asynchronously and immediately returns a FutureVoid.
- You can either block and wait for the operation to complete using wait() or await the FutureVoid object in an asynchronous context.
deeplake.RowView
Provides access to a particular row in a dataset.
get_async
get_async(column: str) -> Future
Asynchronously retrieves data for the specified column and returns a Future object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to retrieve data for. | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that will resolve to the value containing the column data. |
Examples:
future = row_view.get_async("column_name")
column = future.result() # Blocking call to get the result when it's ready.
Notes
- The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
- You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.
deeplake.RowRange
Provides mutable access to a row range in a dataset.
get_async
get_async(column: str) -> Future
Asynchronously retrieves data for the specified column and returns a Future object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to retrieve data for. | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that will resolve to the value containing the column data. |
Examples:
future = row_range.get_async("column_name")
column = future.result() # Blocking call to get the result when it's ready.
Notes
- The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
- You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.
set_async
set_async(column: str, value: Any) -> FutureVoid
Asynchronously sets a value for the specified column and returns a FutureVoid object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to update. | required |
value | Any | The value to set for the column. | required |
Returns:
Name | Type | Description |
---|---|---|
FutureVoid | FutureVoid | A FutureVoid object that will resolve when the operation is complete. |
Examples:
future_void = row_range.set_async("column_name", new_value)
future_void.wait() # Blocks until the operation is complete.
Notes
- The method sets the value asynchronously and immediately returns a FutureVoid.
- You can either block and wait for the operation to complete using wait() or await the FutureVoid object in an asynchronous context.
deeplake.RowRangeView
Provides access to a row range in a dataset.
get_async
get_async(column: str) -> Future
Asynchronously retrieves data for the specified column and returns a Future object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to retrieve data for. | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that will resolve to the value containing the column data. |
Examples:
future = row_range_view.get_async("column_name")
column = future.result() # Blocking call to get the result when it's ready.
Notes
- The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
- You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.
deeplake.Future
A future representing an asynchronous operation result in ML pipelines.
The Future class enables non-blocking operations for data loading and processing, particularly useful when working with large ML datasets or distributed training. Once resolved, the Future holds the operation result which can be accessed either synchronously or asynchronously.
Methods:
Name | Description |
---|---|
result | Blocks until the Future resolves and returns the result. |
__await__ | Enables using the Future in async/await syntax. |
is_completed | Checks if the Future has resolved without blocking. |
Examples:
Loading an ML dataset asynchronously:
future = deeplake.open_async("s3://ml-data/embeddings")
# Check status without blocking
if not future.is_completed():
    print("Still loading...")
# Block until ready
ds = future.result()
Using with async/await:
async def load_data():
    ds = await deeplake.open_async("s3://ml-data/images")
    batch = await ds.images.get_async(slice(0, 32))
    return batch
__await__
is_completed
deeplake.FutureVoid
A Future representing a void async operation in ML pipelines.
Similar to Future but for operations that don't return values, like saving or committing changes. Useful for non-blocking data management operations.
Methods:
Name | Description |
---|---|
wait | Blocks until the operation completes. |
__await__ | Enables use with async/await syntax. |
is_completed | Checks completion status without blocking. |
Examples:
Asynchronous dataset updates:
# Update embeddings without blocking
future = ds["embeddings"].set_async(slice(0, 32), new_embeddings)
# Do other work while update happens
process_other_data()
# Wait for update to complete
future.wait()
Using with async/await:
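A minimal sketch, assuming an async context and hypothetical column names:
```python
async def update_and_commit():
    # FutureVoid objects can be awaited directly instead of calling wait().
    await ds["labels"].set_async(slice(0, 32), new_labels)
    await ds.commit_async()
```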