Dataset APIs
Dataset Management
Datasets can be created, loaded, and managed through static factory methods in the deeplake
module.
deeplake.create
create(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
schema: SchemaTemplate | None = None,
) -> Dataset
Creates a new dataset at the given URL.
To open an existing dataset, use deeplake.open.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. URLs can use the file://, al://, s3://, azure://, gcs://, or mem:// protocols (see the examples below). A URL without a protocol is assumed to be a file:// URL. | required |
creds | dict[str, str] | Credentials used to access the dataset storage, such as cloud provider keys or a managed creds_key (see the examples below). | None |
token | str | Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | None |
schema | SchemaTemplate | The initial schema to use for the dataset. | None |
Examples:
# Create a dataset in your local filesystem:
ds = deeplake.create("directory_path")
ds.add_column("id", types.Int32())
ds.add_column("url", types.Text())
ds.add_column("embedding", types.Embedding(768))
ds.commit()
ds.summary()

# Create a dataset in your app.activeloop.ai organization:
ds = deeplake.create("al://organization_id/dataset_name")

# Create a dataset stored in your cloud using specified credentials:
ds = deeplake.create("s3://mybucket/my_dataset",
                     creds={"aws_access_key_id": id, "aws_secret_access_key": key})

# Create a dataset stored in your cloud using app.activeloop.ai managed credentials:
ds = deeplake.create("s3://mybucket/my_dataset",
                     creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

ds = deeplake.create("azure://bucket/path/to/dataset")
ds = deeplake.create("gcs://bucket/path/to/dataset")
ds = deeplake.create("mem://in-memory")
Raises:
Type | Description |
---|---|
ValueError | If a dataset already exists at the given URL. |
deeplake.create_async
create_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
schema: SchemaTemplate | None = None,
) -> Future
Asynchronously creates a new dataset at the given URL.
See deeplake.create for more information.
To open an existing dataset, use deeplake.open_async.
Examples:
async def create_dataset():
    # Asynchronously create a dataset in your local filesystem:
    ds = await deeplake.create_async("directory_path")
    await ds.add_column("id", types.Int32())
    await ds.add_column("url", types.Text())
    await ds.add_column("embedding", types.Embedding(768))
    await ds.commit()
    await ds.summary()  # Example of usage in an async context

# Alternatively, create a dataset using .result().
future_ds = deeplake.create_async("directory_path")
ds = future_ds.result()  # Blocks until the dataset is created

# Create a dataset in your app.activeloop.ai organization:
ds = await deeplake.create_async("al://organization_id/dataset_name")

# Create a dataset stored in your cloud using specified credentials:
ds = await deeplake.create_async("s3://mybucket/my_dataset",
                                 creds={"aws_access_key_id": id, "aws_secret_access_key": key})

# Create a dataset stored in your cloud using app.activeloop.ai managed credentials:
ds = await deeplake.create_async("s3://mybucket/my_dataset",
                                 creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

ds = await deeplake.create_async("azure://bucket/path/to/dataset")
ds = await deeplake.create_async("gcs://bucket/path/to/dataset")
ds = await deeplake.create_async("mem://in-memory")
Raises:
Type | Description |
---|---|
ValueError | If a dataset already exists at the given URL (raised when the future is awaited). |
deeplake.open
open(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> Dataset
Opens an existing dataset, potentially for modifying its content.
See deeplake.open_read_only for opening the dataset in read-only mode.
To create a new dataset, use deeplake.create.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. URLs can use the file://, al://, s3://, azure://, gcs://, or mem:// protocols (see the examples below). A URL without a protocol is assumed to be a file:// URL. | required |
creds | dict[str, str] | Credentials used to access the dataset storage, such as cloud provider keys or a managed creds_key (see the examples below). | None |
token | str | Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | None |
Examples:
# Load dataset managed by Deep Lake.
ds = deeplake.open("al://organization_id/dataset_name")

# Load dataset stored in your cloud using your own credentials.
ds = deeplake.open("s3://bucket/my_dataset",
                   creds={"aws_access_key_id": id, "aws_secret_access_key": key})

# Load dataset stored in your cloud using Deep Lake managed credentials.
ds = deeplake.open("s3://bucket/my_dataset",
                   creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

ds = deeplake.open("s3://bucket/path/to/dataset")
ds = deeplake.open("azure://bucket/path/to/dataset")
ds = deeplake.open("gcs://bucket/path/to/dataset")
deeplake.open_async
open_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> Future
Asynchronously opens an existing dataset, potentially for modifying its content.
See deeplake.open for opening the dataset synchronously.
Examples:
async def async_open():
    # Asynchronously load dataset managed by Deep Lake using await.
    ds = await deeplake.open_async("al://organization_id/dataset_name")

    # Asynchronously load dataset stored in your cloud using your own credentials.
    ds = await deeplake.open_async("s3://bucket/my_dataset",
                                   creds={"aws_access_key_id": id, "aws_secret_access_key": key})

    # Asynchronously load dataset stored in your cloud using Deep Lake managed credentials.
    ds = await deeplake.open_async("s3://bucket/my_dataset",
                                   creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

    ds = await deeplake.open_async("s3://bucket/path/to/dataset")
    ds = await deeplake.open_async("azure://bucket/path/to/dataset")
    ds = await deeplake.open_async("gcs://bucket/path/to/dataset")

# Alternatively, load the dataset using .result().
future_ds = deeplake.open_async("al://organization_id/dataset_name")
ds = future_ds.result()  # Blocks until the dataset is loaded
deeplake.open_read_only
open_read_only(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> ReadOnlyDataset
Opens an existing dataset in read-only mode.
See deeplake.open for opening datasets for modification.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. URLs can use the file://, al://, s3://, azure://, gcs://, or mem:// protocols (see the examples below). A URL without a protocol is assumed to be a file:// URL. | required |
creds | dict[str, str] | Credentials used to access the dataset storage, such as cloud provider keys or a managed creds_key. | None |
token | str | Activeloop token to authenticate the user. | None |
Examples:
ds = deeplake.open_read_only("directory_path")
ds.summary()
Example Output:
Dataset length: 5
Columns:
id : int32
url : text
embedding: embedding(768)
ds = deeplake.open_read_only("file:///path/to/dataset")
ds = deeplake.open_read_only("s3://bucket/path/to/dataset")
ds = deeplake.open_read_only("azure://bucket/path/to/dataset")
ds = deeplake.open_read_only("gcs://bucket/path/to/dataset")
ds = deeplake.open_read_only("mem://in-memory")
deeplake.open_read_only_async
open_read_only_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> Future
Asynchronously opens an existing dataset in read-only mode.
See deeplake.open_async for opening datasets for modification and deeplake.open_read_only for sync open.
Examples:
# Asynchronously open a dataset in read-only mode:
ds = await deeplake.open_read_only_async("directory_path")
# Alternatively, open the dataset using .result().
future_ds = deeplake.open_read_only_async("directory_path")
ds = future_ds.result() # Blocks until the dataset is loaded
ds = await deeplake.open_read_only_async("file:///path/to/dataset")
ds = await deeplake.open_read_only_async("s3://bucket/path/to/dataset")
ds = await deeplake.open_read_only_async("azure://bucket/path/to/dataset")
ds = await deeplake.open_read_only_async("gcs://bucket/path/to/dataset")
ds = await deeplake.open_read_only_async("mem://in-memory")
deeplake.delete
Deletes an existing dataset.
Warning
This operation is irreversible. All data will be lost.
If concurrent processes are attempting to write to the dataset while it's being deleted, it may lead to data inconsistency. It's recommended to use this operation with caution.
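A minimal sketch of usage (the URL is a placeholder; pass credentials the same way as for deeplake.open if the storage requires them):
```python
# Irreversibly delete the dataset at the given URL.
deeplake.delete("al://my_org/old_dataset")
```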
deeplake.copy
copy(
src: str,
dst: str,
src_creds: dict[str, str] | None = None,
dst_creds: dict[str, str] | None = None,
token: str | None = None,
) -> None
Copies the dataset at the source URL to the destination URL.
NOTE: Currently private due to potential issues in file timestamp handling
Parameters:
Name | Type | Description | Default |
---|---|---|---|
src | str | The URL of the source dataset. | required |
dst | str | The URL of the destination dataset. | required |
src_creds | dict[str, str] | Credentials used to access the source dataset storage. | None |
dst_creds | dict[str, str] | Credentials used to access the destination dataset storage. | None |
token | str | Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | None |
Examples:
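An illustrative sketch; the URLs and credential values are placeholders:
```python
# Copy a local dataset into S3, supplying credentials for the destination only.
deeplake.copy(
    "/path/to/local_dataset",
    "s3://bucket/copied_dataset",
    dst_creds={"aws_access_key_id": id, "aws_secret_access_key": key},
)
```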
deeplake.like
like(
src: DatasetView,
dest: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> Dataset
Creates a new dataset by copying the source dataset's structure to a new location.
Note
No data is copied.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
src | DatasetView | The dataset to copy the structure from. | required |
dest | str | The URL to create the new dataset at. | required |
creds | dict[str, str] | Credentials used to access the destination storage. | None |
token | str | Activeloop token, used for fetching credentials to the dataset at the given path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | None |
Examples:
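An illustrative sketch; ds is an already opened dataset and the destination URL is a placeholder:
```python
# Create an empty dataset with the same schema as ds at a new location.
new_ds = deeplake.like(ds, "s3://bucket/empty_clone")
new_ds.summary()  # Same columns as ds, but no rows copied
```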
deeplake.from_parquet
from_parquet(url: str) -> ReadOnlyDataset
Opens a Parquet dataset in the deeplake format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the Parquet dataset. If no protocol is specified, it is assumed to be a file:// URL. | required |
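Illustrative usage (the path is a placeholder):
```python
# Open a Parquet file as a read-only dataset in the Deep Lake format.
ds = deeplake.from_parquet("path/to/data.parquet")
ds.summary()
```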
deeplake.connect
connect(
src: str,
dest: str | None = None,
org_id: str | None = None,
creds_key: str | None = None,
token: str | None = None,
) -> Dataset
Connects an existing dataset to your app.activeloop.ai account.
Either dest or org_id is required, but not both.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
src | str | The URL of the existing dataset. | required |
dest | str | Desired Activeloop URL for the dataset entry, e.g. al://my_org/dataset. | None |
org_id | str | The id of the organization to store the dataset under. The dataset name will be based on the source dataset's name. | None |
creds_key | str | The creds_key of the managed credentials that will be used to access the source path. If not set, the organization's default credentials are used. | None |
token | str | Activeloop token used to fetch the managed credentials. | None |
Examples:
```python
ds = deeplake.connect("s3://bucket/path/to/dataset", "al://my_org/dataset")
ds = deeplake.connect("s3://bucket/path/to/dataset", "al://my_org/dataset", creds_key="my_key")

# Connect the dataset as al://my_org/dataset
ds = deeplake.connect("s3://bucket/path/to/dataset", org_id="my_org")

ds = deeplake.connect("az://bucket/path/to/dataset", "al://my_org/dataset", creds_key="my_key")
ds = deeplake.connect("gcs://bucket/path/to/dataset", "al://my_org/dataset", creds_key="my_key")
```
deeplake.disconnect
Disconnects the dataset from your Activeloop account.
See deeplake.connect.
Note
Does not delete the stored data; it only removes the connection from the Activeloop organization.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. | required |
token | str | Activeloop token to authenticate the user. | None |
Examples:
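An illustrative sketch; the URL is a placeholder:
```python
# Remove the Activeloop entry for the dataset; the underlying stored data is untouched.
deeplake.disconnect("al://my_org/dataset")
```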
deeplake.convert
convert(
src: str,
dst: str,
dst_creds: Optional[Dict[str, str]] = None,
token: Optional[str] = None,
) -> None
Copies the v3 dataset at src into a new dataset in the new v4 format.
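A minimal sketch, assuming src points at an existing v3 dataset and dst is a new location; the URLs and credential values are placeholders:
```python
# Convert a Deep Lake v3 dataset into a new v4 dataset.
deeplake.convert("s3://bucket/v3_dataset", "s3://bucket/v4_dataset",
                 dst_creds={"aws_access_key_id": id, "aws_secret_access_key": key})
```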
deeplake.Dataset
Bases: DatasetView
Datasets are the primary data structure used in Deep Lake. They are used to store and manage data for searching, training, and evaluation.
Unlike deeplake.ReadOnlyDataset, instances of Dataset can be modified.
__getitem__
__getitem__(offset: int) -> Row
__getitem__(range: slice) -> RowRange
__getitem__(indices: list) -> RowRange
__getitem__(indices: tuple) -> RowRange
__getitem__(column: str) -> Column
Returns a subset of data from the Dataset.
The result depends on the type of value passed to the [] operator.
- int: The zero-based offset of the single row to return. Returns a deeplake.Row.
- slice: A slice specifying the range of rows to return. Returns a deeplake.RowRange.
- list: A list of indices specifying the rows to return. Returns a deeplake.RowRange.
- tuple: A tuple of indices specifying the rows to return. Returns a deeplake.RowRange.
- str: A string specifying the column to return all values from. Returns a deeplake.Column.
Examples:
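Illustrative usage of the different index types (column names are hypothetical):
```python
row = ds[0]               # deeplake.Row for the first row
rows = ds[0:100]          # deeplake.RowRange for rows 0..99
picked = ds[[1, 5, 10]]   # deeplake.RowRange for specific indices
column = ds["embedding"]  # deeplake.Column with every value in the column
```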
__getstate__
Returns a dict that can be pickled and used to restore this dataset.
Note
Pickling a dataset does not copy the dataset, it only saves attributes that can be used to restore the dataset.
__iter__
__iter__() -> Iterator[Row]
__setstate__
Restores dataset from a pickled state.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
arg0 | dict | The pickled state used to restore the dataset. | required |
add_column
add_column(
name: str,
dtype: DataType | str | Type | type | Callable,
format: DataFormat | None = None,
) -> None
Add a new column to the dataset.
Any existing rows in the dataset will have a None value for the new column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the column. | required |
dtype | DataType, str, Type, type, or Callable | The type of the column. See deeplake.types and the examples below for possible values. | required |
format | DataFormat | The format of the column, if applicable. Only required when the dtype is deeplake.types.DataType. | None |
Examples:
ds.add_column("labels", deeplake.types.Int32)
ds.add_column("categories", "int32")
ds.add_column("name", deeplake.types.Text())
ds.add_column("json_data", deeplake.types.Dict())
ds.add_column("images", deeplake.types.Image(dtype=deeplake.types.UInt8(), sample_compression="jpeg"))
ds.add_column("embedding", deeplake.types.Embedding(size=768))
Raises:
Type | Description |
---|---|
ColumnAlreadyExistsError | If a column with the same name already exists. |
append
append(data: DatasetView) -> None
append(
data: (
list[dict[str, Any]] | dict[str, Any] | DatasetView
)
) -> None
Adds data to the dataset.
The data can be in a variety of formats:
- A list of dictionaries, each value in the list is a row, with the dicts containing the column name and its value for the row.
- A dictionary, the keys are the column names and the values are array-like (list or numpy.array) objects corresponding to the column values.
- A DatasetView that was generated through any mechanism
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | list[dict[str, Any]], dict[str, Any], or DatasetView | The data to insert into the dataset. | required |
Examples:
ds.append({"name": ["Alice", "Bob"], "age": [25, 30]})
ds.append([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}])
ds2.append({
    "embedding": np.random.rand(4, 768),
    "text": ["Hello World"] * 4})
ds2.append([{"embedding": np.random.rand(768), "text": "Hello World"}] * 4)
Raises:
Type | Description |
---|---|
ColumnMissingAppendValueError | If any column is missing from the input data. |
UnevenColumnsError | If the input data columns are not the same length. |
InvalidTypeDimensions | If the input data does not match the column's dimensions. |
batches
Batches can be used to stream large amounts of data from a Deep Lake dataset more efficiently, for example into a DataLoader and then into the training framework.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_size | int | Number of rows in each batch. | required |
drop_last | bool | Whether to drop the final batch if it is incomplete. | False |
Examples:
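A minimal sketch of streaming the dataset in fixed-size batches; process() is a placeholder for your training or ingestion step:
```python
# Iterate over the dataset 64 rows at a time, dropping the final partial batch.
for batch in ds.batches(batch_size=64, drop_last=True):
    process(batch)
```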
commit
Atomically commits changes you have made to the dataset. After commit, other users will see your changes to the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | str | A message to store in history describing the changes made in the version. | None |
Examples:
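Illustrative usage (the column name is hypothetical):
```python
ds["labels"][0] = 1                    # make a change
ds.commit("Fixed the label of row 0")  # make it visible to other users
```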
commit_async
commit_async(message: str | None = None) -> FutureVoid
Asynchronously commits changes you have made to the dataset.
See deeplake.Dataset.commit for more information.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | str | A message to store in history describing the changes made in the commit. | None |
Examples:
```python
ds.commit_async().wait()

ds.commit_async("Added data from updated documents").wait()

async def do_commit():
    await ds.commit_async()

future = ds.commit_async()  # Check completion later with future.is_completed()
```
created_time
property
When the dataset was created. The value is auto-generated at creation time.
delete
Delete a row from the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset | int | The offset of the row within the dataset to delete. | required |
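Illustrative usage:
```python
ds.delete(5)  # remove the row at offset 5
ds.commit()   # persist the change
```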
description
instance-attribute
The description of the dataset. Setting the value will immediately persist the change without requiring a commit().
name
instance-attribute
The name of the dataset. Setting the value will immediately persist the change without requiring a commit().
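Illustrative usage; these assignments persist immediately and do not require a commit():
```python
ds.name = "document-embeddings"
ds.description = "Text chunks with 768-dimensional embeddings for retrieval"
```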
pull
Pulls any new history from the dataset at the passed url into this dataset.
Similar to deeplake.Dataset.push, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset to pull new history from. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
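A minimal sketch, assuming another copy of the dataset lives at an S3 URL; credential values are placeholders:
```python
# Bring new commits from the S3 copy into this dataset.
ds.pull("s3://bucket/other_copy_of_dataset",
        creds={"aws_access_key_id": id, "aws_secret_access_key": key})
```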
pull_async
pull_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> FutureVoid
Asynchronously pulls any new history from the dataset at the passed url into this dataset.
Similar to deeplake.Dataset.push_async, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset to pull new history from. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
push
Pushes any new history from this dataset to the dataset at the given URL.
Similar to deeplake.Dataset.pull, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the destination dataset. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
push_async
push_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> FutureVoid
Asynchronously pushes any new history from this dataset to the dataset at the given URL.
Similar to deeplake.Dataset.pull_async, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the destination dataset. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
pytorch
Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.
By default, no transformations are applied and each row is returned as a dict keyed by column name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Callable[[Any], Any] | A custom function to apply to each sample before returning it. | None |
Raises:
Type | Description |
---|---|
ImportError | If PyTorch is not installed. |
Examples:
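A minimal sketch of feeding the dataset into a DataLoader; the transform and column names are hypothetical, and a custom collate_fn may be needed depending on your column types:
```python
from torch.utils.data import DataLoader

def to_sample(row):
    # row is a dict keyed by column name
    return row["embedding"], row["labels"]

loader = DataLoader(ds.pytorch(transform=to_sample), batch_size=32, shuffle=True)
for embeddings, labels in loader:
    pass  # training step goes here
```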
remove_column
Removes the specified column from the dataset.
rename_column
Renames an existing column in the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the column to rename. | required |
new_name | str | The new name for the column. | required |
Examples:
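Illustrative usage (column names are hypothetical):
```python
ds.rename_column("labels", "category_ids")
```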
Raises:
Type | Description |
---|---|
ColumnDoesNotExistsError | If a column with the specified name does not exist. |
ColumnAlreadyExistsError | If a column with the specified new name already exists. |
rollback
Reverts any in-progress changes to the dataset you have made. Does not revert any changes that have been committed.
rollback_async
rollback_async() -> FutureVoid
Asynchronously reverts any in-progress changes to the dataset you have made. Does not revert any changes that have been committed.
set_creds_key
Sets the key used to store the credentials for the dataset.
tag
tag(name: str, version: str | None = None) -> Tag
Tags a version of the dataset. If no version is given, the current version is tagged.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the tag. | required |
version | str or None | The version of the dataset to tag. | None |
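Illustrative usage:
```python
ds.tag("v1.0")  # tag the current version; pass version= to tag an earlier one
```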
tensorflow
deeplake.ReadOnlyDataset
Bases: DatasetView
__getitem__
__getitem__(offset: int) -> RowView
__getitem__(range: slice) -> RowRangeView
__getitem__(indices: list) -> RowRangeView
__getitem__(indices: tuple) -> RowRangeView
__getitem__(column: str) -> ColumnView
__getitem__(
input: int | slice | list | tuple | str,
) -> RowView | RowRangeView | ColumnView
Returns a subset of data from the DatasetView.
The result depends on the type of value passed to the [] operator.
- int: The zero-based offset of the single row to return. Returns a deeplake.RowView.
- slice: A slice specifying the range of rows to return. Returns a deeplake.RowRangeView.
- list: A list of indices specifying the rows to return. Returns a deeplake.RowRangeView.
- tuple: A tuple of indices specifying the rows to return. Returns a deeplake.RowRangeView.
- str: A string specifying the column to return all values from. Returns a deeplake.ColumnView.
Examples:
__iter__
__iter__() -> Iterator[RowView]
batches
Batches can be used to stream large amounts of data from a Deep Lake dataset more efficiently, for example into a DataLoader and then into the training framework.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_size | int | Number of rows in each batch. | required |
drop_last | bool | Whether to drop the final batch if it is incomplete. | False |
Examples:
created_time
property
When the dataset was created. The value is auto-generated at creation time.
push
Pushes any history from this dataset to the dataset at the given URL.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the destination dataset. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
push_async
push_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> FutureVoid
Asynchronously pushes any history from this dataset to the dataset at the given URL.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the destination dataset. | required |
creds | dict[str, str] | Optional credentials needed to connect to the dataset. | None |
token | str | Optional deeplake token. | None |
pytorch
Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.
By default, no transformations are applied and each row is returned as a dict keyed by column name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Callable[[Any], Any] | A custom function to apply to each sample before returning it. | None |
Raises:
Type | Description |
---|---|
ImportError | If PyTorch is not installed. |
Examples:
query
query(query: str) -> DatasetView
Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.
Examples:
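Illustrative TQL usage (the column names are hypothetical):
```python
view = ds.query("select * where category == 'active'")
for row in view:
    print(row["id"])
```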
query_async
query_async(query: str) -> Future
Asynchronously executes the given TQL query against the dataset and returns a Future that will resolve into a deeplake.DatasetView.
Examples:
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
tag
tag(name: str | None = None) -> Tag
Saves the current view as a tag to its source dataset and returns the tag.
tensorflow
deeplake.DatasetView
A DatasetView is a dataset-like structure. It has a defined schema and contains data which can be queried.
__getitem__
__getitem__(offset: int) -> RowView
__getitem__(range: slice) -> RowRangeView
__getitem__(indices: list) -> RowRangeView
__getitem__(indices: tuple) -> RowRangeView
__getitem__(column: str) -> ColumnView
__getitem__(
input: int | slice | list | tuple | str,
) -> RowView | RowRangeView | ColumnView
Returns a subset of data from the DatasetView.
The result depends on the type of value passed to the [] operator.
- int: The zero-based offset of the single row to return. Returns a deeplake.RowView.
- slice: A slice specifying the range of rows to return. Returns a deeplake.RowRangeView.
- list: A list of indices specifying the rows to return. Returns a deeplake.RowRangeView.
- tuple: A tuple of indices specifying the rows to return. Returns a deeplake.RowRangeView.
- str: A string specifying the column to return all values from. Returns a deeplake.ColumnView.
Examples:
__iter__
__iter__() -> Iterator[RowView]
batches
Batches can be used to stream large amounts of data from a Deep Lake dataset more efficiently, for example into a DataLoader and then into the training framework.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_size | int | Number of rows in each batch. | required |
drop_last | bool | Whether to drop the final batch if it is incomplete. | False |
Examples:
pytorch
Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.
By default, no transformations are applied and each row is returned as a dict keyed by column name.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Callable[[Any], Any] | A custom function to apply to each sample before returning it. | None |
Raises:
Type | Description |
---|---|
ImportError | If PyTorch is not installed. |
Examples:
query
query(query: str) -> DatasetView
Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.
Examples:
query_async
query_async(query: str) -> Future
Asynchronously executes the given TQL query against the dataset and returns a Future that will resolve into a deeplake.DatasetView.
Examples:
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
tag
tag(name: str | None = None) -> Tag
Saves the current view as a tag to its source dataset and returns the tag.
tensorflow
deeplake.Column
Bases: ColumnView
Provides read-write access to a column in a dataset. Column extends ColumnView with methods for modifying data, making it suitable for dataset creation and updates in ML workflows.
The Column class allows you to:
- Read and write data using integer indices, slices, or lists of indices
- Modify data asynchronously for better performance
- Access and modify column metadata
- Handle various data types common in ML: images, embeddings, labels, etc.
Examples:
Update training labels:
# Update single label
ds["labels"][0] = 1
# Update batch of labels
ds["labels"][0:32] = new_labels
# Async update for better performance
future = ds["labels"].set_async(slice(0, 32), new_labels)
future.wait()
Store image embeddings:
# Generate and store embeddings
embeddings = model.encode(images)
ds["embeddings"][0:len(embeddings)] = embeddings
Manage column metadata:
# Store preprocessing parameters
ds["images"].metadata["mean"] = [0.485, 0.456, 0.406]
ds["images"].metadata["std"] = [0.229, 0.224, 0.225]
__getitem__
Retrieve data from the column at the specified index or range.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int, slice, list, or tuple | Can be an int (single item index), a slice (a range of indices, e.g. 0:10), or a list/tuple (multiple specific indices). | required |
Returns:
Type | Description |
---|---|
Any | The data at the specified index/indices. The type depends on the column's data type. |
Examples:
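Illustrative usage (the column name is hypothetical):
```python
first = ds["images"][0]           # single item
window = ds["images"][10:20]      # range of items
picked = ds["images"][[0, 2, 4]]  # specific indices
```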
__len__
Get the number of items in the column.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of items in the column. |
__setitem__
Set data in the column at the specified index or range.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int or slice | Can be an int (single item index) or a slice (a range of indices, e.g. 0:10). | required |
value | Any | The data to store. Must match the column's data type. | required |
Examples:
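Illustrative usage (column names and values are hypothetical):
```python
ds["labels"][7] = 2              # overwrite a single value
ds["labels"][0:32] = new_labels  # overwrite a contiguous range
ds.commit()                      # persist the updates
```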
get_async
get_async(index: int | slice | list | tuple) -> Future
Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int, slice, list, or tuple | Can be an int (single item index), a slice (a range of indices), or a list/tuple (multiple specific indices). | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that resolves to the requested data. |
Examples:
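Illustrative usage; do_other_work() is a placeholder for work that overlaps with the fetch:
```python
future = ds["images"].get_async(slice(0, 32))
do_other_work()
images = future.result()  # blocks only if the data is not ready yet
```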
name
property
Get the name of the column.
Returns:
Name | Type | Description |
---|---|---|
str | str | The column name. |
set_async
set_async(index: int | slice, value: Any) -> FutureVoid
Asynchronously set data in the column. Useful for large updates or when modifying multiple items in ML pipelines.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int or slice | Can be an int (single item index) or a slice (a range of indices). | required |
value | Any | The data to store. Must match the column's data type. | required |
Returns:
Name | Type | Description |
---|---|---|
FutureVoid | FutureVoid | A FutureVoid that completes when the update is finished. |
Examples:
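Illustrative usage; do_other_work() is a placeholder for work that overlaps with the write:
```python
future = ds["embeddings"].set_async(slice(0, 32), new_embeddings)
do_other_work()
future.wait()  # block until the update is finished
```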
deeplake.ColumnView
Provides read-only access to a column in a dataset. ColumnView is designed for efficient data access in ML workflows, supporting both synchronous and asynchronous operations.
The ColumnView class allows you to:
- Access column data using integer indices, slices, or lists of indices
- Retrieve data asynchronously for better performance in ML pipelines
- Access column metadata and properties
- Get information about linked data if the column contains references
Examples:
Load image data from a column for training:
# Access a single image
image = ds["images"][0]
# Load a batch of images
batch = ds["images"][0:32]
# Async load for better performance
images_future = ds["images"].get_async(slice(0, 32))
images = images_future.result()
Access embeddings for similarity search:
# Get all embeddings
embeddings = ds["embeddings"][:]
# Get specific embeddings by indices
selected = ds["embeddings"][[1, 5, 10]]
Check column properties:
# Get column name
name = ds["images"].name
# Access metadata
if "mean" in ds["images"].metadata.keys():
mean = dataset["images"].metadata["mean"]
__getitem__
Retrieve data from the column at the specified index or range.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int, slice, list, or tuple | Can be an int (single item index), a slice (a range of indices, e.g. 0:10), or a list/tuple (multiple specific indices). | required |
Returns:
Type | Description |
---|---|
Any | The data at the specified index/indices. The type depends on the column's data type. |
Examples:
__len__
Get the number of items in the column.
Returns:
Name | Type | Description |
---|---|---|
int | int | Number of items in the column. |
get_async
get_async(index: int | slice | list | tuple) -> Future
Asynchronously retrieve data from the column. Useful for large datasets or when loading multiple items in ML pipelines.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
index | int, slice, list, or tuple | Can be an int (single item index), a slice (a range of indices), or a list/tuple (multiple specific indices). | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that resolves to the requested data. |
Examples:
metadata
property
Access the column's metadata. Useful for storing statistics, preprocessing parameters, or other information about the column data.
Returns:
Name | Type | Description |
---|---|---|
ReadOnlyMetadata | ReadOnlyMetadata | A ReadOnlyMetadata object for reading metadata. |
Examples:
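Illustrative usage (the metadata keys are hypothetical):
```python
meta = ds["images"].metadata
if "mean" in meta.keys():
    mean = meta["mean"]
```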
deeplake.Row
Provides mutable access to a particular row in a dataset.
get_async
get_async(column: str) -> Future
Asynchronously retrieves data for the specified column and returns a Future object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to retrieve data for. | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that will resolve to the value containing the column data. |
Examples:
future = row.get_async("column_name")
column = future.result() # Blocking call to get the result when it's ready.
Notes
- The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
- You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.
set_async
set_async(column: str, value: Any) -> FutureVoid
Asynchronously sets a value for the specified column and returns a FutureVoid object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to update. | required |
value | Any | The value to set for the column. | required |
Returns:
Name | Type | Description |
---|---|---|
FutureVoid | FutureVoid | A FutureVoid object that will resolve when the operation is complete. |
Examples:
future_void = row.set_async("column_name", new_value)
future_void.wait() # Blocks until the operation is complete.
Notes
- The method sets the value asynchronously and immediately returns a FutureVoid.
- You can either block and wait for the operation to complete using wait() or await the FutureVoid object in an asynchronous context.
deeplake.RowView
Provides access to a particular row in a dataset.
get_async
get_async(column: str) -> Future
Asynchronously retrieves data for the specified column and returns a Future object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to retrieve data for. | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that will resolve to the value containing the column data. |
Examples:
future = row_view.get_async("column_name")
column = future.result() # Blocking call to get the result when it's ready.
Notes
- The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
- You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.
deeplake.RowRange
Provides mutable access to a row range in a dataset.
get_async
get_async(column: str) -> Future
Asynchronously retrieves data for the specified column and returns a Future object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to retrieve data for. | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that will resolve to the value containing the column data. |
Examples:
future = row_range.get_async("column_name")
column = future.result() # Blocking call to get the result when it's ready.
Notes
- The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
- You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.
set_async
set_async(column: str, value: Any) -> FutureVoid
Asynchronously sets a value for the specified column and returns a FutureVoid object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to update. | required |
value | Any | The value to set for the column. | required |
Returns:
Name | Type | Description |
---|---|---|
FutureVoid | FutureVoid | A FutureVoid object that will resolve when the operation is complete. |
Examples:
future_void = row_range.set_async("column_name", new_value)
future_void.wait() # Blocks until the operation is complete.
Notes
- The method sets the value asynchronously and immediately returns a FutureVoid.
- You can either block and wait for the operation to complete using wait() or await the FutureVoid object in an asynchronous context.
deeplake.RowRangeView
Provides access to a row range in a dataset.
get_async
get_async(column: str) -> Future
Asynchronously retrieves data for the specified column and returns a Future object.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
column | str | The name of the column to retrieve data for. | required |
Returns:
Name | Type | Description |
---|---|---|
Future | Future | A Future object that will resolve to the value containing the column data. |
Examples:
future = row_range_view.get_async("column_name")
column = future.result() # Blocking call to get the result when it's ready.
Notes
- The Future will resolve asynchronously, meaning the method will not block execution while the data is being retrieved.
- You can either wait for the result using future.result() (a blocking call) or use the Future in an await expression.
deeplake.Future
A future representing an asynchronous operation result in ML pipelines.
The Future class enables non-blocking operations for data loading and processing, particularly useful when working with large ML datasets or distributed training. Once resolved, the Future holds the operation result which can be accessed either synchronously or asynchronously.
Methods:
Name | Description |
---|---|
result | Blocks until the Future resolves and returns the result. |
__await__ | Enables using the Future in async/await syntax. |
is_completed | Checks if the Future has resolved without blocking. |
Examples:
Loading an ML dataset asynchronously:
future = deeplake.open_async("s3://ml-data/embeddings")
# Check status without blocking
if not future.is_completed():
    print("Still loading...")
# Block until ready
ds = future.result()
Using with async/await:
async def load_data():
    ds = await deeplake.open_async("s3://ml-data/images")
    batch = await ds.images.get_async(slice(0, 32))
    return batch
__await__
is_completed
deeplake.FutureVoid
A Future representing a void async operation in ML pipelines.
Similar to Future but for operations that don't return values, like saving or committing changes. Useful for non-blocking data management operations.
Methods:
Name | Description |
---|---|
wait | Blocks until the operation completes. |
__await__ | Enables use with async/await syntax. |
is_completed | Checks completion status without blocking. |
Examples:
Asynchronous dataset updates:
# Update embeddings without blocking
future = ds["embeddings"].set_async(slice(0, 32), new_embeddings)
# Do other work while update happens
process_other_data()
# Wait for update to complete
future.wait()
Using with async/await:
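A minimal sketch, assuming an async context and hypothetical column names:
```python
async def update_and_commit():
    # FutureVoid objects can be awaited directly instead of calling wait().
    await ds["labels"].set_async(slice(0, 32), new_labels)
    await ds.commit_async()
```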