Dataset Classes¶
Deep Lake provides three dataset classes with different access levels:
Class | Description |
---|---|
Dataset | Full read-write access with all operations |
ReadOnlyDataset | Read-only access to prevent modifications |
DatasetView | Read-only view of query results |
Creation Methods¶
deeplake.create¶
```python
create(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
    schema: SchemaTemplate | None = None,
) -> Dataset
```
Creates a new dataset at the given URL.
To open an existing dataset, use `deeplake.open`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`url` | `str` | The URL of the dataset. URLs can be specified using the `file://`, `s3://`, `azure://`, `gcs://`, `al://`, or `mem://` protocols. A URL without a protocol is assumed to be a `file://` URL. | required |
`creds` | `dict, str` | Credentials used to access the dataset at the URL, such as cloud provider access keys or a managed credentials key (see the examples below). | `None` |
`token` | `str` | Activeloop token, used for fetching credentials to the dataset at the path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | `None` |
`schema` | `SchemaTemplate` | The initial schema to use for the dataset. | `None` |
Examples:
```python
# Create a dataset in your local filesystem:
ds = deeplake.create("directory_path")
ds.add_column("id", deeplake.types.Int32())
ds.add_column("url", deeplake.types.Text())
ds.add_column("embedding", deeplake.types.Embedding(768))
ds.commit()
ds.summary()

# Create a dataset in your app.activeloop.ai organization:
ds = deeplake.create("al://organization_id/dataset_name")

# Create a dataset stored in your cloud using specified credentials:
ds = deeplake.create("s3://mybucket/my_dataset",
    creds={"aws_access_key_id": id, "aws_secret_access_key": key})

# Create a dataset stored in your cloud using app.activeloop.ai managed credentials:
ds = deeplake.create("s3://mybucket/my_dataset",
    creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

ds = deeplake.create("azure://bucket/path/to/dataset")
ds = deeplake.create("gcs://bucket/path/to/dataset")
ds = deeplake.create("mem://in-memory")
```
Raises:
Type | Description |
---|---|
`ValueError` | If a dataset already exists at the given URL. |
deeplake.open¶
```python
open(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Dataset
```
Opens an existing dataset, potentially for modifying its content.
See `deeplake.open_read_only` for opening the dataset in read-only mode.
To create a new dataset, see `deeplake.create`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`url` | `str` | The URL of the dataset. URLs can be specified using the `file://`, `s3://`, `azure://`, `gcs://`, `al://`, or `mem://` protocols. A URL without a protocol is assumed to be a `file://` URL. | required |
`creds` | `dict, str` | Credentials used to access the dataset at the URL, such as cloud provider access keys or a managed credentials key (see the examples below). | `None` |
`token` | `str` | Activeloop token, used for fetching credentials to the dataset at the path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | `None` |
Examples:
```python
# Load a dataset managed by Deep Lake:
ds = deeplake.open("al://organization_id/dataset_name")

# Load a dataset stored in your cloud using your own credentials:
ds = deeplake.open("s3://bucket/my_dataset",
    creds={"aws_access_key_id": id, "aws_secret_access_key": key})

# Load a dataset stored in your cloud using Deep Lake managed credentials:
ds = deeplake.open("s3://bucket/my_dataset",
    creds={"creds_key": "managed_creds_key"}, org_id="my_org_id")

ds = deeplake.open("s3://bucket/path/to/dataset")
ds = deeplake.open("azure://bucket/path/to/dataset")
ds = deeplake.open("gcs://bucket/path/to/dataset")
```
deeplake.open_read_only¶
```python
open_read_only(
    url: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> ReadOnlyDataset
```
Opens an existing dataset in read-only mode.
See `deeplake.open` for opening datasets for modification.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`url` | `str` | The URL of the dataset. URLs can be specified using the `file://`, `s3://`, `azure://`, `gcs://`, `al://`, or `mem://` protocols. A URL without a protocol is assumed to be a `file://` URL. | required |
`creds` | `dict, str` | Credentials used to access the dataset at the URL, such as cloud provider access keys or a managed credentials key. | `None` |
`token` | `str` | Activeloop token used to authenticate the user. | `None` |
Examples:
ds = deeplake.open_read_only("directory_path")
ds.summary()
Example Output:
Dataset length: 5
Columns:
id : int32
url : text
embedding: embedding(768)
ds = deeplake.open_read_only("file:///path/to/dataset")
ds = deeplake.open_read_only("s3://bucket/path/to/dataset")
ds = deeplake.open_read_only("azure://bucket/path/to/dataset")
ds = deeplake.open_read_only("gcs://bucket/path/to/dataset")
ds = deeplake.open_read_only("mem://in-memory")
deeplake.like¶
```python
like(
    src: DatasetView,
    dest: str,
    creds: dict[str, str] | None = None,
    token: str | None = None,
) -> Dataset
```
Creates a new dataset by copying the source dataset's structure to a new location.
Note
No data is copied.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`src` | `DatasetView` | The dataset to copy the structure from. | required |
`dest` | `str` | The URL to create the new dataset at. | required |
`creds` | `dict, str` | Credentials used to access the destination URL, such as cloud provider access keys or a managed credentials key. | `None` |
`token` | `str` | Activeloop token, used for fetching credentials to the dataset at the path if it is a Deep Lake dataset. This is optional; tokens are normally autogenerated. | `None` |
Examples:
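A minimal sketch, assuming an existing dataset `ds`; the destination URL is illustrative:
```python
# Copies only the schema of `ds` to a new location; no data is copied.
new_ds = deeplake.like(src=ds, dest="s3://bucket/new_dataset")
```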
Dataset Class¶
The main class providing full read-write access.
deeplake.Dataset¶
Bases: DatasetView
Datasets are the primary data structure used in Deep Lake. They are used to store and manage data for searching, training, and evaluation.
Unlike `deeplake.ReadOnlyDataset`, instances of `Dataset` can be modified.
add_column¶
```python
add_column(
    name: str,
    dtype: DataType | str | Type | type | Callable,
    format: DataFormat | None = None,
) -> None
```
Add a new column to the dataset.
Any existing rows in the dataset will have a `None` value for the new column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | The name of the column. | required |
`dtype` | `DataType \| str \| Type \| type \| Callable` | The type of the column. Possible values include a `deeplake.types` type (e.g. `deeplake.types.Int32`), a constructed type instance (e.g. `deeplake.types.Text()`), or a type name string (e.g. `"int32"`). | required |
`format` | `DataFormat` | The format of the column, if applicable. Only required when the dtype is `deeplake.types.DataType`. | `None` |
Examples:
ds.add_column("labels", deeplake.types.Int32)
ds.add_column("categories", "int32")
ds.add_column("name", deeplake.types.Text())
ds.add_column("json_data", deeplake.types.Dict())
ds.add_column("images", deeplake.types.Image(dtype=deeplake.types.UInt8(), sample_compression="jpeg"))
ds.add_column("embedding", deeplake.types.Embedding(size=768))
Raises:
Type | Description |
---|---|
`ColumnAlreadyExistsError` | If a column with the same name already exists. |
append¶
```python
append(data: DatasetView) -> None
append(
    data: list[dict[str, Any]] | dict[str, Any] | DatasetView
) -> None
```
Adds data to the dataset.
The data can be in a variety of formats:
- A list of dictionaries, where each entry in the list is a row, with the dicts containing the column names and their values for the row.
- A dictionary, where the keys are the column names and the values are array-like (list or numpy.array) objects corresponding to the column values.
- A `DatasetView` that was generated through any mechanism.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`data` | `list[dict[str, Any]] \| dict[str, Any] \| DatasetView` | The data to insert into the dataset. | required |
Examples:
ds.append({"name": ["Alice", "Bob"], "age": [25, 30]})
ds.append([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}])
ds2.append({
"embedding": np.random.rand(4, 768),
"text": ["Hello World"] * 4})
ds2.append([{"embedding": np.random.rand(768), "text": "Hello World"}] * 4)
Raises:
Type | Description |
---|---|
`ColumnMissingAppendValueError` | If any column is missing from the input data. |
`UnevenColumnsError` | If the input data columns are not the same length. |
`InvalidTypeDimensions` | If the input data does not match the column's dimensions. |
commit¶
Atomically commits changes you have made to the dataset. After commit, other users will see your changes to the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`message` | `str` | A message to store in history describing the changes made in the version. | `None` |
Examples:
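A minimal sketch, assuming a dataset `ds` with `name` and `age` columns:
```python
ds.append({"name": ["Alice"], "age": [25]})
ds.commit("Added one row")  # the message is optional
```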
commit_async¶
Asynchronously commits changes you have made to the dataset.
See `deeplake.Dataset.commit` for more information.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`message` | `str` | A message to store in history describing the changes made in the commit. | `None` |
Examples:
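A minimal sketch; this assumes the returned future exposes a blocking `wait()`, which is not shown in the signature above:
```python
future = ds.commit_async("Added one row")
future.wait()  # block until the commit completes (assumed API)
```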
tag¶
```python
tag(name: str, version: str | None = None) -> Tag
```
Tags a version of the dataset. If no version is given, the current version is tagged.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`name` | `str` | The name of the tag. | required |
`version` | `str \| None` | The version of the dataset to tag. | `None` |
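A minimal sketch, assuming a dataset `ds` with committed changes:
```python
tag = ds.tag("v1.0")  # tags the current version
```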
push¶
Pushes any new history from this dataset to the dataset at the given URL.
Similar to `deeplake.Dataset.pull` but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`url` | `str` | The URL of the destination dataset. | required |
`creds` | `dict[str, str] \| None` | Optional credentials needed to connect to the dataset. | `None` |
`token` | `str \| None` | Optional Deep Lake token. | `None` |
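A minimal sketch; the destination URL and credential variables are illustrative:
```python
# Push local commits to a copy of the dataset in your cloud:
ds.push("s3://bucket/dataset_copy",
        creds={"aws_access_key_id": id, "aws_secret_access_key": key})
```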
pull¶
Pulls any new history from the dataset at the given URL into this dataset.
Similar to `deeplake.Dataset.push` but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`url` | `str` | The URL of the source dataset. | required |
`creds` | `dict[str, str] \| None` | Optional credentials needed to connect to the dataset. | `None` |
`token` | `str \| None` | Optional Deep Lake token. | `None` |
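A minimal sketch, mirroring the `push` example above:
```python
# Fetch new commits from the remote copy into this dataset:
ds.pull("s3://bucket/dataset_copy",
        creds={"aws_access_key_id": id, "aws_secret_access_key": key})
```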
query¶
```python
query(query: str) -> DatasetView
```
Executes the given TQL query against the dataset and returns the results as a `deeplake.DatasetView`.
Examples:
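A minimal sketch, using the same filter as the `query_async` example below:
```python
view = ds.query("select * where category == 'active'")
for row in view:
    print("Id is: ", row["id"])
```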
query_async¶
Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a `deeplake.DatasetView`.
Examples:
```python
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
```
description instance-attribute¶
The description of the dataset. Setting the value will immediately persist the change without requiring a `commit()`.
indexing_mode instance-attribute¶
The indexing mode of the dataset. This property can be set to change the indexing mode of the dataset for the current session; other sessions will not be affected.
<!-- test-context
```python
import deeplake
ds = deeplake.create("tmp://")
ds.indexing_mode = deeplake.IndexingMode.Off
ds.add_column("column_name", deeplake.types.Text(deeplake.types.BM25))
a = ['a']*10_000
ds.append({"column_name":a})
ds.commit()
```
-->
Examples:
```python
ds = deeplake.open("tmp://")
ds.indexing_mode = deeplake.IndexingMode.Automatic
ds.commit()
```
ReadOnlyDataset Class¶
Read-only version of Dataset. Cannot modify data but provides access to all data and metadata.
deeplake.ReadOnlyDataset¶
Bases: DatasetView
query¶
```python
query(query: str) -> DatasetView
```
Executes the given TQL query against the dataset and returns the results as a `deeplake.DatasetView`.
Examples:
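A minimal sketch, mirroring the `query_async` example below:
```python
view = ds.query("select * where category == 'active'")
```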
query_async¶
Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a `deeplake.DatasetView`.
Examples:
```python
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
```
DatasetView Class¶
Lightweight view returned by queries. Provides read-only access to query results.
deeplake.DatasetView¶
A DatasetView is a dataset-like structure. It has a defined schema and contains data which can be queried.
query¶
```python
query(query: str) -> DatasetView
```
Executes the given TQL query against the dataset and returns the results as a `deeplake.DatasetView`.
Examples:
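A minimal sketch, mirroring the `query_async` example below:
```python
subset = view.query("select * where category == 'active'")
```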
query_async¶
Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a `deeplake.DatasetView`.
Examples:
```python
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
```
tensorflow¶
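The Class Comparison section below uses `view.tensorflow()` for ML framework integration. A minimal usage sketch, assuming TensorFlow is installed and that the returned wrapper is iterable like its `pytorch()` counterpart:
```python
tf_ds = ds.tensorflow()
for sample in tf_ds:
    # each sample maps column names to values (assumption mirroring pytorch())
    break
```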
pytorch¶
Returns a PyTorch `torch.utils.data.Dataset` wrapper around this dataset.
By default, no transformations are applied and each row is returned as a `dict` with keys of column names.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
`transform` | `Callable[[Any], Any]` | A custom function to apply to each sample before returning it. | `None` |
Raises:
Type | Description |
---|---|
`ImportError` | If PyTorch is not installed. |
Examples:
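A minimal sketch, assuming PyTorch is installed; the wrapper can be passed to a standard `DataLoader`:
```python
from torch.utils.data import DataLoader

loader = DataLoader(ds.pytorch(), batch_size=32)
for batch in loader:
    # each batch collates the per-row dicts into a dict of batched values
    break
```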
Class Comparison¶
Dataset¶
- Full read-write access
- Can create/modify columns
- Can append/update data
- Can commit changes
- Can create version tags
- Can push/pull changes
ds = deeplake.create("s3://bucket/dataset")
# or
ds = deeplake.open("s3://bucket/dataset")
# Can modify
ds.add_column("images", deeplake.types.Image())
ds.add_column("labels", deeplake.types.ClassLabel("int32"))
ds.add_column("confidence", "float32")
ds["labels"].metadata["class_names"] = ["cat", "dog"]
ds.append([{"images": image_array, "labels": 0, "confidence": 0.9}])
ds.commit()
ReadOnlyDataset¶
- Read-only access
- Cannot modify data or schema
- Can view all data and metadata
- Can execute queries
- Returned by `open_read_only()`
ds = deeplake.open_read_only("s3://bucket/dataset")
# Can read
image = ds["images"][0]
metadata = ds.metadata
# Cannot modify
# ds.append([...]) # Would raise error
DatasetView¶
- Read-only access
- Cannot modify data
- Optimized for query results
- Direct integration with ML frameworks
- Returned by `query()`
```python
# Get a view through a query
view = ds.query("SELECT *")

# Access data
image = view["images"][0]

# ML framework integration
torch_dataset = view.pytorch()
tf_dataset = view.tensorflow()
```
Examples¶
Querying Data¶
```python
# Using Dataset
ds = deeplake.open("s3://bucket/dataset")
results = ds.query("SELECT * WHERE labels = 'cat'")

# Using ReadOnlyDataset
ds = deeplake.open_read_only("s3://bucket/dataset")
results = ds.query("SELECT * WHERE labels = 'cat'")

# Using DatasetView
view = ds.query("SELECT * WHERE labels = 'cat'")
subset = view.query("SELECT * WHERE confidence > 0.9")
```
Data Access¶
# Common access patterns work on all types
for row in ds: # Works for Dataset, ReadOnlyDataset, and DatasetView
image = row["images"]
label = row["labels"]
# Column access works on all types
images = ds["images"][:]
labels = ds["labels"][:]