Dataset Classes¶
Deep Lake provides three dataset classes with different access levels:
Class | Description |
---|---|
Dataset | Full read-write access with all operations |
ReadOnlyDataset | Read-only access to prevent modifications |
DatasetView | Read-only view of query results |
Creation Methods¶
deeplake¶
create¶
create(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
schema: dict[str, Any] | Any | None = None,
) -> Dataset
Creates a new dataset at the given URL.
To open an existing dataset, use deeplake.open.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. Supported protocols include file://, s3://, gcs://, azure://, mem://, and al:// (see the examples below); a URL without a protocol is assumed to be a file:// URL. | required |
creds | dict[str, str] \| None | Credentials used to access the dataset at the path, e.g. AWS keys for s3:// URLs or a managed creds_key. | None |
token | str \| None | Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated. | None |
schema | dict[str, Any] \| Any \| None | The initial schema to use for the dataset. | None |
Examples:
# Create a dataset in your local filesystem:
ds = deeplake.create("directory_path")
ds.add_column("id", types.Int32())
ds.add_column("url", types.Text())
ds.add_column("embedding", types.Embedding(768))
ds.commit()
ds.summary()
# Create dataset in your app.activeloop.ai organization:
ds = deeplake.create("al://organization_id/dataset_name")
# Create a dataset stored in your cloud using specified credentials:
ds = deeplake.create("s3://mybucket/my_dataset",
creds = {"aws_access_key_id": id, "aws_secret_access_key": key})
# Create dataset stored in your cloud using app.activeloop.ai managed credentials.
ds = deeplake.create("s3://mybucket/my_dataset",
creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id")
ds = deeplake.create("azure://bucket/path/to/dataset")
ds = deeplake.create("gcs://bucket/path/to/dataset")
ds = deeplake.create("mem://in-memory")
Raises:
Type | Description |
---|---|
LogExistsError | If a dataset already exists at the given URL |
like¶
like(
src: DatasetView,
dest: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> Dataset
Creates a new dataset by copying the source dataset's structure to a new location.
Note
No data is copied.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
src | DatasetView | The dataset to copy the structure from. | required |
dest | str | The URL to create the new dataset at. | required |
creds | dict[str, str] \| None | Credentials used to access the destination dataset. | None |
token | str \| None | Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated. | None |
Examples:
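A minimal sketch (the dataset URLs are illustrative):
# Copy only the schema of an existing dataset to a new, empty dataset
src = deeplake.open_read_only("al://organization_id/source_dataset")
ds = deeplake.like(src, "al://organization_id/schema_only_copy")
ds.summary()  # same columns as the source, zero rows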
from_parquet¶
from_parquet(url_or_bytes: bytes | str) -> ReadOnlyDataset
Opens a Parquet dataset in the deeplake format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url_or_bytes | bytes \| str | The URL of the Parquet dataset or the bytes of a Parquet file. A URL without a protocol is assumed to be a file:// URL. | required |
from_csv¶
from_csv(url_or_bytes: bytes | str) -> ReadOnlyDataset
Opens a CSV dataset in the deeplake format.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url_or_bytes | bytes \| str | The URL of the CSV dataset or the bytes of a CSV file. A URL without a protocol is assumed to be a file:// URL. | required |
open¶
open(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> Dataset
Opens an existing dataset, potentially for modifying its content.
See deeplake.open_read_only for opening the dataset in read-only mode.
To create a new dataset, see deeplake.create.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. Supported protocols include file://, s3://, gcs://, azure://, and al:// (see the examples below); a URL without a protocol is assumed to be a file:// URL. | required |
creds | dict[str, str] \| None | Credentials used to access the dataset at the path, e.g. AWS keys for s3:// URLs or a managed creds_key. | None |
token | str \| None | Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated. | None |
Examples:
# Load dataset managed by Deep Lake.
ds = deeplake.open("al://organization_id/dataset_name")
# Load dataset stored in your cloud using your own credentials.
ds = deeplake.open("s3://bucket/my_dataset",
creds = {"aws_access_key_id": id, "aws_secret_access_key": key})
# Load dataset stored in your cloud using Deep Lake managed credentials.
ds = deeplake.open("s3://bucket/my_dataset",
creds = {"creds_key": "managed_creds_key"}, org_id = "my_org_id")
ds = deeplake.open("s3://bucket/path/to/dataset")
ds = deeplake.open("azure://bucket/path/to/dataset")
ds = deeplake.open("gcs://bucket/path/to/dataset")
open_read_only¶
open_read_only(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> ReadOnlyDataset
Opens an existing dataset in read-only mode.
See deeplake.open for opening datasets for modification.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset. Supported protocols are the same as for deeplake.open; a URL without a protocol is assumed to be a file:// URL. | required |
creds | dict[str, str] \| None | Credentials used to access the dataset at the path. | None |
token | str \| None | Activeloop token to authenticate user. | None |
Examples:
ds = deeplake.open_read_only("directory_path")
ds.summary()
Example Output:
Dataset length: 5
Columns:
id : int32
url : text
embedding: embedding(768)
ds = deeplake.open_read_only("file:///path/to/dataset")
ds = deeplake.open_read_only("s3://bucket/path/to/dataset")
ds = deeplake.open_read_only("azure://bucket/path/to/dataset")
ds = deeplake.open_read_only("gcs://bucket/path/to/dataset")
ds = deeplake.open_read_only("mem://in-memory")
delete¶
Deletes an existing dataset.
Warning
This operation is irreversible. All data will be lost.
If concurrent processes are attempting to write to the dataset while it's being deleted, it may lead to data inconsistency. It's recommended to use this operation with caution.
delete_async¶
delete_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> FutureVoid
Asynchronously deletes an existing dataset.
Warning
This operation is irreversible. All data will be lost.
If concurrent processes are attempting to write to the dataset while it's being deleted, it may lead to data inconsistency. It's recommended to use this operation with caution.
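Examples (the dataset URL is illustrative; waiting on the returned FutureVoid with .wait() follows the async examples at the end of this page):
# Blocking delete
deeplake.delete("s3://bucket/old_dataset")
# Non-blocking delete; block only when completion matters
deeplake.delete_async("s3://bucket/old_dataset").wait()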
exists¶
Check if a dataset exists at the given URL.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | URL of the dataset | required |
creds | dict[str, str] \| None | Credentials used to access the dataset at the path. | None |
token | str \| None | Activeloop token, used for fetching credentials to the dataset at path if it is a Deep Lake dataset. This is optional, tokens are normally autogenerated. | None |
query¶
query(
query: str,
token: str | None = None,
creds: dict[str, str] | None = None,
) -> DatasetView
Executes TQL queries optimized for ML data filtering and search.
TQL is a SQL-like query language designed for ML datasets, supporting:
- Vector similarity search
- Text semantic search
- Complex data filtering
- Joining across datasets
- Efficient sorting and pagination
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | TQL query string supporting vector similarity (COSINE_SIMILARITY, EUCLIDEAN_DISTANCE), text search (BM25_SIMILARITY, CONTAINS), filtering (WHERE clauses), sorting (ORDER BY), and joins (JOIN across datasets). | required |
token | str \| None | Optional Activeloop authentication token | None |
creds | dict[str, str] \| None | Dictionary containing credentials used to access the dataset at the path. | None |
Returns:
Name | Type | Description |
---|---|---|
DatasetView | DatasetView | Query results that can be used directly in ML training, further filtered with additional queries, converted to PyTorch/TensorFlow dataloaders, or materialized into a new dataset. |
Examples:
Vector similarity search:
# Find similar embeddings
similar = deeplake.query('''
SELECT * FROM "mem://embeddings"
ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
LIMIT 100
''')
# Use results in training
dataloader = similar.pytorch()
Text semantic search:
# Search documents using BM25
relevant = deeplake.query('''
SELECT * FROM "mem://documents"
ORDER BY BM25_SIMILARITY(text, 'machine learning') DESC
LIMIT 10
''')
Complex filtering:
# Filter training data
train = deeplake.query('''
SELECT * FROM "mem://dataset"
WHERE "train_split" = 'train'
AND confidence > 0.9
AND label IN ('cat', 'dog')
''')
Joins for feature engineering:
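A sketch only; the dataset URLs, join key, and exact TQL join syntax below are illustrative assumptions:
# Combine columns from two datasets that share a key column
joined = deeplake.query('''
    SELECT * FROM "mem://images"
    JOIN "mem://labels" ON images.sample_id == labels.sample_id
''')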
query_async¶
query_async(
query: str,
token: str | None = None,
creds: dict[str, str] | None = None,
) -> Future[DatasetView]
Asynchronously executes TQL queries optimized for ML data filtering and search.
Non-blocking version of query()
for better performance with large datasets.
Supports the same TQL features including vector similarity search, text search,
filtering, and joins.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | TQL query string supporting vector similarity (COSINE_SIMILARITY, EUCLIDEAN_DISTANCE), text search (BM25_SIMILARITY, CONTAINS), filtering (WHERE clauses), sorting (ORDER BY), and joins (JOIN across datasets). | required |
token | str \| None | Optional Activeloop authentication token | None |
creds | dict[str, str] \| None | Dictionary containing credentials used to access the dataset at the path. | None |
Returns:
Name | Type | Description |
---|---|---|
Future | Future[DatasetView] | Resolves to a DatasetView that can be used directly in ML training, further filtered with additional queries, converted to PyTorch/TensorFlow dataloaders, or materialized into a new dataset. |
Examples:
Basic async query:
# Run query asynchronously
future = deeplake.query_async('''
SELECT * FROM "mem://embeddings"
ORDER BY COSINE_SIMILARITY(vector, ARRAY[0.1, 0.2, 0.3]) DESC
''')
# Do other work while query runs
prepare_training()
# Get results when needed
results = future.result()
With async/await:
async def search_similar():
results = await deeplake.query_async('''
SELECT * FROM "mem://images"
ORDER BY COSINE_SIMILARITY(embedding, ARRAY[0.1, 0.2, 0.3]) DESC
LIMIT 100
''')
return results
async def main():
similar = await search_similar()
Non-blocking check:
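A sketch that assumes the returned Future exposes an is_completed() method (check the Future reference for the exact name):
future = deeplake.query_async('SELECT * FROM "mem://dataset"')
while not future.is_completed():
    do_other_work()  # placeholder for useful work while the query runs
results = future.result()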
explain_query¶
explain_query(
query: str,
token: str | None = None,
creds: dict[str, str] | None = None,
) -> ExplainQueryResult
Explains TQL query with optional authentication.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | TQL query string to explain | required |
token | str \| None | Optional Activeloop authentication token | None |
creds | dict[str, str] \| None | Dictionary containing credentials used to access the dataset at the path. | None |
Returns:
Name | Type | Description |
---|---|---|
ExplainQueryResult | ExplainQueryResult | An explain result object to analyze the query. |
Examples:
Explaining a query:
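For example (the dataset URL is illustrative; the same pattern appears under Advanced Query Operations below):
plan = deeplake.explain_query("SELECT * FROM 's3://bucket/dataset' WHERE labels = 'cat'")
print(plan)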
prepare_query¶
prepare_query(
query: str,
token: str | None = None,
creds: dict[str, str] | None = None,
) -> Executor
Prepares a TQL query for execution with optional authentication.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | TQL query string to execute | required |
token | str \| None | Optional Activeloop authentication token | None |
creds | dict[str, str] \| None | Dictionary containing credentials used to access the dataset at the path. | None |
Returns:
Name | Type | Description |
---|---|---|
Executor | Executor | An executor object to run the query. |
Examples:
Running a parametrized batch query:
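A sketch; the run_batch() call and its argument format are assumptions, so check the Executor reference for the exact API:
executor = deeplake.prepare_query("SELECT * FROM 's3://bucket/dataset' WHERE score > ?")
results = executor.run_batch([[0.5], [0.7], [0.9]])  # one result set per parameter list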
Dataset¶
The main class providing full read-write access.
deeplake.Dataset¶
Bases: DatasetView
Datasets are the primary data structure used in Deep Lake. They are used to store and manage data for searching, training, and evaluation.
Unlike deeplake.ReadOnlyDataset, instances of Dataset can be modified.
add_column¶
Add a new column to the dataset.
Any existing rows in the dataset will have a None value for the new column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the column | required |
default_value | Any | The default value to set for existing rows. If not provided, existing rows will have a None value for the new column. | None |
dtype | Any | The type of the column. Possible values include deeplake.types types (e.g. deeplake.types.Int32, deeplake.types.Text()) and type-name strings such as "int32" (see the examples below). | required |
Examples:
ds.add_column("labels", deeplake.types.Int32)
ds.add_column("categories", "int32")
ds.add_column("name", deeplake.types.Text())
ds.add_column("json_data", deeplake.types.Dict())
ds.add_column("images", deeplake.types.Image(dtype=deeplake.types.UInt8(), sample_compression="jpeg"))
ds.add_column("embedding", deeplake.types.Embedding(size=768))
Raises:
Type | Description |
---|---|
ColumnAlreadyExistsError | If a column with the same name already exists. |
remove_column¶
Removes an existing column from the dataset.
rename_column¶
Renames the existing column in the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the column to rename | required |
new_name | str | The new name to set for the column | required |
Examples:
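# Rename an existing column (column names are illustrative)
ds.rename_column("labels", "categories")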
Raises:
Type | Description |
---|---|
ColumnDoesNotExistsError | If a column with the specified name does not exist. |
ColumnAlreadyExistsError | If a column with the specified new name already exists. |
append¶
append(data: DatasetView) -> None
append(
data: (
list[dict[str, Any]] | dict[str, Any] | DatasetView
),
) -> None
Adds data to the dataset.
The data can be in a variety of formats:
- A list of dictionaries, each value in the list is a row, with the dicts containing the column name and its value for the row.
- A dictionary, the keys are the column names and the values are array-like (list or numpy.array) objects corresponding to the column values.
- A DatasetView that was generated through any mechanism
Parameters:
Name | Type | Description | Default |
---|---|---|---|
data | list[dict[str, Any]] \| dict[str, Any] \| DatasetView | The data to insert into the dataset. | required |
Examples:
ds.append({"name": ["Alice", "Bob"], "age": [25, 30]})
ds.append([{"name": "Alice", "age": 25}, {"name": "Bob", "age": 30}])
ds2.append({
"embedding": np.random.rand(4, 768),
"text": ["Hello World"] * 4})
ds2.append([{"embedding": np.random.rand(768), "text": "Hello World"}] * 4)
Raises:
Type | Description |
---|---|
ColumnMissingAppendValueError | If any column is missing from the input data. |
UnevenColumnsError | If the input data columns are not the same length. |
InvalidTypeDimensions | If the input data does not match the column's dimensions. |
auto_commit_enabled instance-attribute¶
branch¶
branch(name: str, version: str | None = None) -> Branch
Creates a branch from the given version of the current branch. If no version is given, the current version is used.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the branch | required |
version | str \| None | The version of the dataset to branch from | None |
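Examples:
# Create a branch from the current version (branch name is illustrative)
experiment = ds.branch("experiment")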
tag¶
tag(
name: str,
message: str | None = None,
version: str | None = None,
) -> Tag
Tags a version of the dataset. If no version is given, the current version is tagged.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the tag | required |
version | str \| None | The version of the dataset to tag | None |
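Examples:
# Tag the current version (tag name and message are illustrative)
ds.tag("v1.0", message="first stable snapshot")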
commit¶
Atomically commits the changes you have made to the dataset. After commit, other users will see your changes to the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | str \| None | A message to store in history describing the changes made in the version | None |
Examples:
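# Commit pending changes with a descriptive message (message text is illustrative)
ds.commit("Added initial columns and samples")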
commit_async¶
Asynchronously commits changes you have made to the dataset.
See deeplake.Dataset.commit for more information.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
message | str \| None | A message to store in history describing the changes made in the commit | None |
Examples:
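# Commit asynchronously and block only when completion matters (message text is illustrative)
ds.commit_async("Updated model predictions").wait()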
created_time property¶
When the dataset was created. The value is auto-generated at creation time.
delete¶
Delete a row from the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
offset | int | The offset of the row within the dataset to delete | required |
description instance-attribute¶
The description of the dataset. Setting the value will immediately persist the change without requiring a commit().
id property¶
The unique identifier of the dataset. Value is auto-generated at creation time.
indexing_mode instance-attribute¶
merge¶
Merges the given branch into the current branch. If no version is given, the branch's current version is used.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
branch_name | str | The name of the branch to merge | required |
version | str \| None | The version of the dataset to merge | None |
Examples:
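# Merge a branch back into the current branch (branch name is illustrative)
ds.merge("experiment")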
name instance-attribute¶
The name of the dataset. Setting the value will immediately persist the change without requiring a commit().
pull¶
Pulls any new history from the dataset at the given url into this dataset.
Similar to deeplake.Dataset.push, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset to pull history from | required |
creds | dict[str, str] \| None | Optional credentials needed to connect to the dataset | None |
token | str \| None | Optional deeplake token | None |
pull_async¶
pull_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> FutureVoid
Asynchronously pulls any new history from the dataset at the given url into this dataset.
Similar to deeplake.Dataset.push_async, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the dataset to pull history from | required |
creds | dict[str, str] \| None | Optional credentials needed to connect to the dataset | None |
token | str \| None | Optional deeplake token | None |
push¶
Pushes any new history from this dataset to the dataset at the given url.
Similar to deeplake.Dataset.pull, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the destination dataset | required |
creds | dict[str, str] \| None | Optional credentials needed to connect to the dataset | None |
token | str \| None | Optional deeplake token | None |
push_async¶
push_async(
url: str,
creds: dict[str, str] | None = None,
token: str | None = None,
) -> FutureVoid
Asynchronously pushes any new history from this dataset to the dataset at the given url.
Similar to deeplake.Dataset.pull_async, but in the other direction.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
url | str | The URL of the destination dataset | required |
creds | dict[str, str] \| None | Optional credentials needed to connect to the dataset | None |
token | str \| None | Optional deeplake token | None |
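Examples (URLs are illustrative; waiting on the returned FutureVoid with .wait() follows the async examples at the end of this page):
# Push local history to a remote copy of the dataset
ds.push("s3://backup/dataset")
ds.push_async("s3://backup/dataset").wait()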
query¶
query(query: str) -> DatasetView
Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.
Examples:
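results = ds.query("select * where category == 'active'")
for row in results:
    print("Id is: ", row["id"])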
query_async¶
query_async(query: str) -> Future[DatasetView]
Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.
Examples:
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
prepare_query¶
prepare_query(query: str) -> Executor
Prepares a query for execution.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query to prepare | required |
Returns:
Name | Type | Description |
---|---|---|
Executor | Executor | The prepared query |
Examples:
explain_query¶
explain_query(query: str) -> ExplainQueryResult
Explains a query.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query to explain | required |
Returns:
Name | Type | Description |
---|---|---|
ExplainQueryResult | ExplainQueryResult | The result of the explanation |
Examples:
refresh¶
Refreshes the dataset, picking up any new info from storage.
Similar to deeplake.open_read_only, but more lightweight.
refresh_async¶
Asynchronously refreshes the dataset, picking up any new info from storage.
Similar to deeplake.open_read_only_async, but more lightweight.
set_creds_key¶
Sets the key used to store the credentials for the dataset.
to_csv¶
batches¶
Batches can be used to more efficiently stream large amounts of data from a Deep Lake dataset, such as into a DataLoader and then into the training framework.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_size | int | Number of rows in each batch | required |
drop_last | bool | Whether to drop the final batch if it is incomplete | False |
Examples:
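# Stream the dataset in fixed-size batches (batch size is illustrative)
for batch in ds.batches(batch_size=128, drop_last=True):
    train_step(batch)  # placeholder for the training loop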
tensorflow¶
pytorch¶
Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.
By default, no transformations are applied and each row is returned as a dict with keys of column names.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Callable[[Any], Any] | A custom function to apply to each sample before returning it | None |
Raises:
Type | Description |
---|---|
ImportError | If pytorch is not installed |
Examples:
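# Wrap the dataset for use with a PyTorch DataLoader (column names are illustrative)
from torch.utils.data import DataLoader
torch_ds = ds.pytorch(transform=lambda row: (row["images"], row["labels"]))
loader = DataLoader(torch_ds, batch_size=32)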
ReadOnlyDataset¶
Read-only version of Dataset. Cannot modify data but provides access to all data and metadata.
deeplake.ReadOnlyDataset¶
Bases: DatasetView
branches property¶
branches: BranchesView
The collection of deeplake.BranchView objects within the dataset.
tag¶
tag(
name: str | None = None, message: str | None = None
) -> Tag
Saves the current view as a tag to its source dataset and returns the tag.
created_time property¶
When the dataset was created. The value is auto-generated at creation time.
id property¶
The unique identifier of the dataset. Value is auto-generated at creation time.
query¶
query(query: str) -> DatasetView
Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.
Examples:
query_async¶
query_async(query: str) -> Future[DatasetView]
Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.
Examples:
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
refresh¶
Refreshes the dataset, picking up any new info from storage.
Similar to deeplake.open_read_only, but more lightweight.
refresh_async¶
Asynchronously refreshes the dataset, picking up any new info from storage.
Similar to deeplake.open_read_only_async, but more lightweight.
explain_query¶
explain_query(query: str) -> ExplainQueryResult
Explains a query.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query to explain | required |
Returns:
Name | Type | Description |
---|---|---|
ExplainQueryResult | ExplainQueryResult | The result of the explanation |
Examples:
prepare_query¶
prepare_query(query: str) -> Executor
Prepares a query for execution.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | The query to prepare | required |
Returns:
Name | Type | Description |
---|---|---|
Executor | Executor | The prepared query |
Examples:
to_csv¶
tensorflow¶
pytorch¶
Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.
By default, no transformations are applied and each row is returned as a dict with keys of column names.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Callable[[Any], Any] | A custom function to apply to each sample before returning it | None |
Raises:
Type | Description |
---|---|
ImportError | If pytorch is not installed |
Examples:
batches¶
Batches can be used to more efficiently stream large amounts of data from a Deep Lake dataset, such as into a DataLoader and then into the training framework.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_size | int | Number of rows in each batch | required |
drop_last | bool | Whether to drop the final batch if it is incomplete | False |
Examples:
DatasetView¶
Lightweight view returned by queries. Provides read-only access to query results.
deeplake.DatasetView¶
A DatasetView is a dataset-like structure. It has a defined schema and contains data which can be queried.
batches¶
Batches can be used to more efficiently stream large amounts of data from a Deep Lake dataset, such as into a DataLoader and then into the training framework.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_size | int | Number of rows in each batch | required |
drop_last | bool | Whether to drop the final batch if it is incomplete | False |
Examples:
pytorch¶
Returns a PyTorch torch.utils.data.Dataset wrapper around this dataset.
By default, no transformations are applied and each row is returned as a dict with keys of column names.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
transform | Callable[[Any], Any] | A custom function to apply to each sample before returning it | None |
Raises:
Type | Description |
---|---|
ImportError | If pytorch is not installed |
Examples:
query¶
query(query: str) -> DatasetView
Executes the given TQL query against the dataset and returns the results as a deeplake.DatasetView.
Examples:
query_async¶
query_async(query: str) -> Future[DatasetView]
Asynchronously executes the given TQL query against the dataset and returns a future that will resolve into a deeplake.DatasetView.
Examples:
future = ds.query_async("select * where category == 'active'")
result = future.result()
for row in result:
    print("Id is: ", row["id"])

async def query_and_process():
    # or use the Future in an await expression
    future = ds.query_async("select * where category == 'active'")
    result = await future
    for row in result:
        print("Id is: ", row["id"])
tag¶
tag(
name: str | None = None, message: str | None = None
) -> Tag
Saves the current view as a tag to its source dataset and returns the tag.
tensorflow¶
to_csv¶
Class Comparison¶
Dataset¶
- Full read-write access
- Can create/modify columns
- Can append/update/delete data
- Can commit changes (sync and async)
- Can create version tags and branches
- Can push/pull changes (sync and async)
- Can merge branches
- Auto-commit functionality
- Dataset refresh capabilities
- Full metadata access
ds = deeplake.create("s3://bucket/dataset")
# or
ds = deeplake.open("s3://bucket/dataset")
# Can modify
ds.add_column("images", deeplake.types.Image())
ds.add_column("labels", deeplake.types.ClassLabel("int32"))
ds.add_column("confidence", "float32")
ds["labels"].metadata["class_names"] = ["cat", "dog"]
ds.append([{"images": image_array, "labels": 0, "confidence": 0.9}])
ds.commit()
ReadOnlyDataset¶
- Read-only access
- Cannot modify data or schema
- Can view all data and metadata
- Can execute queries (sync and async)
- Can refresh dataset state
- Access to version history and branches
- Full schema and property access
- Returned by open_read_only()
ds = deeplake.open_read_only("s3://bucket/dataset")
# Can read
image = ds["images"][0]
metadata = ds.metadata
# Cannot modify
# ds.append([...]) # Would raise error
DatasetView¶
- Read-only access
- Cannot modify data
- Optimized for query results
- Direct integration with ML frameworks (PyTorch, TensorFlow)
- Batch processing capabilities
- Query chaining support
- Export to CSV functionality
- Schema access
- Returned by query() and tag operations
# Get view through query
view = ds.query("SELECT *")
# Access data
image = view["images"][0]
# ML framework integration
torch_dataset = view.pytorch()
tf_dataset = view.tensorflow()
Examples¶
Querying Data¶
# Using Dataset
ds = deeplake.open("s3://bucket/dataset")
results = ds.query("SELECT * WHERE labels = 'cat'")
# Using ReadOnlyDataset
ds = deeplake.open_read_only("s3://bucket/dataset")
results = ds.query("SELECT * WHERE labels = 'cat'")
# Using DatasetView
view = ds.query("SELECT * WHERE labels = 'cat'")
subset = view.query("SELECT * WHERE confidence > 0.9")
Data Access¶
# Common access patterns work on all types
for row in ds: # Works for Dataset, ReadOnlyDataset, and DatasetView
image = row["images"]
label = row["labels"]
# Column access works on all types
images = ds["images"][:]
labels = ds["labels"][:]
Import/Export Data¶
# Import from Parquet file
ds = deeplake.from_parquet("data.parquet")
# or from bytes
with open("data.parquet", "rb") as f:
    ds = deeplake.from_parquet(f.read())
# Import from CSV file
ds = deeplake.from_csv("data.csv")
# or from bytes
with open("data.csv", "rb") as f:
    ds = deeplake.from_csv(f.read())
# Export query results to CSV
view = ds.query("SELECT * WHERE labels = 'cat'")
import io
output = io.StringIO()
view.to_csv(output)
csv_data = output.getvalue()
Async Operations¶
# Async query works on all types
future = ds.query_async("SELECT * WHERE labels = 'cat'")
results = future.result()
# Async data access
future = ds["images"].get_async(slice(0, 1000))
images = future.result()
# Async dataset operations
future = ds.commit_async("Updated model predictions")
future.wait()
# Async push/pull
ds.push_async("s3://backup/dataset").wait()
ds.pull_async("s3://upstream/dataset").wait()
Dataset Management¶
# Check if dataset exists
if deeplake.exists("s3://bucket/dataset"):
ds = deeplake.open("s3://bucket/dataset")
else:
ds = deeplake.create("s3://bucket/dataset")
# Auto-commit functionality
ds.auto_commit_enabled = True # Enable automatic commits
# Refresh dataset to get latest changes
ds.refresh()
# Delete dataset (irreversible!)
deeplake.delete("s3://old-bucket/dataset")
Advanced Query Operations¶
# Global query functions
results = deeplake.query("SELECT * FROM 's3://dataset' WHERE confidence > 0.9")
# Async global queries
future = deeplake.query_async("SELECT * FROM 's3://dataset' LIMIT 1000")
results = future.result()
# Explain query execution plan
plan = deeplake.explain_query("SELECT * FROM 's3://dataset' WHERE labels = 'cat'")
print(plan)
# Prepare reusable query executor
executor = deeplake.prepare_query("SELECT * FROM 's3://dataset' WHERE score > ?")