v3 to v4 Conversion
Deep Lake 4.0 introduced a new API as well as a new on-disk format that improves performance and scalability.
Datasets created in Deep Lake 3.x or earlier can still be queried, but they cannot be modified until they are converted to the new v4 format.
Converting v3 Scripts to v4
While the API has changed between 3.x and 4.0, many of the concepts and methods are similar.
Changed Methods
- The deeplake.load() method has been renamed to deeplake.open().
- The hub:// prefix for paths has been changed to al://. You can still use hub:// as an alias, but its use is deprecated.
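When updating scripts, the prefix change can be applied mechanically. The following is a minimal sketch, not part of the Deep Lake API: `to_v4_path` is a hypothetical helper name, and `my_org/my_dataset` is a placeholder path used only for illustration.

```python
# Hypothetical helper (not part of deeplake): rewrite the deprecated
# hub:// prefix to the al:// prefix used in 4.0+.
V3_PREFIX = "hub://"
V4_PREFIX = "al://"


def to_v4_path(path: str) -> str:
    """Return `path` with a leading hub:// replaced by al://.

    Paths with any other prefix (s3://, local paths, ...) are returned unchanged.
    """
    if path.startswith(V3_PREFIX):
        return V4_PREFIX + path[len(V3_PREFIX):]
    return path


# Example use in a migrated script (placeholder path, requires deeplake 4.x
# and a dataset already in the v4 format):
#   import deeplake
#   ds = deeplake.open(to_v4_path("hub://my_org/my_dataset"))
```

Because hub:// still works as an alias, this rewrite is optional, but it keeps migrated scripts off the deprecated form.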
New Concepts
Transactions & Consistency
In 4.0+, you must explicitly call deeplake.Dataset.commit() for changes to your dataset to be persisted and available to others. This is true for both data changes and schema changes.
ds = deeplake.create("test_dataset")
ds.add_column("col1", deeplake.types.Text())
ds.commit()
ds.append([{"col1": "value1"}])
ds.commit("Added value 1")
Until you call commit(), the changes are only visible to your local dataset variable.
For more information, see transactions & consistency.
Immutable Versions
You are able to open a previous version of the dataset through the history object and deeplake.Dataset.tag().
For more information, see history & tagging.
Removed Functionality
To streamline functionality, some features have been removed. Some may return in improved versions in future releases. If you have any questions or suggestions, please let us know in our Slack Community.
- Branches
- @deeplake.compute decorators
Reading v3 Datasets
While you cannot open a v3 dataset with deeplake.open(), you can still query it using the deeplake.query() method.
TQL supports specifying datasets by URL, so a v3 dataset can be read by naming its URL in the query's from clause.
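A minimal sketch of such a query follows. The helper names `build_v3_query` and `query_v3_dataset` are hypothetical, and the statement mirrors the `select * from "<url>"` form used in the example script later in this guide; `deeplake` is imported lazily so the string-building part can be used without the library installed.

```python
def build_v3_query(url: str) -> str:
    """Return a TQL statement that selects every column from the dataset at `url`."""
    if '"' in url:
        # The URL is wrapped in double quotes inside the statement.
        raise ValueError("dataset URL must not contain double quotes")
    return f'select * from "{url}"'


def query_v3_dataset(url: str):
    """Run the query against a v3 dataset and return the read-only result."""
    import deeplake  # imported here so build_v3_query works without deeplake

    return deeplake.query(build_v3_query(url))


# Example (placeholder URL, requires deeplake 4.x and access to the dataset):
#   view = query_v3_dataset("s3://source/url")
#   print("Rows:", len(view))
```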
Converting v3 -- Automatic
To convert a v3 dataset to the new v4 format, you can use the deeplake.convert() method.
This copies all the data from the existing dataset into a new dataset in the v4 format with the same schema as the original dataset. The existing dataset is not modified or deleted.
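As a sketch, a conversion call can be wrapped as shown here. This assumes `deeplake.convert()` takes the source path followed by the destination path; check the API reference for the exact signature and any credential arguments. The wrapper name `convert_v3_dataset`, its same-path guard, and the placeholder paths are illustrative and not part of the library.

```python
def convert_v3_dataset(src: str, dst: str) -> None:
    """Convert the v3 dataset at `src` into a new v4 dataset at `dst`.

    The source is left untouched, so `dst` must be a different location.
    """
    if src == dst:
        raise ValueError("destination must differ from the source dataset path")

    import deeplake  # imported lazily; requires deeplake 4.x

    # Assumed call shape: convert(source_path, destination_path).
    deeplake.convert(src, dst)


# Example (placeholder paths):
#   convert_v3_dataset("al://my_org/v3_dataset", "al://my_org/v4_dataset")
```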
Converting v3 -- Manual
The automatic conversion above is the recommended way to convert v3 datasets to the new v4 format.
However, if you are looking for more control over the conversion process including using a different schema, you can convert a v3 dataset to the new v4 format manually.
The general steps are:
- Create the new v4 dataset with the desired schema.
- Create a query that reads the data from the v3 dataset.
- Write the data to the new v4 dataset.
Example Script
import deeplake
from tqdm import tqdm

# Create your new dataset
dest_ds = deeplake.create("s3://target/url")
dest_ds.add_column("col1", deeplake.types.Text())
dest_ds.add_column("col2", deeplake.types.Embedding(768))
dest_ds.add_column("col3", deeplake.types.Text(
    index_type=deeplake.types.TextIndexType.Inverted))
dest_ds.commit("Added columns")

# Create a query that reads the data from the v3 dataset
source_ds = deeplake.query('select * from "s3://source/url"')
print("Source size: ", len(source_ds))

# Copy the data to the new dataset.
# Uses the Prefetcher to speed up the process, and commits every 100 batches
# (100 batches x 10,000 rows = 1M rows).
loader = deeplake.Prefetcher(source_ds, batch_size=10000)
counter = 0
for batch in tqdm(loader):
    dest_ds.append(batch)
    counter += 1
    if counter % 100 == 0:
        dest_ds.commit()

# Commit any rows left over after the last full group of 100 batches
dest_ds.commit()
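The copy loop in the script can be factored into a reusable function. The sketch below is illustrative only: `copy_in_batches` and `InMemoryDataset` are hypothetical names, and the function assumes nothing about the destination beyond the `append()` and `commit()` calls used in the script. The in-memory class is a stand-in for trying out the commit cadence without touching storage.

```python
from typing import Any, Iterable, List


def copy_in_batches(batches: Iterable[Any], dest: Any, commit_every: int = 100) -> int:
    """Append each batch to `dest`, committing after every `commit_every` batches.

    A final commit is issued only if batches remain uncommitted.
    Returns the number of batches copied.
    """
    if commit_every < 1:
        raise ValueError("commit_every must be at least 1")
    copied = 0
    for batch in batches:
        dest.append(batch)
        copied += 1
        if copied % commit_every == 0:
            dest.commit()
    if copied % commit_every != 0:
        dest.commit()
    return copied


class InMemoryDataset:
    """Hypothetical stand-in for a dataset; records rows and counts commits."""

    def __init__(self) -> None:
        self.rows: List[Any] = []
        self.commits = 0

    def append(self, batch: Iterable[Any]) -> None:
        self.rows.extend(batch)

    def commit(self) -> None:
        self.commits += 1


# With a real dataset the call would look like (names from the script):
#   copy_in_batches(deeplake.Prefetcher(source_ds, batch_size=10000), dest_ds)
```

Smaller `commit_every` values make progress visible to other readers sooner at the cost of more commits; larger values do the opposite.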