Dataset Copying and Synchronization

Deep Lake can copy and synchronize datasets across different storage locations. This supports use cases such as:

  • Moving datasets between cloud providers
  • Creating local copies of cloud datasets for faster access
  • Backing up datasets to different storage providers
  • Maintaining synchronized dataset replicas

Copying Datasets

Copy a dataset to a new location while preserving all data, metadata, and version history:

import deeplake

# Copy between storage providers
deeplake.copy(
    src="s3://source-bucket/dataset",
    dst="gcs://dest-bucket/dataset",
    dst_creds={"credentials": "for-dest-storage"}  # credentials for the destination storage
)

# Create a local copy of a cloud dataset
deeplake.copy(
    src="al://org/dataset",
    dst="./local/dataset"
)
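
Because a copy preserves the full version history, a quick sanity check is to open both datasets and compare their row counts. A minimal sketch, assuming the copy above has completed (open_read_only avoids accidental writes):

# Open both datasets read-only and compare row counts
src_ds = deeplake.open_read_only("s3://source-bucket/dataset")
dst_ds = deeplake.open_read_only("gcs://dest-bucket/dataset")
assert len(dst_ds) == len(src_ds)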

Dataset Synchronization

Pull Changes

Sync a dataset with its source by pulling new changes:

# Create dataset copy
deeplake.copy("s3://source/dataset", "s3://replica/dataset")

replica_ds = deeplake.open("s3://replica/dataset")

# Later, pull new changes from source
replica_ds.pull(
    url="s3://source/dataset",
    creds={"aws_access_key_id": "key", "aws_secret_access_key": "secret"}
)

# Pull changes asynchronously
async def pull_async():
    await replica_ds.pull_async("s3://source/dataset") 
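
pull_async is a coroutine, so it must be driven by an event loop. From synchronous code, a minimal way to run it is asyncio.run (standard Python, not a Deep Lake API):

import asyncio

# Run the pull coroutine to completion from synchronous code
asyncio.run(pull_async())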

Push Changes

Push local changes to another dataset location:

# Open the source dataset (path from the examples above) and make changes
ds = deeplake.open("s3://source/dataset")
ds.append({"images": new_images})
ds.commit()

# Push changes to replica
ds.push(
    url="s3://replica/dataset",
    creds={"aws_access_key_id": "key", "aws_secret_access_key": "secret"}
)

# Push changes asynchronously  
async def push_async():
    await ds.push_async("s3://replica/dataset")
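
Pull can also be wrapped in a periodic loop to keep a replica fresh. The sketch below is an illustration rather than a built-in Deep Lake feature: the paths and interval are placeholders, and it assumes pull is effectively a no-op when the replica is already up to date:

import time

import deeplake

SOURCE = "s3://source/dataset"      # placeholder paths
REPLICA = "s3://replica/dataset"

def sync_forever(interval_seconds: int = 300) -> None:
    # Periodically pull new commits from the source into the replica
    replica_ds = deeplake.open(REPLICA)
    while True:
        replica_ds.pull(SOURCE)
        time.sleep(interval_seconds)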

Synchronization Example

The following end-to-end example creates a source dataset, replicates it to a different provider, and synchronizes changes in both directions:

# Initial dataset creation
source_ds = deeplake.create("s3://bucket/source")
source_ds.add_column("images", deeplake.types.Image())
source_ds.commit()

# Create replica
deeplake.copy(
    src="s3://bucket/source",
    dst="gcs://bucket/replica"
)
replica_ds = deeplake.open("gcs://bucket/replica")

# Add data to source
source_ds.append({"images": batch1})
source_ds.commit()

# Sync replica with source
replica_ds.pull("s3://bucket/source")

# Add data to replica  
replica_ds.append({"images": batch2})
replica_ds.commit()

# Push replica changes back to source
replica_ds.push("s3://bucket/source")
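
After the final push, both datasets should contain batch1 and batch2. A minimal check, assuming a freshly reopened handle reflects the pushed commits:

# Reopen the source and confirm both sides have the same number of rows
source_ds = deeplake.open("s3://bucket/source")
assert len(source_ds) == len(replica_ds)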

Summary

  • Copying a dataset preserves all data, metadata, and version history
  • Push/pull transfers only the changes between datasets
  • Copy and sync work across storage providers: S3, GCS, Azure, local filesystems, etc.