deeplake.core.storage

Base Storage Provider

class deeplake.core.storage.StorageProvider
abstract __delitem__(path: str)

Delete the object present at the path.

Parameters:

path (str) – the path to the object relative to the root of the provider.

Raises:

KeyError – If an object is not found at the path.

abstract __getitem__(path: str)

Gets the object present at the path.

Parameters:

path (str) – The path relative to the root of the provider.

Returns:

The bytes of the object present at the path.

Return type:

bytes

Raises:

KeyError – If an object is not found at the path.

abstract __iter__()

Generator function that iterates over the keys of the provider.

Yields:

str – the path of the object that it is iterating over, relative to the root of the provider.

abstract __len__()

Returns the number of files present inside the root of the provider.

Returns:

the number of files present inside the root.

Return type:

int

abstract __setitem__(path: str, value: bytes)

Sets the object at the path to the given value.

Parameters:
  • path (str) – the path relative to the root of the provider.

  • value (bytes) – the value to be assigned at the path.

__weakref__

list of weak references to the object (if defined)

abstract _all_keys() → Set[str]

Returns the set of all keys of the provider.

Returns:

set of all keys present at the root of the provider.

Return type:

set

_is_hub_path = False

An abstract base class for implementing a storage provider.

To add a new provider, create a subclass of StorageProvider and implement all of the abstract methods listed in this section.
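As a rough illustration of the subclassing pattern described above, the sketch below defines a stand-in abstract base and a dict-backed subclass. Both `SketchProvider` and `DictProvider` are hypothetical names introduced here; they only mirror the method names listed in this section and are not the real `deeplake.core.storage.StorageProvider`, which may require more.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterator, Set


class SketchProvider(ABC):
    """Hypothetical stand-in for the abstract interface listed in this section."""

    @abstractmethod
    def __getitem__(self, path: str) -> bytes: ...

    @abstractmethod
    def __setitem__(self, path: str, value: bytes) -> None: ...

    @abstractmethod
    def __delitem__(self, path: str) -> None: ...

    @abstractmethod
    def __iter__(self) -> Iterator[str]: ...

    @abstractmethod
    def __len__(self) -> int: ...

    @abstractmethod
    def _all_keys(self) -> Set[str]: ...

    @abstractmethod
    def clear(self, prefix: str = "") -> None: ...


class DictProvider(SketchProvider):
    """Keeps every object in a plain dict keyed by its relative path."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}

    def __getitem__(self, path: str) -> bytes:
        return self._data[path]  # raises KeyError when the path is missing

    def __setitem__(self, path: str, value: bytes) -> None:
        self._data[path] = bytes(value)

    def __delitem__(self, path: str) -> None:
        del self._data[path]  # raises KeyError when the path is missing

    def __iter__(self) -> Iterator[str]:
        yield from self._all_keys()

    def __len__(self) -> int:
        return len(self._data)

    def _all_keys(self) -> Set[str]:
        return set(self._data)

    def clear(self, prefix: str = "") -> None:
        # Collect the matching keys first so the dict is not mutated mid-iteration.
        for key in [k for k in self._data if k.startswith(prefix)]:
            del self._data[key]


provider = DictProvider()
provider["a/chunk0"] = b"abcd"
provider["b/chunk0"] = b"efgh"
provider.clear(prefix="a/")
remaining = sorted(provider)
```

Leaving any of the abstract methods unimplemented makes the subclass impossible to instantiate, which is the point of marking them abstract.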

check_readonly()

Raises an exception if the provider is in read-only mode.

abstract clear(prefix='')

Delete the contents of the provider.

copy()

Returns a copy of the provider.

Returns:

A copy of the provider.

Return type:

StorageProvider

disable_readonly()

Disables read-only mode for the provider.

enable_readonly()

Enables read-only mode for the provider.
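The three read-only methods above can be pictured as a flag plus a guard that write methods call first. The following is a minimal sketch under that assumption; `ReadOnlyMixin`, `GuardedStore`, and the local `ReadOnlyError` class are hypothetical stand-ins, not the library's own classes.

```python
class ReadOnlyError(Exception):
    """Stand-in for the ReadOnlyError named in this section."""


class ReadOnlyMixin:
    read_only = False

    def enable_readonly(self) -> None:
        self.read_only = True

    def disable_readonly(self) -> None:
        self.read_only = False

    def check_readonly(self) -> None:
        if self.read_only:
            raise ReadOnlyError("provider is in read-only mode")


class GuardedStore(ReadOnlyMixin):
    def __init__(self) -> None:
        self.data = {}

    def __setitem__(self, path: str, value: bytes) -> None:
        self.check_readonly()  # refuse writes while read-only
        self.data[path] = value


store = GuardedStore()
store["x"] = b"1"
store.enable_readonly()
try:
    store["y"] = b"2"
    blocked = False
except ReadOnlyError:
    blocked = True
```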

flush()

Only needs to be implemented for caches. Flushes the data to the next storage provider. Should be a no-op for base storage providers such as local, S3, Azure, and GCS.

get_bytes(path: str, start_byte: int | None = None, end_byte: int | None = None)

Gets the object present at the path within the given byte range.

Parameters:
  • path (str) – The path relative to the root of the provider.

  • start_byte (int, optional) – If only specific bytes starting from start_byte are required.

  • end_byte (int, optional) – If only specific bytes up to end_byte are required.

Returns:

The bytes of the object present at the path within the given byte range.

Return type:

bytes

Raises:
  • InvalidBytesRequestedError – If start_byte > end_byte or start_byte < 0 or end_byte < 0.

  • KeyError – If an object is not found at the path.
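The byte-range rules above can be sketched on a plain `bytes` value. This is an illustration only: `read_range` is a hypothetical helper, the exception class is a local stand-in for the one named above, and treating `end_byte` as an exclusive bound (as in a Python slice) is an assumption rather than something this section states.

```python
from typing import Optional


class InvalidBytesRequestedError(Exception):
    """Stand-in for the error named in this section."""


def read_range(data: bytes, start_byte: Optional[int] = None,
               end_byte: Optional[int] = None) -> bytes:
    # Reject the three cases the Raises list describes.
    if start_byte is not None and start_byte < 0:
        raise InvalidBytesRequestedError("start_byte < 0")
    if end_byte is not None and end_byte < 0:
        raise InvalidBytesRequestedError("end_byte < 0")
    if start_byte is not None and end_byte is not None and start_byte > end_byte:
        raise InvalidBytesRequestedError("start_byte > end_byte")
    # Assumption: end_byte is exclusive, as in a Python slice.
    return data[start_byte:end_byte]


payload = b"0123456789"
middle = read_range(payload, 2, 5)
tail = read_range(payload, start_byte=7)
whole = read_range(payload)
```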

maybe_flush()

Flush cache if autoflush has been enabled. Called at the end of methods which write data, to ensure consistency as a default.

set_bytes(path: str, value: bytes, start_byte: int | None = None, overwrite: bool | None = False)

Sets the object at the path to the given value.

Parameters:
  • path (str) – the path relative to the root of the provider.

  • value (bytes) – the value to be assigned at the path.

  • start_byte (int, optional) – If only specific bytes starting from start_byte are to be assigned.

  • overwrite (boolean, optional) – If True, any object already present at the path is completely overwritten, without fetching its data.

Raises:

LRU Cache

class deeplake.core.storage.LRUCache

Bases: StorageProvider

LRU cache that uses a StorageProvider as its caching layer.

__delitem__(path: str)

Deletes the object present at the path from the cache and the underlying storage.

Parameters:

path (str) – the path to the object relative to the root of the provider.

Raises:
  • KeyError – If an object is not found at the path.

  • ReadOnlyError – If the provider is in read-only mode.

__getitem__(path: str)

If item is in cache_storage, retrieves from there and returns. If item isn’t in cache_storage, retrieves from next storage, stores in cache_storage (if possible) and returns.

Parameters:

path (str) – The path relative to the root of the underlying storage.

Raises:

KeyError – if an object is not found at the path.

Returns:

The bytes of the object present at the path.

Return type:

bytes

__getstate__() → Dict[str, Any]

Returns the state of the cache, for pickling

__init__(cache_storage: StorageProvider, next_storage: StorageProvider | None, cache_size: int)

Initializes the LRUCache. It can be chained with other LRUCache objects to create multilayer caches.

Parameters:
  • cache_storage (StorageProvider) – The storage being used as the caching layer of the cache. This should be a base provider such as MemoryProvider, LocalProvider or S3Provider but not another LRUCache.

  • next_storage (StorageProvider) – The next storage layer of the cache. This can either be a base provider (i.e. it is the final storage) or another LRUCache (i.e. in case of chained cache). While reading data, all misses from cache would be retrieved from here. While writing data, the data will be written to the next_storage when cache_storage is full or flush is called.

  • cache_size (int) – The total space that can be used from the cache_storage in bytes. This number may be less than the actual space available on the cache_storage. Setting it to a higher value than actually available space may lead to unexpected behaviors.
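To make the roles of cache_storage, next_storage, and cache_size concrete, here is a minimal write-back LRU sketch. It is not the Deep Lake implementation: `TinyLRUCache` is a hypothetical class, both layers are plain dictionaries rather than StorageProvider objects, and the eviction and write-through policies are assumptions based on the parameter descriptions above.

```python
from collections import OrderedDict
from typing import Dict, Set


class TinyLRUCache:
    """Hypothetical write-back LRU layer; not the Deep Lake implementation."""

    def __init__(self, next_storage: Dict[str, bytes], cache_size: int) -> None:
        self.cache_storage: "OrderedDict[str, bytes]" = OrderedDict()
        self.next_storage = next_storage
        self.cache_size = cache_size  # budget in bytes
        self.cache_used = 0
        self.dirty_keys: Set[str] = set()

    def _evict_one(self) -> None:
        # Drop the least recently used entry, writing it onward if unsaved.
        path, value = self.cache_storage.popitem(last=False)
        self.cache_used -= len(value)
        if path in self.dirty_keys:
            self.next_storage[path] = value
            self.dirty_keys.discard(path)

    def _insert(self, path: str, value: bytes) -> None:
        if path in self.cache_storage:
            self.cache_used -= len(self.cache_storage.pop(path))
        while self.cache_storage and self.cache_used + len(value) > self.cache_size:
            self._evict_one()
        self.cache_storage[path] = value
        self.cache_used += len(value)

    def __getitem__(self, path: str) -> bytes:
        if path in self.cache_storage:
            self.cache_storage.move_to_end(path)  # mark as most recently used
            return self.cache_storage[path]
        value = self.next_storage[path]  # miss: KeyError if absent everywhere
        if len(value) <= self.cache_size:
            self._insert(path, value)
        return value

    def __setitem__(self, path: str, value: bytes) -> None:
        if len(value) <= self.cache_size:
            self._insert(path, value)
            self.dirty_keys.add(path)
        else:
            # Too large to cache: drop any stale copy and write straight through.
            if path in self.cache_storage:
                self.cache_used -= len(self.cache_storage.pop(path))
            self.dirty_keys.discard(path)
            self.next_storage[path] = value

    def flush(self) -> None:
        # Only keys written since the last flush or eviction are sent onward.
        for path in list(self.dirty_keys):
            self.next_storage[path] = self.cache_storage[path]
        self.dirty_keys.clear()


backing: Dict[str, bytes] = {}
cache = TinyLRUCache(backing, cache_size=8)
cache["a"] = b"1234"
cache["b"] = b"5678"   # the cache is now full (8 bytes)
cache["c"] = b"9"      # evicts "a", the least recently used key
evicted_to_backing = dict(backing)
cache.flush()          # writes the remaining dirty keys "b" and "c"
```

In this sketch, data reaches the backing dictionary in only three ways: eviction, an explicit flush, or a value too large for the cache.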

__iter__()

Generator function that iterates over the keys of the cache and the underlying storage.

Yields:

str – the path of the object that it is iterating over, relative to the root of the provider.

__len__()

Returns the number of files present in the cache and the underlying storage.

Returns:

the number of files present inside the root.

Return type:

int

__setitem__(path: str, value: bytes | DeepLakeMemoryObject)

Puts the item in the cache_storage (if possible), else writes to next_storage.

Parameters:
  • path (str) – the path relative to the root of the underlying storage.

  • value (bytes) – the value to be assigned at the path.

Raises:

ReadOnlyError – If the provider is in read-only mode.

__setstate__(state: Dict[str, Any])

Recreates a cache with the same configuration as the state.

Parameters:

state (dict) – The state to be used to recreate the cache.

Note

While restoring the cache, its contents are reset. Even if the cache storage was local or S3 and is still accessible after unpickling (on the same machine, or with the same S3 credentials, respectively), the earlier cache contents are no longer accessible.

_all_keys()

Helper function that lists all the objects present in the cache and the underlying storage.

Returns:

set of all the objects found in the cache and the underlying storage.

Return type:

set

_flush_if_not_read_only()

Flushes the cache if not in read-only mode.

_forward(path)

Forwards the value at a given path to the next storage, and un-marks its key.

_forward_value(path, value)

Forwards a path-value pair to the next storage, and un-marks its key.

Parameters:
  • path (str) – the path to the object relative to the root of the provider.

  • value (bytes, DeepLakeMemoryObject) – the value to send to the next storage.

_free_up_space(extra_size: int)

Helper function that frees up the required space in the cache.

No action is taken if there is sufficient space in the cache.

Parameters:

extra_size (int) – the space that is required, in bytes.

_insert_in_cache(path: str, value: bytes | DeepLakeMemoryObject)

Helper function that adds a key value pair to the cache.

Parameters:
  • path (str) – the path relative to the root of the underlying storage.

  • value (bytes) – the value to be assigned at the path.

Raises:

ReadOnlyError – If the provider is in read-only mode.

_pop_from_cache()

Helper function that pops the least recently used key-value pair from the cache.

clear(prefix='')

Deletes ALL the data from all the layers of the cache and the actual storage. This is an IRREVERSIBLE operation. Data once deleted can not be recovered.

clear_cache()

Flushes the contents of all the cache layers if not in read-only mode, and then deletes the contents of all those layers. This doesn’t delete data from the actual storage.

clear_deeplake_objects()

Removes all DeepLakeMemoryObjects from the cache.

flush()

Writes data from cache_storage to next_storage. Only the dirty keys are written. This is a cascading function and leads to data being written to the final storage in case of a chained cache.
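The cascading behaviour described above can be sketched with a chain of simple layers. `Layer` is a hypothetical class used only for illustration; it tracks dirty keys and, on flush, hands them to the next layer and then asks that layer to flush in turn.

```python
from typing import Dict, Optional, Set


class Layer:
    """Hypothetical storage layer: a dict plus an optional next layer."""

    def __init__(self, next_layer: Optional["Layer"] = None) -> None:
        self.data: Dict[str, bytes] = {}
        self.dirty: Set[str] = set()
        self.next_layer = next_layer

    def write(self, path: str, value: bytes) -> None:
        self.data[path] = value
        if self.next_layer is not None:
            self.dirty.add(path)  # not yet persisted further down

    def flush(self) -> None:
        if self.next_layer is None:
            return  # final storage: flushing is a no-op
        for path in sorted(self.dirty):
            self.next_layer.write(path, self.data[path])
        self.dirty.clear()
        self.next_layer.flush()  # cascade toward the final storage


final = Layer()
disk_cache = Layer(next_layer=final)
memory_cache = Layer(next_layer=disk_cache)

memory_cache.write("chunk", b"data")
before = ("chunk" in disk_cache.data, "chunk" in final.data)
memory_cache.flush()
after = ("chunk" in disk_cache.data, "chunk" in final.data)
```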

get_bytes(path: str, start_byte: int | None = None, end_byte: int | None = None)

Gets the object present at the path within the given byte range.

Parameters:
  • path (str) – The path relative to the root of the provider.

  • start_byte (int, optional) – If only specific bytes starting from start_byte are required.

  • end_byte (int, optional) – If only specific bytes up to end_byte are required.

Returns:

The bytes of the object present at the path within the given byte range.

Return type:

bytes

Raises:
  • InvalidBytesRequestedError – If start_byte > end_byte or start_byte < 0 or end_byte < 0.

  • KeyError – If an object is not found at the path.

get_deeplake_object(path: str, expected_class, meta: Dict | None = None, url=False, partial_bytes: int = 0)

If the data at path was stored using the output of a DeepLakeMemoryObject’s tobytes function, this function will read it back into object form & keep the object in cache.

Parameters:
  • path (str) – Path to the stored object.

  • expected_class (callable) – The expected subclass of DeepLakeMemoryObject.

  • meta (dict, optional) – Metadata associated with the stored object

  • url (bool) – Get presigned url instead of downloading chunk (only for videos)

  • partial_bytes (int) – Number of bytes to read from the beginning of the file. If 0, reads the whole file. Defaults to 0.

Raises:
  • ValueError – If the incorrect expected_class was provided.

  • ValueError – If the type of the data at path is invalid.

  • ValueError – If url is True but expected_class is not a subclass of BaseChunk.

Returns:

An instance of expected_class populated with the data.

register_deeplake_object(path: str, obj: DeepLakeMemoryObject)

Registers a new object in the cache.

remove_deeplake_object(path: str)

Removes a DeepLakeMemoryObject from the cache.

S3 Storage Provider

class deeplake.core.storage.S3Provider

Bases: StorageProvider

Provider class for using S3 storage.

__delitem__(path)

Delete the object present at the path.

Parameters:

path (str) – the path to the object relative to the root of the S3Provider.

Note

If the object is not found, s3 won’t raise KeyError.

Raises:
  • S3DeletionError – Any S3 error encountered while deleting the object.

  • ReadOnlyError – If the provider is in read-only mode.

__getitem__(path)

Gets the object present at the path.

Parameters:

path (str) – the path relative to the root of the S3Provider.

Returns:

The bytes of the object present at the path.

Return type:

bytes

Raises:
  • KeyError – If an object is not found at the path.

  • S3GetError – Any error other than KeyError while retrieving the object.

__init__(root: str, aws_access_key_id: str | None = None, aws_secret_access_key: str | None = None, aws_session_token: str | None = None, endpoint_url: str | None = None, aws_region: str | None = None, profile_name: str | None = None, token: str | None = None)

Initializes the S3Provider

Example

>>> s3_provider = S3Provider("snark-test/benchmarks")

Parameters:
  • root (str) – The root of the provider. All read/write request keys will be appended to root.

  • aws_access_key_id (str, optional) – Specifies the AWS access key used as part of the credentials to authenticate the user.

  • aws_secret_access_key (str, optional) – Specifies the AWS secret key used as part of the credentials to authenticate the user.

  • aws_session_token (str, optional) – Specifies an AWS session token used as part of the credentials to authenticate the user.

  • endpoint_url (str, optional) – The complete URL to use for the constructed client. This needs to be provided for cases in which you’re interacting with MinIO, Wasabi, etc.

  • aws_region (str, optional) – Specifies the AWS Region to send requests to.

  • profile_name (str, optional) – Specifies the AWS profile name to use.

  • token (str, optional) – Activeloop token, used for fetching credentials for Deep Lake datasets (if this is the underlying storage for a Deep Lake dataset). This is optional; tokens are normally autogenerated.

__iter__()

Generator function that iterates over the keys of the S3Provider.

Yields:

str – the name of the object that it is iterating over.

__len__()

Returns the number of files present at the root of the S3Provider.

Note

This is an expensive operation.

Returns:

the number of files present inside the root.

Return type:

int

Raises:

S3ListError – Any S3 error encountered while listing the objects.

__setitem__(path, content)

Sets the object at the path to the given value.

Parameters:
  • path (str) – the path relative to the root of the S3Provider.

  • content (bytes) – the value to be assigned at the path.

Raises:
  • S3SetError – Any S3 error encountered while setting the value at the path.

  • ReadOnlyError – If the provider is in read-only mode.

_all_keys()

Helper function that lists all the objects present at the root of the S3Provider.

Returns:

set of all the objects found at the root of the S3Provider.

Return type:

set

Raises:

S3ListError – Any S3 error encountered while listing the objects.

_check_update_creds(force=False)

If the client has an expiration time, check if creds are expired and fetch new ones. This would only happen for datasets stored on Deep Lake storage for which temporary 12 hour credentials are generated.
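A check-and-refresh step of this kind might look like the sketch below. Everything here is hypothetical: `RefreshingCreds`, the `fetch` callable, the injectable `clock`, and the dictionary layout are illustrative assumptions, not the S3Provider internals. Only the 12-hour lifetime comes from the description above.

```python
import time
from typing import Any, Callable, Dict, Optional


class RefreshingCreds:
    """Hypothetical holder for temporary credentials with an expiry time."""

    def __init__(self, fetch: Callable[[], Dict[str, Any]],
                 clock: Callable[[], float] = time.time) -> None:
        self._fetch = fetch  # returns {"token": ..., "expiration": ...}
        self._clock = clock
        self.creds: Optional[Dict[str, Any]] = None

    def check_update(self, force: bool = False) -> bool:
        """Refetch when forced, never fetched, or expired; report whether it did."""
        expired = self.creds is None or self._clock() >= self.creds["expiration"]
        if force or expired:
            self.creds = self._fetch()
            return True
        return False


now = [0.0]   # fake clock so the example does not have to wait 12 hours
calls = []


def fake_fetch() -> Dict[str, Any]:
    calls.append(now[0])
    # 12 hours, matching the lifetime mentioned above
    return {"token": f"t{len(calls)}", "expiration": now[0] + 12 * 3600}


holder = RefreshingCreds(fake_fetch, clock=lambda: now[0])
first = holder.check_update()    # nothing cached yet, so it fetches
second = holder.check_update()   # still valid, no fetch
now[0] += 12 * 3600
third = holder.check_update()    # expired, fetches again
```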

_set_hub_creds_info(hub_path: str, expiration: str)

Sets the tag and expiration of the credentials. These are only relevant to datasets using Deep Lake storage. This info is used to fetch new credentials when the temporary 12 hour credentials expire.

Parameters:
  • hub_path (str) – The Deep Lake cloud path to the dataset.

  • expiration (str) – The time at which the credentials expire.

_state_keys()

Keys used to store the state of the provider.

clear(prefix='')

Deletes ALL data with keys having given prefix on the s3 bucket (under self.root).

Warning

Exercise caution!
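To show why the warning matters, the sketch below applies prefix deletion to a dictionary that stands in for a bucket. `clear_prefix` is a hypothetical helper, and the way it joins the root and the prefix is an assumption. Note that an empty prefix removes everything under the root.

```python
from typing import Dict


def clear_prefix(store: Dict[str, bytes], root: str, prefix: str = "") -> int:
    """Delete every key that starts with root + "/" + prefix; return the count."""
    full_prefix = root.rstrip("/") + "/" + prefix
    doomed = [key for key in store if key.startswith(full_prefix)]
    for key in doomed:
        del store[key]
    return len(doomed)


bucket = {
    "ds/images/chunk0": b"a",
    "ds/images/chunk1": b"b",
    "ds/labels/chunk0": b"c",
    "other/images/chunk0": b"d",
}
removed = clear_prefix(bucket, root="ds", prefix="images/")
removed_all = clear_prefix(bucket, root="ds")  # empty prefix wipes the root
```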

get_bytes(path: str, start_byte: int | None = None, end_byte: int | None = None)

Gets the object present at the path within the given byte range.

Parameters:
  • path (str) – The path relative to the root of the provider.

  • start_byte (int, optional) – If only specific bytes starting from start_byte are required.

  • end_byte (int, optional) – If only specific bytes up to end_byte are required.

Returns:

The bytes of the object present at the path within the given byte range.

Return type:

bytes

Raises:
  • InvalidBytesRequestedError – If start_byte > end_byte or start_byte < 0 or end_byte < 0.

  • KeyError – If an object is not found at the path.

  • S3GetError – Any error other than KeyError while retrieving the object.

need_to_reload_creds(err: ClientError) → bool

Checks if the credentials need to be reloaded. This happens if the credentials were loaded from the environment and have now expired.

rename(root)

Rename root folder.

Google Cloud Storage Provider

class deeplake.core.storage.GCSProvider

Bases: StorageProvider

Provider class for using Google Cloud Storage (GCS).

__contains__(key)

Checks if key exists in mapping.

__delitem__(key)

Remove key.

__getitem__(key)

Retrieve data.

__init__(root: str, token: str | Dict | None = None, project: str | None = None)

Initializes the GCSProvider.

Example

>>> gcs_provider = GCSProvider("gcs://my-bucket/gcs_ds")

Parameters:
  • root (str) – The root of the provider. All read/write request keys will be appended to root.

  • token (str/Dict) – GCP token, used for fetching credentials for storage. Can be a path to the credentials file, the actual credentials dictionary, or one of the following:

      • google_default: Tries to load default credentials for the specified project.

      • cache: Retrieves the previously used credentials from the cache, if they exist.

      • anon: Sets credentials=None.

      • browser: Generates and stores a new token file using the CLI.

  • project (str) – Name of the project from GCloud.

Raises:

ModuleNotFoundError – If google cloud packages aren’t installed.

__iter__()

Iterating over the structure.

__len__()

Returns length of the structure.

__setitem__(key, value)

Store value in key.

_all_keys()

Returns the set of all keys of the provider.

Returns:

set of all keys present at the root of the provider.

Return type:

set

_set_hub_creds_info(hub_path: str, expiration: str)

Sets the tag and expiration of the credentials. These are only relevant to datasets using Deep Lake storage. This info is used to fetch new credentials when the temporary 12 hour credentials expire.

Parameters:
  • hub_path (str) – The Deep Lake cloud path to the dataset.

  • expiration (str) – The time at which the credentials expire.

clear(prefix='')

Remove all keys with given prefix below root - empties out mapping.

Warning

Exercise caution!

get_bytes(path: str, start_byte: int | None = None, end_byte: int | None = None)

Gets the object present at the path within the given byte range.

Parameters:
  • path (str) – The path relative to the root of the provider.

  • start_byte (int, optional) – If only specific bytes starting from start_byte are required.

  • end_byte (int, optional) – If only specific bytes up to end_byte are required.

Returns:

The bytes of the object present at the path within the given byte range.

Return type:

bytes

Raises:
  • InvalidBytesRequestedError – If start_byte > end_byte or start_byte < 0 or end_byte < 0.

  • KeyError – If an object is not found at the path.

rename(root)

Rename root folder.

Google Drive Storage Provider

class deeplake.core.storage.GDriveProvider

Bases: StorageProvider

Provider class for using Google Drive storage.

__delitem__(path)

Delete the object present at the path.

Parameters:

path (str) – the path to the object relative to the root of the provider.

Raises:

KeyError – If an object is not found at the path.

__getitem__(path)

Gets the object present at the path.

Parameters:

path (str) – The path relative to the root of the provider.

Returns:

The bytes of the object present at the path.

Return type:

bytes

Raises:

KeyError – If an object is not found at the path.

__init__(root: str, token: str | Dict | None = None, makemap: bool = True)

Initializes the GDriveProvider

Example

>>> gdrive_provider = GDriveProvider("gdrive://folder_name/folder_name")

Parameters:
  • root (str) – The root of the provider. All read/write request keys will be appended to root.

  • token (dict, str, optional) – Google Drive token. Can be path to the token file or the actual credentials dictionary.

  • makemap (bool) – Creates path to id map if True.

Note

  • Requires client_secrets.json in working directory if token is not provided.

  • Due to limits on requests per 100 seconds in the Google Drive API, continuous requests such as uploading many small files can be slow.

  • Users can request an increase to their quotas on Google Cloud Platform.

__iter__()

Generator function that iterates over the keys of the provider.

Yields:

str – the path of the object that it is iterating over, relative to the root of the provider.

__len__()

Returns the number of files present inside the root of the provider.

Returns:

the number of files present inside the root.

Return type:

int

__setitem__(path, content)

Sets the object at the path to the given value.

Parameters:
  • path (str) – the path relative to the root of the provider.

  • content (bytes) – the value to be assigned at the path.

_all_keys()

Returns the set of all keys of the provider.

Returns:

set of all keys present at the root of the provider.

Return type:

set

clear(prefix='')

Delete the contents of the provider.

sync()

Syncs the provider keys with the actual storage.

Local Storage Provider

class deeplake.core.storage.LocalProvider

Bases: StorageProvider

Provider class for using the local filesystem.

__delitem__(path: str)

Delete the object present at the path.

Example

>>> local_provider = LocalProvider("/home/ubuntu/Documents/")
>>> del local_provider["abc.txt"]

Parameters:

path (str) – the path to the object relative to the root of the provider.

Raises:
  • KeyError – If an object is not found at the path.

  • DirectoryAtPathException – If a directory is found at the path.

  • Exception – Any other exception encountered while trying to fetch the object.

  • ReadOnlyError – If the provider is in read-only mode.

__getitem__(path: str)

Gets the object present at the path.

Example

>>> local_provider = LocalProvider("/home/ubuntu/Documents/")
>>> my_data = local_provider["abc.txt"]

Parameters:

path (str) – The path relative to the root of the provider.

Returns:

The bytes of the object present at the path.

Return type:

bytes

Raises:
  • KeyError – If an object is not found at the path.

  • DirectoryAtPathException – If a directory is found at the path.

  • Exception – Any other exception encountered while trying to fetch the object.

__init__(root: str)

Initializes the LocalProvider.

Example

>>> local_provider = LocalProvider("/home/ubuntu/Documents/")

Parameters:

root (str) – The root of the provider. All read/write request keys will be appended to root.

Raises:

FileAtPathException – If the root is a file instead of a directory.

__iter__()

Generator function that iterates over the keys of the provider.

Example

>>> local_provider = LocalProvider("/home/ubuntu/Documents/")
>>> for my_data in local_provider:
...    pass

Yields:

str – the path of the object that it is iterating over, relative to the root of the provider.

__len__()

Returns the number of files present inside the root of the provider.

Example

>>> local_provider = LocalProvider("/home/ubuntu/Documents/")
>>> len(local_provider)

Returns:

the number of files present inside the root.

Return type:

int

__setitem__(path: str, value: bytes)

Sets the object at the path to the given value.

Example

>>> local_provider = LocalProvider("/home/ubuntu/Documents/")
>>> local_provider["abc.txt"] = b"abcd"

Parameters:
  • path (str) – the path relative to the root of the provider.

  • value (bytes) – the value to be assigned at the path.

Raises:
  • Exception – If unable to set item due to directory at path or permission or space issues.

  • FileAtPathException – If the directory to the path is a file instead of a directory.

  • ReadOnlyError – If the provider is in read-only mode.

_all_keys(refresh: bool = False) → Set[str]

Lists all the objects present at the root of the Provider.

Parameters:

refresh (bool) – refresh keys

Returns:

set of all the objects found at the root of the Provider.

Return type:

set

_check_is_file(path: str)

Checks if the path is a file. Returns the full_path to file if True.

Parameters:

path (str) – the path to the object relative to the root of the provider.

Returns:

the full path to the requested file.

Return type:

str

Raises:

DirectoryAtPathException – If a directory is found at the path.

clear(prefix='')

Deletes ALL data with keys having given prefix on the local machine (under self.root). Exercise caution!

get_bytes(path: str, start_byte: int | None = None, end_byte: int | None = None)

Gets the object present at the path within the given byte range.

Parameters:
  • path (str) – The path relative to the root of the provider.

  • start_byte (int, optional) – If only specific bytes starting from start_byte are required.

  • end_byte (int, optional) – If only specific bytes up to end_byte are required.

Returns:

The bytes of the object present at the path within the given byte range.

Return type:

bytes

Raises:
  • InvalidBytesRequestedError – If start_byte > end_byte or start_byte < 0 or end_byte < 0.

  • KeyError – If an object is not found at the path.

rename(path)

Renames the root folder.

Memory Provider

class deeplake.core.storage.MemoryProvider

Bases: StorageProvider

Provider class for using in-memory storage.

__delitem__(path: str)

Delete the object present at the path.

Example

>>> memory_provider = MemoryProvider("xyz")
>>> del memory_provider["abc.txt"]

Parameters:

path (str) – the path to the object relative to the root of the provider.

Raises:
  • KeyError – If an object is not found at the path.

  • ReadOnlyError – If the provider is in read-only mode.

__getitem__(path: str)

Gets the object present at the path.

Example

>>> memory_provider = MemoryProvider("xyz")
>>> my_data = memory_provider["abc.txt"]

Parameters:

path (str) – The path relative to the root of the provider.

Returns:

The bytes of the object present at the path.

Return type:

bytes

Raises:

KeyError – If an object is not found at the path.

__getstate__() → str

Does NOT save the in-memory data in the state.
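The consequence of leaving the data out of the saved state can be sketched with the same hooks that `pickle` uses. `VolatileStore` is a hypothetical class, and returning only the root string as the state is an assumption suggested by the `str` return type above. The hooks are called directly here to keep the example self-contained.

```python
from typing import Dict


class VolatileStore:
    """Hypothetical in-memory store whose saved state omits its contents."""

    def __init__(self, root: str = "") -> None:
        self.root = root
        self.data: Dict[str, bytes] = {}

    def __getstate__(self) -> str:
        return self.root  # configuration only; self.data is left out

    def __setstate__(self, state: str) -> None:
        self.__init__(root=state)  # start again with an empty store


original = VolatileStore("xyz")
original.data["abc.txt"] = b"abcd"

state = original.__getstate__()
restored = VolatileStore.__new__(VolatileStore)
restored.__setstate__(state)
```

The restored object keeps its root but starts empty, so anything that must survive serialization has to live in a persistent provider instead.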

__init__(root: str = '')

__iter__()

Generator function that iterates over the keys of the provider.

Example

>>> memory_provider = MemoryProvider("xyz")
>>> for my_data in memory_provider:
...    pass

Yields:

str – the path of the object that it is iterating over, relative to the root of the provider.

__len__()

Returns the number of files present inside the root of the provider.

Example

>>> memory_provider = MemoryProvider("xyz")
>>> len(memory_provider)

Returns:

the number of files present inside the root.

Return type:

int

__setitem__(path: str, value: bytes)

Sets the object at the path to the given value.

Example

>>> memory_provider = MemoryProvider("xyz")
>>> memory_provider["abc.txt"] = b"abcd"

Parameters:
  • path (str) – the path relative to the root of the provider.

  • value (bytes) – the value to be assigned at the path.

Raises:

ReadOnlyError – If the provider is in read-only mode.

_all_keys()

Lists all the objects present at the root of the Provider.

Returns:

set of all the objects found at the root of the Provider.

Return type:

set

clear(prefix='')

Clears the provider.