Dataloader

deeplake.experimental.dataloader(dataset) → DeepLakeDataLoader

Returns a DeepLakeDataLoader object, which can be configured to yield either PyTorch tensors or numpy arrays.

Parameters: dataset – Dataset object on which the dataloader is built.
Returns: A DeepLakeDataLoader object.
Return type: DeepLakeDataLoader

Examples

Creating a simple dataloader object that returns batches of numpy arrays

>>> import deeplake
>>> from deeplake.experimental import dataloader
>>>
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> train_loader = dataloader(ds_train).numpy()
>>> for i, data in enumerate(train_loader):
...     # custom logic on data
...     pass

Creating a dataloader with a custom transformation and batch size

>>> import torch
>>> from torchvision import datasets, transforms, models
>>>
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> tform = transforms.Compose([
...     transforms.ToPILImage(),  # Must convert to PIL image for subsequent operations to run
...     transforms.RandomRotation(20),  # Image augmentation
...     transforms.ToTensor(),  # Must convert to pytorch tensor for subsequent operations to run
...     transforms.Normalize([0.5], [0.5]),
... ])
>>>
>>> batch_size = 32
>>> # Chain transform, batch, and shuffle, then build a loader that returns batches of pytorch tensors.
>>> # The parentheses are required so the chained calls can span several lines.
>>> train_loader = (
...     dataloader(ds_train)
...     .transform({'images': tform, 'labels': None})
...     .batch(batch_size)
...     .shuffle()
...     .pytorch()
... )
>>>
>>> #loop over the elements
>>> for i, data in enumerate(train_loader):
...     # custom logic on data
...     pass

Creating a dataloader and chaining it with a query

>>> ds = deeplake.load('hub://activeloop/coco-train')
>>> dl = (
...     dataloader(ds)
...     .query("(select * where contains(categories, 'car') limit 1000) union (select * where contains(categories, 'motorcycle') limit 1000)")
...     .pytorch()
... )
>>>
>>> # Loop over the elements
>>> for i, data in enumerate(dl):
...     # custom logic on data
...     pass

class deeplake.experimental.DeepLakeDataLoader(dataset, _batch_size=None, _shuffle=None, _num_threads=None, _num_workers=None, _collate=None, _transform=None, _distributed=None, _prefetch_factor=None, _tensors=None, _drop_last=False, _mode=None, _return_index=None, _primary_tensor_name=None, _buffer_size=None, _tobytes=None)

batch(batch_size: int, drop_last: bool = False)

Returns a batched DeepLakeDataLoader object.

Parameters:

batch_size (int) – Number of samples in each batch.
drop_last (bool) – If True, the last batch will be dropped if its size is less than batch_size. Defaults to False.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Raises:

ValueError – If .batch() has already been called.
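
The effect of drop_last can be illustrated without Deep Lake. The following is a minimal sketch in plain Python; iter_batches is a hypothetical helper written for this illustration and is not part of the library, so it shows only the batching semantics described above, not how DeepLakeDataLoader implements them.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def iter_batches(samples: Iterable[T], batch_size: int, drop_last: bool = False) -> Iterator[List[T]]:
    """Group samples into lists of batch_size (illustrative helper, not part of Deep Lake)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be a positive integer")
    batch: List[T] = []
    for sample in samples:
        batch.append(sample)
        if len(batch) == batch_size:
            yield batch
            batch = []
    # A trailing partial batch is kept unless drop_last is set.
    if batch and not drop_last:
        yield batch


sizes_keep = [len(b) for b in iter_batches(range(10), batch_size=4)]
sizes_drop = [len(b) for b in iter_batches(range(10), batch_size=4, drop_last=True)]
print(sizes_keep)  # [4, 4, 2]
print(sizes_drop)  # [4, 4]
```

With 10 samples and a batch size of 4, the final batch holds only 2 samples; drop_last=True discards it, which is useful when downstream code expects every batch to have the same size.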

numpy(num_workers: int = 0, tensors: List[str] | None = None, num_threads: int | None = None, prefetch_factor: int = 2, tobytes: bool | Sequence[str] = False)

Returns a DeepLakeDataLoader object that yields batches of numpy arrays.

Parameters:

num_workers (int) – Number of workers to use for transforming and processing the data. Defaults to 0.
tensors (List[str], Optional) – List of tensors to load. If None, all tensors are loaded. Defaults to None.
num_threads (int, Optional) – Number of threads to use for fetching and decompressing the data. If None, the number of threads is automatically determined. Defaults to None.
prefetch_factor (int) – Number of batches to transform and collate in advance per worker. Defaults to 2.
tobytes (bool, Sequence[str]) – If True, samples will not be decompressed and their raw bytes will be returned instead of numpy arrays. Can also be a list of tensor names, in which case only those tensors will be left undecompressed. Defaults to False.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Raises:

ValueError – If .pytorch() or .numpy() has already been called.

pytorch(num_workers: int = 0, collate_fn: Callable | None = None, tensors: List[str] | None = None, num_threads: int | None = None, prefetch_factor: int = 2, distributed: bool = False, return_index: bool = True, tobytes: bool | Sequence[str] = False)

Returns a DeepLakeDataLoader object that yields batches of pytorch tensors.

Parameters:

num_workers (int) – Number of workers to use for transforming and processing the data. Defaults to 0.
collate_fn (Callable, Optional) – Merges a list of samples to form a mini-batch of Tensor(s). Defaults to None.
tensors (List[str], Optional) – List of tensors to load. If None, all tensors are loaded. Defaults to None.
num_threads (int, Optional) – Number of threads to use for fetching and decompressing the data. If None, the number of threads is automatically determined. Defaults to None.
prefetch_factor (int) – Number of batches to transform and collate in advance per worker. Defaults to 2.
distributed (bool) – Used for DDP training. Distributes different sections of the dataset to different ranks. Defaults to False.
return_index (bool) – Whether the loader should return the sample index along with the data. Defaults to True.
tobytes (bool, Sequence[str]) – If True, samples will not be decompressed and their raw bytes will be returned instead of numpy arrays. Can also be a list of tensor names, in which case only those tensors will be left undecompressed. Defaults to False.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Raises:

ValueError – If .pytorch() or .numpy() has already been called.
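
To make the role of collate_fn concrete, here is a minimal sketch of a function that merges a list of samples into one mini-batch. It makes two assumptions that the text above does not confirm: that each sample arrives as a dictionary keyed by tensor name, and that all samples of a tensor share the same shape. It uses numpy arrays rather than pytorch tensors to keep the example dependency-light, and stack_collate is a hypothetical name.

```python
from typing import Dict, List

import numpy as np


def stack_collate(samples: List[Dict[str, np.ndarray]]) -> Dict[str, np.ndarray]:
    """Stack each tensor across samples along a new leading batch axis.

    Assumption: every sample is a dict with the same keys and matching shapes.
    """
    return {name: np.stack([s[name] for s in samples]) for name in samples[0]}


# Four fake 28x28 grayscale samples, each with a single label.
samples = [
    {"images": np.zeros((28, 28), dtype=np.uint8), "labels": np.array([i])}
    for i in range(4)
]
batch = stack_collate(samples)
print(batch["images"].shape)  # (4, 28, 28)
print(batch["labels"].shape)  # (4, 1)
```

A custom collate function is mainly needed when samples cannot be stacked directly, for example when images differ in size and must be padded or resized first.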

query(query_string: str)

Returns a sliced DeepLakeDataLoader object containing the results of the given query. This allows SQL-like queries to be run on the dataset. See the Tensor Query Language documentation for the supported keywords.

Parameters: query_string (str) – An SQL-like query string, extended with additional functionality, to run on the dataset object.
Returns: A DeepLakeDataLoader object.
Return type: DeepLakeDataLoader

Examples

>>> import deeplake
>>> from deeplake.experimental import dataloader
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> query_ds_train = dataloader(ds_train).query("select * where labels != 5")

>>> import deeplake
>>> from deeplake.experimental import dataloader
>>> ds_train = deeplake.load('hub://activeloop/coco-train')
>>> query_ds_train = dataloader(ds_train).query("(select * where contains(categories, 'car') limit 1000) union (select * where contains(categories, 'motorcycle') limit 1000)")
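
Long union queries like the one above are easier to maintain when built programmatically. The sketch below composes the same query string from a list of values; union_of_limited is a hypothetical helper, not part of Deep Lake, and it only performs string formatting. It assumes the values are trusted, since no escaping is applied.

```python
from typing import Sequence


def union_of_limited(column: str, values: Sequence[str], limit: int) -> str:
    """Build a query that unions one limited `contains` subquery per value.

    Hypothetical helper: values are interpolated as-is, so pass trusted input only.
    """
    if not values:
        raise ValueError("at least one value is required")
    parts = [
        f"(select * where contains({column}, '{value}') limit {limit})"
        for value in values
    ]
    return " union ".join(parts)


query_string = union_of_limited("categories", ["car", "motorcycle"], limit=1000)
print(query_string)
```

The resulting string is identical to the one in the example above and can be passed to query() unchanged.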

shuffle(shuffle: bool = True, buffer_size: int = 2048)

Returns a shuffled DeepLakeDataLoader object.

Parameters:

shuffle (bool) – Whether to shuffle the elements. Defaults to True.
buffer_size (int) – Size of the buffer used to shuffle the data, in MB. Defaults to 2048. Increasing buffer_size increases the extent of shuffling.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Raises:

ValueError – If .shuffle() has already been called.
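
Why a larger buffer increases the extent of shuffling can be seen in a generic buffer-based shuffle. The sketch below is an illustration of that general technique, not Deep Lake's implementation: buffered_shuffle is a hypothetical helper, and its buffer_len counts samples, whereas the documented buffer_size is measured in MB.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(samples: Iterable[T], buffer_len: int, seed: int = 0) -> Iterator[T]:
    """Yield samples in a locally shuffled order using a bounded buffer."""
    if buffer_len < 1:
        raise ValueError("buffer_len must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []
    for sample in samples:
        if len(buffer) < buffer_len:
            buffer.append(sample)
            continue
        # Buffer is full: emit a random buffered sample, then store the new one in its slot.
        idx = rng.randrange(buffer_len)
        yield buffer[idx]
        buffer[idx] = sample
    # The stream is exhausted: drain the remaining samples in random order.
    rng.shuffle(buffer)
    yield from buffer


order = list(buffered_shuffle(range(100), buffer_len=8))
# A sample can be emitted at most buffer_len - 1 positions earlier than its original index.
max_lead = max(original - position for position, original in enumerate(order))
print(max_lead <= 7)  # True
```

Because a sample can only be emitted once it has entered the buffer, a small buffer keeps samples close to their original order, while a buffer as large as the dataset yields a full shuffle.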

transform(transform: Callable | Dict[str, Callable | None], **kwargs: Dict)

Returns a transformed DeepLakeDataLoader object.

Parameters:

transform (Callable or Dict[str, Callable | None]) – A function, or a dictionary mapping tensor names to functions, to apply to the data.
kwargs – Additional arguments to be passed to transform. Only applicable if transform is a callable. Ignored if transform is a dictionary.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Raises:

ValueError – If .transform() has already been called.
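
The difference between the two forms of transform can be sketched in plain Python. apply_transform below is a hypothetical helper, not part of Deep Lake, and it rests on assumptions that the text above does not spell out: a callable receives the whole sample as a dictionary, a None entry in the dictionary form leaves that tensor unchanged, and only tensors named in the dictionary are returned.

```python
from typing import Any, Callable, Dict, Optional, Union

Sample = Dict[str, Any]
TransformSpec = Union[Callable[..., Sample], Dict[str, Optional[Callable[[Any], Any]]]]


def apply_transform(sample: Sample, transform: TransformSpec, **kwargs: Any) -> Sample:
    """Apply a whole-sample callable or a per-tensor dict of callables (illustrative only)."""
    if callable(transform):
        # Callable form: receives the whole sample plus any extra keyword arguments.
        return transform(sample, **kwargs)
    # Dictionary form: kwargs are ignored, matching the parameter description.
    # Assumption: a None value passes that tensor through unchanged.
    return {
        name: sample[name] if fn is None else fn(sample[name])
        for name, fn in transform.items()
    }


def scale(sample: Sample, factor: float = 1.0) -> Sample:
    """Whole-sample transform that multiplies every pixel by factor."""
    return {**sample, "images": [p * factor for p in sample["images"]]}


sample = {"images": [0, 128, 255], "labels": 3}
per_tensor = apply_transform(sample, {"images": lambda px: [p / 255 for p in px], "labels": None})
whole = apply_transform(sample, scale, factor=2)
print(per_tensor["labels"])  # 3
print(whole["images"])       # [0, 256, 510]
```

The dictionary form suits per-tensor libraries such as torchvision, whose transforms operate on a single image, while the callable form is the natural choice when several tensors must be transformed together.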