Dataloader

Train your models using the new high-performance C++ dataloader. See the dataloader method below for how to create dataloaders from your datasets:

Dataset.dataloader

Returns a DeepLakeDataLoader object.

DeepLakeDataLoader

class deeplake.enterprise.DeepLakeDataLoader
batch(batch_size: int, drop_last: bool = False)

Returns a batched DeepLakeDataLoader object.

Parameters
  • batch_size (int) – Number of samples in each batch.

  • drop_last (bool) – If True, the last batch will be dropped if its size is less than batch_size. Defaults to False.

Returns

A DeepLakeDataLoader object.

Return type

DeepLakeDataLoader

Raises

ValueError – If .batch() has already been called.
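
The effect of drop_last can be sketched with plain arithmetic (pure Python, no Deep Lake required; the sample count below is hypothetical):

```python
num_samples = 1000  # hypothetical dataset length
batch_size = 32

# drop_last=False keeps the final partial batch (ceiling division):
batches_kept = (num_samples + batch_size - 1) // batch_size

# drop_last=True discards the final 1000 % 32 == 8 samples (floor division):
batches_dropped = num_samples // batch_size

print(batches_kept, batches_dropped)  # 32 31
```

A typical chain would then be ds.dataloader().batch(32, drop_last=True).pytorch().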

close()

Shuts down the workers and releases the resources.

numpy(num_workers: int = 0, tensors: Optional[List[str]] = None, num_threads: Optional[int] = None, prefetch_factor: int = 2, decode_method: Optional[Dict[str, str]] = None, persistent_workers: bool = False)

Returns a DeepLakeDataLoader object.

Parameters
  • num_workers (int) – Number of workers to use for transforming and processing the data. Defaults to 0.

  • tensors (List[str], Optional) – List of tensors to load. If None, all tensors are loaded. Defaults to None.

  • num_threads (int, Optional) – Number of threads to use for fetching and decompressing the data. If None, the number of threads is automatically determined. Defaults to None.

  • prefetch_factor (int) – Number of batches to transform and collate in advance per worker. Defaults to 2.

  • persistent_workers (bool) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. Defaults to False.

  • decode_method (Dict[str, str], Optional) –

    A dictionary of decode methods for each tensor. Defaults to None.

    • Supported decode methods are:

      'numpy'

      Default behaviour. Returns samples as numpy arrays.

      'tobytes'

      Returns raw bytes of the samples.

      'pil'

      Returns samples as PIL images. Especially useful when transformations use torchvision transforms, which require PIL images as input. Only supported for tensors with sample_compression='jpeg' or 'png'.

Returns

A DeepLakeDataLoader object.

Return type

DeepLakeDataLoader

Raises

ValueError – If .pytorch() or .tensorflow() or .numpy() has already been called.

offset(off: int = 0)

Returns a shifted DeepLakeDataLoader object.

Parameters

off (int) – Index from which the dataloader will start iterating. Defaults to 0.

Returns

A DeepLakeDataLoader object.

Return type

DeepLakeDataLoader

Raises

ValueError – If .offset() has already been called.
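
offset only shifts the starting index of iteration; a pure-Python analogy of what the loader then walks over (dataset length and offset are hypothetical):

```python
num_samples = 100  # hypothetical dataset length
off = 40

# The loader behaves as if it iterated indices [off, num_samples):
indices = list(range(off, num_samples))

print(len(indices), indices[0], indices[-1])  # 60 40 99
```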

pytorch(num_workers: int = 0, collate_fn: Optional[Callable] = None, tensors: Optional[List[str]] = None, num_threads: Optional[int] = None, prefetch_factor: int = 2, distributed: bool = False, return_index: bool = True, decode_method: Optional[Dict[str, str]] = None, persistent_workers: bool = False)

Returns a DeepLakeDataLoader object.

Parameters
  • num_workers (int) – Number of workers to use for transforming and processing the data. Defaults to 0.

  • collate_fn (Callable, Optional) – merges a list of samples to form a mini-batch of Tensor(s).

  • tensors (List[str], Optional) – List of tensors to load. If None, all tensors are loaded. Defaults to None.

  • num_threads (int, Optional) – Number of threads to use for fetching and decompressing the data. If None, the number of threads is automatically determined. Defaults to None.

  • prefetch_factor (int) – Number of batches to transform and collate in advance per worker. Defaults to 2.

  • distributed (bool) – Used for DDP training. Distributes different sections of the dataset to different ranks. Defaults to False.

  • return_index (bool) – Specifies whether the loader should also return the sample index. Defaults to True.

  • persistent_workers (bool) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. Defaults to False.

  • decode_method (Dict[str, str], Optional) –

    A dictionary of decode methods for each tensor. Defaults to None.

    • Supported decode methods are:

      'numpy'

      Default behaviour. Returns samples as numpy arrays.

      'tobytes'

      Returns raw bytes of the samples.

      'pil'

      Returns samples as PIL images. Especially useful when transformations use torchvision transforms, which require PIL images as input. Only supported for tensors with sample_compression='jpeg' or 'png'.

Returns

A DeepLakeDataLoader object.

Return type

DeepLakeDataLoader

Raises

ValueError – If .pytorch() or .tensorflow() or .numpy() has already been called.

Examples

>>> import deeplake
>>> from torchvision import transforms
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> tform = transforms.Compose([
...     transforms.RandomRotation(20), # Image augmentation
...     transforms.ToTensor(), # Must convert to pytorch tensor for subsequent operations to run
...     transforms.Normalize([0.5], [0.5]),
... ])
...
>>> batch_size = 32
>>> # create dataloader by chaining the transform function and batch size; returns batches of pytorch tensors
>>> train_loader = ds_train.dataloader()\
...     .transform({'images': tform, 'labels': None})\
...     .batch(batch_size)\
...     .shuffle()\
...     .pytorch(decode_method={'images': 'pil'}) # return samples as PIL images for transforms
...
>>> # iterate over dataloader
>>> for i, sample in enumerate(train_loader):
...     pass
...
query(query_string: str)

Returns a sliced DeepLakeDataLoader object with the given query results. It allows running SQL-like queries on the dataset and extracting the results. See the supported keywords in the Tensor Query Language documentation.

Parameters

query_string (str) – A TQL query string (SQL extended with dataset-specific functionality) to run on the dataset object.

Returns

A DeepLakeDataLoader object.

Return type

DeepLakeDataLoader

Examples

>>> import deeplake
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> query_ds_train = ds_train.dataloader().query("select * where labels != 5")
>>> import deeplake
>>> ds_train = deeplake.load('hub://activeloop/coco-train')
>>> query_ds_train = ds_train.dataloader().query("(select * where contains(categories, 'car') limit 1000) union (select * where contains(categories, 'motorcycle') limit 1000)")
sample_by(weights: Union[str, list, tuple, ndarray], replace: Optional[bool] = True, size: Optional[int] = None)

Returns a sliced DeepLakeDataLoader object with the given weighted sampler applied.

Parameters
  • weights (Union[str, list, tuple, np.ndarray]) – If a string, a TQL expression is evaluated to compute per-sample weights. A list, tuple, or ndarray is treated as a list of per-sample weights.

  • replace (bool, Optional) – If True, samples can be repeated in the result view. Defaults to True.

  • size (int, Optional) – The length of the result view. Defaults to len(dataset).

Returns

A DeepLakeDataLoader object.

Return type

DeepLakeDataLoader

Examples

Sample the dataloader so that samples with labels == 5 are drawn twice as often as samples with labels == 6:

>>> ds = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> sampled_ds = ds.dataloader().sample_by("max_weight(labels == 5: 10, labels == 6: 5)")

Sample the dataloader treating the labels tensor as weights:

>>> ds = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> sampled_ds = ds.dataloader().sample_by("labels")

Sample the dataloader with the given weights:

>>> ds_train = deeplake.load('hub://activeloop/coco-train')
>>> weights = list()
>>> for i in range(0, len(ds_train)):
...     weights.append(i % 5)
...
>>> sampled_ds = ds_train.dataloader().sample_by(weights, replace=False)
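
When weights are given as a list/tuple/ndarray, this is ordinary weighted sampling. A minimal stdlib sketch of the replace=True behaviour, reusing the i % 5 weights from the example above (this is an illustration, not the Deep Lake implementation):

```python
import random

rng = random.Random(0)  # seeded for a reproducible sketch
weights = [i % 5 for i in range(10)]  # per-sample weights; indices 0 and 5 get weight 0

# With replace=True (the default), the same index may appear several times.
picked = rng.choices(range(len(weights)), weights=weights, k=6)

# Zero-weight samples are never drawn.
assert all(weights[i] > 0 for i in picked)
print(picked)
```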
shuffle(shuffle: bool = True, buffer_size: int = 2048)

Returns a shuffled DeepLakeDataLoader object.

Parameters
  • shuffle (bool) – Specifies whether to shuffle the elements. Defaults to True.

  • buffer_size (int) – The size of the buffer used to shuffle the data in MBs. Defaults to 2048 MB. Increasing the buffer_size will increase the extent of shuffling.

Returns

A DeepLakeDataLoader object.

Return type

DeepLakeDataLoader

Raises
  • ValueError – If .shuffle() has already been called.

  • ValueError – If the dataset is a view and shuffle is True.
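
Conceptually, the loader mixes the stream through a fixed-size buffer rather than computing a full permutation. A pure-Python sketch of buffer-based shuffling (buffer size counted in elements here, whereas the real buffer_size is in MB):

```python
import random

def buffer_shuffle(stream, buffer_size, seed=0):
    # Fill a buffer; once full, emit a random element for each new arrival.
    rng = random.Random(seed)
    buffer, out = [], []
    for item in stream:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            out.append(buffer.pop(rng.randrange(len(buffer))))
    rng.shuffle(buffer)  # drain the remainder in random order
    out.extend(buffer)
    return out

shuffled = buffer_shuffle(range(20), buffer_size=4)
assert sorted(shuffled) == list(range(20))  # same elements, new order
print(shuffled)
```

A larger buffer mixes elements from a wider window of the stream, which is why increasing buffer_size increases the extent of shuffling.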

tensorflow(num_workers: int = 0, collate_fn: Optional[Callable] = None, tensors: Optional[List[str]] = None, num_threads: Optional[int] = None, prefetch_factor: int = 2, return_index: bool = True, decode_method: Optional[Dict[str, str]] = None, persistent_workers: bool = False)

Returns a DeepLakeDataLoader object.

Parameters
  • num_workers (int) – Number of workers to use for transforming and processing the data. Defaults to 0.

  • collate_fn (Callable, Optional) – merges a list of samples to form a mini-batch of Tensor(s).

  • tensors (List[str], Optional) – List of tensors to load. If None, all tensors are loaded. Defaults to None.

  • num_threads (int, Optional) – Number of threads to use for fetching and decompressing the data. If None, the number of threads is automatically determined. Defaults to None.

  • prefetch_factor (int) – Number of batches to transform and collate in advance per worker. Defaults to 2.

  • return_index (bool) – Specifies whether the loader should also return the sample index. Defaults to True.

  • persistent_workers (bool) – If True, the data loader will not shut down the worker processes after a dataset has been consumed once. Defaults to False.

  • decode_method (Dict[str, str], Optional) –

    A dictionary of decode methods for each tensor. Defaults to None.

    • Supported decode methods are:

      'numpy'

      Default behaviour. Returns samples as numpy arrays.

      'tobytes'

      Returns raw bytes of the samples.

      'pil'

      Returns samples as PIL images. Especially useful when transformations use torchvision transforms, which require PIL images as input. Only supported for tensors with sample_compression='jpeg' or 'png'.

Returns

A DeepLakeDataLoader object.

Return type

DeepLakeDataLoader

Raises

ValueError – If .pytorch() or .tensorflow() or .numpy() has already been called.

Examples

>>> import deeplake
>>> from torchvision import transforms
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> batch_size = 32
>>> # create dataloader by chaining batch size and shuffle; returns batches of tensorflow tensors
>>> train_loader = ds_train.dataloader()\
...     .batch(batch_size)\
...     .shuffle()\
...     .tensorflow() # return batches as tensorflow tensors
...
>>> # iterate over dataloader
>>> for i, sample in enumerate(train_loader):
...     pass
...
transform(transform: Union[Callable, Dict[str, Optional[Callable]]], **kwargs: Dict)

Returns a transformed DeepLakeDataLoader object.

Parameters
  • transform (Callable or Dict[str, Optional[Callable]]) – A function, or a dictionary mapping tensor names to functions, to apply to the data.

  • kwargs – Additional arguments to be passed to transform. Only applicable if transform is a callable. Ignored if transform is a dictionary.

Returns

A DeepLakeDataLoader object.

Return type

DeepLakeDataLoader

Raises

ValueError – If .transform() has already been called.
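
The dictionary form maps each tensor name to a callable, with None meaning the tensor is passed through unchanged. A pure-Python sketch of that semantics (apply_transforms, the tensor names, and the sample values are hypothetical, not part of the Deep Lake API):

```python
def apply_transforms(sample, transforms):
    # Apply a per-tensor transform dict; None means "pass through unchanged".
    out = {}
    for name, value in sample.items():
        fn = transforms.get(name)
        out[name] = value if fn is None else fn(value)
    return out

sample = {'images': [1, 2, 3], 'labels': 7}         # hypothetical sample
tform = {'images': lambda xs: [x * 2 for x in xs],  # e.g. scale pixel values
         'labels': None}                            # leave labels untouched

print(apply_transforms(sample, tform))  # {'images': [2, 4, 6], 'labels': 7}
```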