Dataloader

Train your models using the new high-performance C++ dataloader. See the dataloader method for how to create a dataloader from your dataset:

Dataset.dataloader

Returns a DeepLakeDataLoader object.

DeepLakeDataLoader

class deeplake.enterprise.DeepLakeDataLoader
batch(batch_size: int, drop_last: bool = False)

Returns a batched DeepLakeDataLoader object.

Parameters:
  • batch_size (int) – Number of samples in each batch.

  • drop_last (bool) – If True, the last batch will be dropped if its size is less than batch_size. Defaults to False.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Raises:

ValueError – If .batch() has already been called.
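
The effect of drop_last is easy to reason about: a minimal sketch of the batch-count arithmetic, assuming a dataset of n_samples items (the helper function is illustrative, not part of the API).

```python
import math

def num_batches(n_samples: int, batch_size: int, drop_last: bool) -> int:
    # With drop_last=True the trailing partial batch is discarded,
    # so the count rounds down; otherwise it rounds up.
    if drop_last:
        return n_samples // batch_size
    return math.ceil(n_samples / batch_size)

# e.g. 60000 Fashion-MNIST samples with batch_size=32:
print(num_batches(60000, 32, drop_last=True))   # 1875
print(num_batches(60001, 32, drop_last=False))  # 1876
```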

numpy(num_workers: int = 0, tensors: Optional[List[str]] = None, num_threads: Optional[int] = None, prefetch_factor: int = 2, decode_method: Optional[Dict[str, str]] = None)

Returns a DeepLakeDataLoader object.

Parameters:
  • num_workers (int) – Number of workers to use for transforming and processing the data. Defaults to 0.

  • tensors (List[str], Optional) – List of tensors to load. If None, all tensors are loaded. Defaults to None.

  • num_threads (int, Optional) – Number of threads to use for fetching and decompressing the data. If None, the number of threads is automatically determined. Defaults to None.

  • prefetch_factor (int) – Number of batches to transform and collate in advance per worker. Defaults to 2.

  • decode_method (Dict[str, str], Optional) –

    A dictionary of decode methods for each tensor. Defaults to None.

    • Supported decode methods are:

      'numpy':

      Default behaviour. Returns samples as numpy arrays.

      'tobytes':

      Returns the raw bytes of the samples.

      'pil':

      Returns samples as PIL images. Especially useful when transformations use torchvision transforms, which require PIL images as input. Only supported for tensors with sample_compression='jpeg' or 'png'.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Raises:

ValueError – If .pytorch() or .numpy() has already been called.

pytorch(num_workers: int = 0, collate_fn: Optional[Callable] = None, tensors: Optional[List[str]] = None, num_threads: Optional[int] = None, prefetch_factor: int = 2, distributed: bool = False, return_index: bool = True, decode_method: Optional[Dict[str, str]] = None)

Returns a DeepLakeDataLoader object.

Parameters:
  • num_workers (int) – Number of workers to use for transforming and processing the data. Defaults to 0.

  • collate_fn (Callable, Optional) – merges a list of samples to form a mini-batch of Tensor(s).

  • tensors (List[str], Optional) – List of tensors to load. If None, all tensors are loaded. Defaults to None.

  • num_threads (int, Optional) – Number of threads to use for fetching and decompressing the data. If None, the number of threads is automatically determined. Defaults to None.

  • prefetch_factor (int) – Number of batches to transform and collate in advance per worker. Defaults to 2.

  • distributed (bool) – Used for DDP training. Distributes different sections of the dataset to different ranks. Defaults to False.

  • return_index (bool) – Used to specify whether the loader should return the sample index or not. Defaults to True.

  • decode_method (Dict[str, str], Optional) –

    A dictionary of decode methods for each tensor. Defaults to None.

    • Supported decode methods are:

      'numpy':

      Default behaviour. Returns samples as numpy arrays.

      'tobytes':

      Returns the raw bytes of the samples.

      'pil':

      Returns samples as PIL images. Especially useful when transformations use torchvision transforms, which require PIL images as input. Only supported for tensors with sample_compression='jpeg' or 'png'.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Raises:

ValueError – If .pytorch() or .numpy() has already been called.
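
A custom collate_fn merges a list of per-sample dictionaries into one batched structure. The sketch below is illustrative only: the tensor names 'images' and 'labels' are assumed, and the dummy samples stand in for what the loader would actually yield.

```python
import numpy as np

def stack_collate(samples):
    # Stack per-sample dicts into batched arrays.
    # 'images' and 'labels' are hypothetical tensor names.
    return {
        "images": np.stack([s["images"] for s in samples]),
        "labels": np.array([s["labels"] for s in samples]),
    }

# Sketch of how it would be passed to the loader:
# loader = ds.dataloader().batch(32).pytorch(collate_fn=stack_collate)

# Dummy samples demonstrating the shape of the result:
batch = stack_collate([
    {"images": np.zeros((2, 2), dtype=np.uint8), "labels": 0},
    {"images": np.ones((2, 2), dtype=np.uint8), "labels": 1},
])
print(batch["images"].shape)  # (2, 2, 2)
print(list(batch["labels"]))  # [0, 1]
```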

query(query_string: str)

Returns a sliced DeepLakeDataLoader object with the given query results. It allows running SQL-like queries on the dataset and extracting the results. See the supported keywords and the Tensor Query Language documentation here.

Parameters:

query_string (str) – An SQL string, extended with additional functionality, to run on the dataset object.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Examples

>>> import deeplake
>>> ds_train = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> query_ds_train = ds_train.dataloader().query("select * where labels != 5")
>>> import deeplake
>>> ds_train = deeplake.load('hub://activeloop/coco-train')
>>> query_ds_train = ds_train.dataloader().query("(select * where contains(categories, 'car') limit 1000) union (select * where contains(categories, 'motorcycle') limit 1000)")

sample_by(weights: Union[str, list, tuple, ndarray], replace: Optional[bool] = True, size: Optional[int] = None)

Returns a sliced DeepLakeDataLoader with the given weighted sampler applied.

Parameters:
  • weights (Union[str, list, tuple, np.ndarray]) – If a string, a TQL expression is run to calculate the weights; a list, tuple, or ndarray is treated as a list of per-sample weights.

  • replace (bool, Optional) – If True, samples can be repeated in the result view. Defaults to True.

  • size (int, Optional) – The length of the result view. Defaults to len(dataset).

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Examples

Sample the dataloader so that labels == 5 is drawn twice as often as labels == 6

>>> ds = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> sampled_ds = ds.dataloader().sample_by("max_weight(labels == 5: 10, labels == 6: 5)")

Sample the dataloader treating labels tensor as weights.

>>> ds = deeplake.load('hub://activeloop/fashion-mnist-train')
>>> sampled_ds = ds.dataloader().sample_by("labels")

Sample the dataloader with the given weights:

>>> ds_train = deeplake.load('hub://activeloop/coco-train')
>>> weights = list()
>>> for i in range(0, len(ds_train)):
...     weights.append(i % 5)
...
>>> sampled_ds = ds_train.dataloader().sample_by(weights, replace=False)

shuffle(shuffle: bool = True, buffer_size: int = 2048)

Returns a shuffled DeepLakeDataLoader object.

Parameters:
  • shuffle (bool) – Whether to shuffle the elements or not. Defaults to True.

  • buffer_size (int) – The size, in MB, of the buffer used to shuffle the data. Defaults to 2048 MB. Increasing buffer_size increases the extent of shuffling.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Raises:
  • ValueError – If .shuffle() has already been called.

  • ValueError – If the dataset is a view and shuffle is True
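
Why a larger buffer shuffles more thoroughly can be seen from a pure-Python sketch of buffer-based shuffling (the real loader is implemented in C++ and measures its buffer in MB, not items; this toy generator counts items):

```python
import random

def buffered_shuffle(stream, buffer_size, seed=0):
    # Keep a bounded buffer; as each new item arrives, emit a random
    # element from the buffer. A bigger buffer mixes items drawn from
    # a wider window of the stream, i.e. shuffles more thoroughly.
    rng = random.Random(seed)
    buf = []
    for item in stream:
        buf.append(item)
        if len(buf) >= buffer_size:
            yield buf.pop(rng.randrange(len(buf)))
    while buf:  # drain the buffer at the end of the stream
        yield buf.pop(rng.randrange(len(buf)))

out = list(buffered_shuffle(range(10), buffer_size=4))
print(sorted(out) == list(range(10)))  # True: a permutation, nothing lost
```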

transform(transform: Union[Callable, Dict[str, Optional[Callable]]], **kwargs: Dict)

Returns a transformed DeepLakeDataLoader object.

Parameters:
  • transform (Callable or Dict[Callable]) – A function or dictionary of functions to apply to the data.

  • kwargs – Additional arguments to be passed to transform. Only applicable if transform is a callable. Ignored if transform is a dictionary.

Returns:

A DeepLakeDataLoader object.

Return type:

DeepLakeDataLoader

Raises:

ValueError – If .transform() has already been called.
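
A transform passed as a single callable receives each sample and returns the modified sample. A minimal sketch, assuming hypothetical tensor names 'images' and 'labels' and dict-shaped samples; the dummy input stands in for what the loader would yield:

```python
import numpy as np

def normalize(sample):
    # Scale uint8 image data to float32 in [0, 1].
    # 'images' is a hypothetical tensor name used for illustration.
    sample["images"] = sample["images"].astype(np.float32) / 255.0
    return sample

# Sketch of how it would be attached to the loader:
# loader = ds.dataloader().transform(normalize).pytorch()

# Dummy sample demonstrating the effect:
out = normalize({"images": np.array([0, 128, 255], dtype=np.uint8), "labels": 3})
print(out["images"].dtype)  # float32
```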