Skip to content

Academic Paper

Deeplake: a Lakehouse for Deep Learning

Authors: Activeloop Research Team

Published: arXiv:2209.10785 (2022)

Abstract

Traditional data lakes provide critical data infrastructure for analytical workloads by enabling time travel, running SQL queries, ingesting data with ACID transactions, and visualizing petabyte-scale datasets on cloud storage. They allow organizations to break down data silos, unlock data-driven decision-making, improve operational efficiency, and reduce costs. However, as deep learning usage increases, traditional data lakes are not well-designed for applications such as natural language processing (NLP), audio processing, computer vision, and applications involving non-tabular datasets.

This paper presents Deeplake, an open-source lakehouse for deep learning applications developed at Activeloop. Deeplake maintains the benefits of a vanilla data lake with one key difference: it stores complex data, such as images, videos, annotations, as well as tabular data, in the form of tensors and rapidly streams the data over the network to (a) Tensor Query Language, (b) in-browser visualization engine, or (c) deep learning frameworks without sacrificing GPU utilization. Datasets stored in Deeplake can be accessed from PyTorch, TensorFlow, JAX, and integrate with numerous MLOps tools.

Access the Paper

View on arXiv

Citation

If you use Deeplake in your research, please cite:

@article{deeplake2022,
  title={Deeplake: a Lakehouse for Deep Learning},
  author={Activeloop},
  journal={arXiv preprint arXiv:2209.10785},
  year={2022}
}