Massive Data Ingestion

When dealing with petabyte-scale data lakes or millions of video segments, standard row-by-row insertion becomes a major bottleneck. This example focuses on tuning the Deeplake Managed Service for maximum throughput.

Objective

Ingest 100,000+ files efficiently using buffered writes, parallel file normalization, and high-concurrency storage uploads.

Prerequisites

  • Deeplake SDK: pip install deeplake
  • A high-bandwidth network connection (S3/GCS optimized).
  • A Deeplake API token.

Set credentials first

export DEEPLAKE_API_KEY="your-token-here"
export DEEPLAKE_WORKSPACE="your-workspace"  # optional, defaults to "default"

Complete Code

from deeplake import Client

# 1. Setup
client = Client()

# 2. Performance-Tuned Ingestion
# Deeplake's SDK is optimized for high-throughput multimodal data.
# We can tune normalization and buffering to relieve CPU and network bottlenecks.

# Generate a list of paths (e.g., from a cloud bucket or local RAID)
file_list = [f"data/video_{i}.mp4" for i in range(100_000)]

print(f"Starting massive ingestion of {len(file_list)} files...")

result = client.ingest(
    table_name="massive_video_lake",
    data={"path": file_list},
    schema={"path": "FILE"},
    # Performance overrides (explained in "Performance Breakdown" below):
    normalization_workers=8,  # parallel threads for ffmpeg/PDF rendering (default 4)
    flush_every=1000,         # rows buffered in memory before a disk write (default 200)
    commit_every=10000,       # rows between ds.commit() checkpoints (default 2000)
)

print(f"Ingested {result['row_count']} segments into the lake.")
REST Alternative (Metadata Only)

# Requires: export DEEPLAKE_API_KEY="..." (see "Set credentials first" above)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
API_URL="https://api.deeplake.ai"
TABLE="massive_video_lake"

# 1. Create the table
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "CREATE TABLE IF NOT EXISTS \"'$DEEPLAKE_WORKSPACE'\".\"'$TABLE'\" (id BIGSERIAL PRIMARY KEY, filename TEXT, metadata JSONB) USING deeplake"
  }'

# 2. Batch insert metadata via the SQL endpoint to reduce round-trip latency.
# A single HTTP request can insert many rows at once using a multi-row VALUES clause with positional params.
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "INSERT INTO \"'$DEEPLAKE_WORKSPACE'\".\"'$TABLE'\" (filename, metadata) VALUES ($1, $2::jsonb), ($3, $4::jsonb), ($5, $6::jsonb)",
    "params": ["video_001.mp4", "{\"duration\": 120, \"fps\": 30}", "video_002.mp4", "{\"duration\": 95, \"fps\": 60}", "video_003.mp4", "{\"duration\": 200, \"fps\": 30}"]
  }'

# Note: Use REST for metadata, but always use the SDK for binary/tensor ingestion.
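Hand-writing positional placeholders does not scale past a few rows. As a minimal sketch (assuming the same `/tables/query` endpoint shape as the curl example above, with a `query` string and a flat `params` array), the multi-row payload can be built programmatically; `build_batch_insert` is an illustrative helper, not part of the SDK:

```python
import json

def build_batch_insert(workspace: str, table: str, rows: list[dict]) -> dict:
    """Build one multi-row INSERT payload for the /tables/query endpoint.

    Each row contributes two positional params: a filename and a JSON
    metadata string. Endpoint shape assumed from the curl example above.
    """
    placeholders = []
    params = []
    for i, row in enumerate(rows):
        base = 2 * i
        placeholders.append(f"(${base + 1}, ${base + 2}::jsonb)")
        params.extend([row["filename"], json.dumps(row["metadata"])])
    query = (
        f'INSERT INTO "{workspace}"."{table}" (filename, metadata) '
        f"VALUES {', '.join(placeholders)}"
    )
    return {"query": query, "params": params}

# Build one request body covering 500 rows, then POST it as JSON to
# $API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query with the same headers.
payload = build_batch_insert(
    "default", "massive_video_lake",
    [{"filename": f"video_{i:03d}.mp4", "metadata": {"fps": 30}} for i in range(500)],
)
```

Batching a few hundred rows per request is usually the sweet spot: large enough to amortize HTTP overhead, small enough to keep individual request bodies and failure retries manageable.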

Performance Breakdown

1. Parallel Normalization

When ingesting videos, images, or PDFs, the SDK must "normalize" the files (e.g., calling ffmpeg or PyMuPDF). By setting normalization_workers (internal default is 4), Deeplake processes multiple files simultaneously, utilizing all CPU cores and releasing the GIL for real parallelism.
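The SDK's internals are not public, but the pattern is a standard worker pool. A minimal sketch with `concurrent.futures` (here `normalize` is a stand-in for the real ffmpeg/PyMuPDF work):

```python
from concurrent.futures import ThreadPoolExecutor

def normalize(path: str) -> str:
    # Stand-in for the real normalization step (e.g. an ffmpeg subprocess).
    # Subprocess and C-extension calls release the GIL, which is why
    # threads give real parallelism for this workload.
    return path.upper()

def normalize_all(paths: list[str], workers: int = 8) -> list[str]:
    # map() preserves input order, so results line up with the file list.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(normalize, paths))
```

For pure-Python CPU work that never releases the GIL, `ProcessPoolExecutor` is the drop-in alternative; for subprocess-heavy normalization, threads are cheaper.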

2. Buffered Writes (flush_every)

Deeplake's storage engine works best with large chunks of data. Instead of writing every row to the lake immediately, the SDK buffers them in memory. Increasing flush_every reduces the frequency of Python-to-C++ transitions, which is critical when ingesting millions of small text chunks.
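The buffering behavior can be pictured with a small sketch (a simplified model, not the SDK's actual implementation; `sink` stands in for the chunked disk/network write):

```python
class BufferedWriter:
    """Accumulate rows in memory and hand them to `sink` in batches."""

    def __init__(self, sink, flush_every: int = 200):
        self.sink = sink
        self.flush_every = flush_every
        self.buf = []

    def write(self, row) -> None:
        self.buf.append(row)
        # One expensive sink call per `flush_every` rows instead of per row.
        if len(self.buf) >= self.flush_every:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.sink(self.buf)
            self.buf = []

# Usage: with flush_every=3, seven rows produce three sink calls, not seven.
batches = []
writer = BufferedWriter(batches.append, flush_every=3)
for i in range(7):
    writer.write(i)
writer.flush()  # drain the partial final batch
```

The trade-off is memory and durability: a larger buffer means fewer expensive transitions but more rows lost if the process dies between flushes, which is why `commit_every` checkpoints exist on top of it.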

3. Storage Concurrency

The SDK automatically sets deeplake.storage.set_concurrency(32) during ingestion. This ensures that network-bound uploads to S3 or Google Cloud Storage happen over dozens of parallel connections, saturating your available bandwidth.
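Bounded upload concurrency is easy to model with a semaphore. A hedged sketch of the pattern (the `upload` coroutine is a placeholder for the real S3/GCS transfer, not an SDK API):

```python
import asyncio

async def upload(part_id, sem: asyncio.Semaphore):
    # At most `concurrency` uploads run inside the semaphore at once.
    async with sem:
        await asyncio.sleep(0)  # stand-in for the real network transfer
        return part_id

async def upload_all(parts, concurrency: int = 32):
    sem = asyncio.Semaphore(concurrency)
    # gather() preserves submission order in its results.
    return await asyncio.gather(*(upload(p, sem) for p in parts))

results = asyncio.run(upload_all(range(100)))
```

The cap matters in both directions: too few connections leave bandwidth idle, while an unbounded fan-out can exhaust sockets or trip object-store rate limits.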

Tuning Guide

Parameter               Recommended for Small Data   Recommended for Massive Data
flush_every             200                          1,000 - 5,000
commit_every            2,000                        10,000 - 50,000
normalization_workers   4                            8 - 16 (match CPU core count)
File Type               Mix of types                 Group by type for consistent processing
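The table above can be folded into a small helper that picks keyword arguments for `client.ingest()`. This is an illustrative heuristic only; the 10,000-file threshold and the `tuning_for` name are assumptions, not SDK behavior:

```python
import os

def tuning_for(n_files: int) -> dict:
    """Map dataset size to ingest tuning kwargs (illustrative thresholds)."""
    if n_files < 10_000:
        # Small data: defaults are fine.
        return {"flush_every": 200, "commit_every": 2_000,
                "normalization_workers": 4}
    # Massive data: larger buffers, fewer checkpoints, workers near core count.
    return {"flush_every": 1_000, "commit_every": 10_000,
            "normalization_workers": min(16, os.cpu_count() or 8)}

# Usage: result = client.ingest(..., **tuning_for(len(file_list)))
```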

What to try next