Massive Data Ingestion¶
When dealing with petabyte-scale data lakes or millions of video segments, standard row-by-row insertion becomes a major bottleneck. This example focuses on tuning the Deeplake Managed Service for maximum throughput.
Objective¶
Ingest 100,000+ files efficiently using buffered writes, parallel file normalization, and high-concurrency storage uploads.
Prerequisites¶
- Deeplake SDK: `pip install deeplake`
- A high-bandwidth network connection (S3/GCS optimized).
- A Deeplake API token (set credentials first).
Complete Code¶
```python
from deeplake import Client

# 1. Setup
client = Client()

# 2. Performance-Tuned Ingestion
# Deeplake's SDK is optimized for high-throughput multimodal data.
# We can tune normalization and buffering to relieve CPU/network bottlenecks.

# Generate a list of paths (e.g., from a cloud bucket or local RAID)
file_list = [f"data/video_{i}.mp4" for i in range(10000)]

print(f"Starting massive ingestion of {len(file_list)} files...")

result = client.ingest(
    table_name="massive_video_lake",
    data={"path": file_list},
    schema={"path": "FILE"},
    # Performance overrides:
    # normalization_workers: parallel threads for ffmpeg/PDF rendering (default 4)
    # flush_every: rows buffered in memory before disk write (default 200)
    # commit_every: rows between ds.commit() checkpoints (default 2000)
    # normalization_workers=8,
    # flush_every=1000,
    # commit_every=10000,
)

print(f"Ingested {result['row_count']} segments into the lake.")
```
```shell
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
API_URL="https://api.deeplake.ai"
TABLE="massive_video_lake"

# 1. Create the table
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "CREATE TABLE IF NOT EXISTS \"'$DEEPLAKE_WORKSPACE'\".\"'$TABLE'\" (id BIGSERIAL PRIMARY KEY, filename TEXT, metadata JSONB) USING deeplake"
  }'

# 2. Batch-insert metadata via the SQL endpoint to reduce round-trip latency.
# A single HTTP request can insert many rows at once via a multi-row VALUES clause.
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPLAKE_API_KEY" \
  -H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
  -d '{
    "query": "INSERT INTO \"'$DEEPLAKE_WORKSPACE'\".\"'$TABLE'\" (filename, metadata) VALUES ($1, $2::jsonb), ($3, $4::jsonb), ($5, $6::jsonb)",
    "params": ["video_001.mp4", "{\"duration\": 120, \"fps\": 30}", "video_002.mp4", "{\"duration\": 95, \"fps\": 60}", "video_003.mp4", "{\"duration\": 200, \"fps\": 30}"]
  }'

# Note: Use REST for metadata, but always use the SDK for binary/tensor ingestion.
```
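The same multi-row INSERT can be built programmatically. The sketch below is a minimal, hypothetical helper (`build_batch_insert` is not part of the SDK) that expands a list of rows into the placeholder string and flat parameter list the SQL endpoint shown above expects; the resulting payload can be sent with any HTTP client.

```python
import json

def build_batch_insert(workspace, table, rows):
    # Expand (filename, metadata) pairs into one multi-row INSERT payload,
    # mirroring the curl example above: ($1, $2::jsonb), ($3, $4::jsonb), ...
    placeholders = ", ".join(f"(${2*i+1}, ${2*i+2}::jsonb)" for i in range(len(rows)))
    params = []
    for filename, metadata in rows:
        params.extend([filename, json.dumps(metadata)])
    return {
        "query": (
            f'INSERT INTO "{workspace}"."{table}" (filename, metadata) '
            f"VALUES {placeholders}"
        ),
        "params": params,
    }

payload = build_batch_insert(
    "my-workspace",
    "massive_video_lake",
    [("video_001.mp4", {"duration": 120, "fps": 30}),
     ("video_002.mp4", {"duration": 95, "fps": 60})],
)
print(payload["query"])
# INSERT INTO "my-workspace"."massive_video_lake" (filename, metadata) VALUES ($1, $2::jsonb), ($3, $4::jsonb)
```

Batching this way trades slightly larger request bodies for far fewer HTTP round trips, which dominates latency at high row counts.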
Performance Breakdown¶
1. Parallel Normalization¶
When ingesting videos, images, or PDFs, the SDK must "normalize" the files (e.g., calling ffmpeg or PyMuPDF). By raising normalization_workers (internal default is 4), Deeplake processes multiple files simultaneously; because tools like ffmpeg run outside the Python interpreter, the GIL is released and the work parallelizes across all CPU cores.
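The pattern can be sketched with the standard library. This is not the SDK's internal code; `normalize_file` is a hypothetical stand-in for the per-file normalization step, and a thread pool plays the role of `normalization_workers`:

```python
from concurrent.futures import ThreadPoolExecutor

def normalize_file(path):
    # Hypothetical stand-in for the SDK's per-file step. In practice this
    # shells out to ffmpeg/PyMuPDF, which releases the GIL, so threads
    # (not processes) are enough for real parallelism.
    return f"normalized:{path}"

def normalize_all(paths, normalization_workers=4):
    # One pool sized like normalization_workers; map preserves input order.
    with ThreadPoolExecutor(max_workers=normalization_workers) as pool:
        return list(pool.map(normalize_file, paths))

results = normalize_all([f"video_{i}.mp4" for i in range(8)], normalization_workers=8)
print(results[0])  # normalized:video_0.mp4
```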
2. Buffered Writes (flush_every)¶
Deeplake's storage engine works best with large chunks of data. Instead of writing every row to the lake immediately, the SDK buffers them in memory. Increasing flush_every reduces the frequency of Python-to-C++ transitions, which is critical when ingesting millions of small text chunks.
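The buffering behavior can be illustrated with a minimal sketch (this `BufferedWriter` class is illustrative only, not a Deeplake API):

```python
class BufferedWriter:
    # Sketch of a flush_every-style buffer: rows accumulate in memory and
    # are handed to storage in large chunks instead of one at a time.
    def __init__(self, flush_every=200):
        self.flush_every = flush_every
        self.buffer = []
        self.flush_count = 0

    def write_row(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_every:
            self.flush()

    def flush(self):
        if self.buffer:
            # In the real engine this is where the expensive boundary
            # crossing happens, so fewer flushes means less overhead.
            self.flush_count += 1
            self.buffer.clear()

writer = BufferedWriter(flush_every=1000)
for i in range(5000):
    writer.write_row({"id": i})
writer.flush()  # write any remainder
print(writer.flush_count)  # 5 flushes instead of 5000 row-writes
```

Raising flush_every from 200 to 1000 in this sketch cuts the number of boundary crossings fivefold, at the cost of holding more rows in memory.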
3. Storage Concurrency¶
The SDK automatically sets deeplake.storage.set_concurrency(32) during ingestion. This ensures that network-bound uploads to S3 or Google Cloud Storage happen over dozens of parallel connections, saturating your available bandwidth.
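A rough model of why 32 parallel connections help, assuming uploads are network-bound (the `upload_chunk` function below is a hypothetical stand-in, not an SDK call):

```python
from concurrent.futures import ThreadPoolExecutor

STORAGE_CONCURRENCY = 32  # mirrors the concurrency level described above

def upload_chunk(chunk_id):
    # Hypothetical stand-in for a network-bound PUT to S3/GCS. A single
    # connection spends most of its time waiting; 32 in-flight requests
    # can saturate bandwidth that one connection cannot.
    return f"uploaded chunk {chunk_id}"

with ThreadPoolExecutor(max_workers=STORAGE_CONCURRENCY) as pool:
    statuses = list(pool.map(upload_chunk, range(100)))

print(len(statuses))  # 100
```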
Tuning Guide¶
| Parameter | Recommended for Small Data | Recommended for Massive Data |
|---|---|---|
| `flush_every` | 200 | 1,000 - 5,000 |
| `commit_every` | 2,000 | 10,000 - 50,000 |
| `normalization_workers` | 4 | 8 - 16 (match CPU core count) |
| File Type | Mix of types | Group by type for consistent processing |
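The table above can be applied mechanically. The helper below is a hypothetical rule of thumb (not an SDK function), assuming "massive" means roughly 100,000+ rows as in the objective:

```python
import os

def suggest_ingest_params(row_count, cpu_count=None):
    # Hypothetical helper applying the tuning table: small jobs keep the
    # defaults; massive jobs raise buffering and match workers to CPU cores,
    # capped at 16.
    cpus = cpu_count or os.cpu_count() or 4
    if row_count < 100_000:
        return {"normalization_workers": 4, "flush_every": 200, "commit_every": 2_000}
    return {
        "normalization_workers": min(16, max(8, cpus)),
        "flush_every": 1_000,
        "commit_every": 10_000,
    }

print(suggest_ingest_params(1_000_000, cpu_count=12))
# {'normalization_workers': 12, 'flush_every': 1000, 'commit_every': 10000}
```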
What to try next¶
- GPU-Streaming Guide: now that your data is in the lake, train on it.
- Video Retrieval: specific tuning for video-heavy ingestion.
- Indexes Guide: ensure your indexes scale with your data.