Modal¶
Modal is a serverless cloud platform that lets you run Python functions on GPU infrastructure without managing servers. Functions spin up in seconds with the exact GPU you need, and shut down automatically when done. You only pay for what you use.
Deeplake + Modal is a natural fit for fine-tuning: your training data lives in Deeplake, Modal provides the GPU, and Deeplake's PyTorch streaming dataloader feeds data directly from cloud storage into the training loop. No dataset download, no local disk, no data pipeline to maintain.
Objective¶
Fine-tune Qwen3-32B with QLoRA on a Modal L40S GPU using Unsloth for 2-5x faster training, streaming the dataset from Deeplake.
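Why a single L40S (48 GB) is enough for a 32B model is worth a quick back-of-the-envelope check. The figures below are rough illustrative estimates, not measured values:

```python
# Rough QLoRA memory estimate for a 32B-parameter model (illustrative).
params = 32e9

# 4-bit base weights: ~0.5 bytes per parameter.
base_weights_gb = params * 0.5 / 1e9  # ~16 GB

# LoRA adapters touch well under 1% of parameters; gradients and
# optimizer state exist only for those. Assume ~0.5% of params at
# ~10 bytes each (bf16 weight + grad + optimizer state).
adapter_overhead_gb = params * 0.005 * (2 + 4 + 4) / 1e9  # ~1.6 GB

print(f"base weights: ~{base_weights_gb:.0f} GB")
print(f"adapter + optimizer: ~{adapter_overhead_gb:.1f} GB")
# The rest of the 48 GB absorbs activations, which gradient
# checkpointing keeps manageable at 4k sequence length.
```

The takeaway: quantized weights dominate, and the trainable state is tiny, which is exactly why QLoRA fits where full fine-tuning cannot.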
Architecture¶
┌──────────────────────────┐
│ Your laptop │
│ modal run finetune.py │
└────────────┬─────────────┘
│
┌────────▼────────────────────────┐
│ Modal (L40S 48GB) │
│ │
│ Qwen3-32B (QLoRA, 4-bit) │
│ ▲ │
│ │ ds.pytorch() │
│ │ (streaming dataloader) │
│ │ │
│ Deeplake SDK ──── streams ────┼──► Deeplake Cloud
│ │ (training data)
└─────────────────────────────────┘
Data is streamed on-demand from Deeplake's object storage directly into GPU memory. The training machine never downloads the full dataset. The memory footprint stays constant whether the dataset is 1 GB or 1 TB.
Prerequisites¶
Store your Deeplake token as a Modal secret:
modal secret create deeplake-creds \
DEEPLAKE_API_KEY=YOUR_TOKEN \
DEEPLAKE_WORKSPACE=YOUR_WORKSPACE
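Inside a Modal function that attaches this secret, the keys appear as plain environment variables. A minimal sketch of reading them (the variable names match the secret created above; the helper itself is illustrative):

```python
import os

def deeplake_creds() -> tuple[str, str]:
    """Read the Deeplake credentials injected by the Modal secret."""
    token = os.environ["DEEPLAKE_API_KEY"]
    workspace = os.environ["DEEPLAKE_WORKSPACE"]
    return token, workspace
```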
Step 1: Prepare Training Data in Deeplake¶
Upload your instruction-tuning dataset to a Deeplake table. Each row has an instruction and a response:
from deeplake import Client
client = Client()
# Ingest the dataset
client.ingest("sft_dataset", {
"instruction": [
"Explain quantum computing in simple terms.",
"Write a Python function to merge two sorted lists.",
"Summarize the key ideas of the attention mechanism.",
# ... thousands more rows
],
"response": [
"Quantum computing uses qubits that can be 0 and 1 at the same time...",
"def merge_sorted(a, b):\n result = []\n i = j = 0\n while i < len(a) and j < len(b):\n ...",
"The attention mechanism allows a model to focus on relevant parts of the input...",
# ... matching completions
],
})
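Datasets often arrive as a list of records rather than parallel column lists. A small helper can pivot them into the column-oriented dict that `client.ingest` takes above (`records_to_columns` is illustrative, not part of the SDK):

```python
def records_to_columns(records):
    """Pivot [{'instruction': ..., 'response': ...}, ...] into
    {'instruction': [...], 'response': [...]} for ingestion."""
    columns = {}
    for row in records:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

rows = [
    {"instruction": "Explain quantum computing in simple terms.",
     "response": "Quantum computing uses qubits..."},
    {"instruction": "Write a Python function to merge two sorted lists.",
     "response": "def merge_sorted(a, b): ..."},
]
cols = records_to_columns(rows)
# cols is now in the shape client.ingest("sft_dataset", cols) expects.
```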
Alternatively, create the table and insert rows via the REST API:
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
# Requires: export DEEPLAKE_WORKSPACE="your-workspace"
API_URL="https://api.deeplake.ai"
# Create the table
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "CREATE TABLE IF NOT EXISTS \"'"$DEEPLAKE_WORKSPACE"'\".\"sft_dataset\" (id SERIAL PRIMARY KEY, instruction TEXT, response TEXT) USING deeplake"
}'
# Insert training examples (batch)
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "INSERT INTO \"'"$DEEPLAKE_WORKSPACE"'\".\"sft_dataset\" (instruction, response) VALUES ($1, $2), ($3, $4), ($5, $6)",
"params": ["Explain quantum computing in simple terms.", "Quantum computing uses qubits that can be 0 and 1 at the same time...", "Write a Python function to merge two sorted lists.", "def merge_sorted(a, b):\n result = []\n ...", "Summarize the key ideas of the attention mechanism.", "The attention mechanism allows a model to focus on relevant parts of the input..."]
}'
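Writing `($1, $2), ($3, $4), …` by hand does not scale past a few rows. Assuming the same numbered-placeholder convention as the query above, a helper can generate the VALUES clause and flat params list for any batch size (`build_insert_batch` is illustrative, not part of any SDK):

```python
def build_insert_batch(pairs):
    """Given [(instruction, response), ...], return the VALUES clause
    and the flat parameter list for a numbered-placeholder INSERT."""
    placeholders = ", ".join(
        f"(${2 * i + 1}, ${2 * i + 2})" for i in range(len(pairs))
    )
    params = [value for pair in pairs for value in pair]
    return placeholders, params

values, params = build_insert_batch([
    ("Explain quantum computing in simple terms.",
     "Quantum computing uses qubits..."),
    ("Write a Python function to merge two sorted lists.",
     "def merge_sorted(a, b): ..."),
])
print(values)   # ($1, $2), ($3, $4)
print(len(params))  # 4
```

Interpolate `values` into the query string and send `params` as the `"params"` array.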
Step 2: Fine-Tune on Modal¶
This is the complete fine-tuning script. It follows Modal's recommended Unsloth setup, replacing the dataset source with Deeplake streaming:
import modal
app = modal.App("deeplake-finetune")
# Container image: Unsloth handles torch, transformers, peft, trl, bitsandbytes
finetune_image = (
modal.Image.debian_slim(python_version="3.11")
.pip_install(
"unsloth[cu124-torch260]",
"deeplake",
)
)
# Persist adapter weights across runs
output_vol = modal.Volume.from_name("finetune-output", create_if_missing=True)
# Cache model weights so they're not re-downloaded every run
model_vol = modal.Volume.from_name("model-cache", create_if_missing=True)
MODEL_ID = "unsloth/Qwen3-32B"
MAX_SEQ_LENGTH = 4096
@app.function(
image=finetune_image,
gpu="L40S",
secrets=[modal.Secret.from_name("deeplake-creds")],
volumes={"/output": output_vol, "/model-cache": model_vol},
timeout=6 * 3600, # 6 hours max
)
def finetune():
import os
os.environ["HF_HOME"] = "/model-cache"
from unsloth import FastLanguageModel
from deeplake import Client
from trl import SFTTrainer, SFTConfig
# ── 1. Setup and stream the dataset ─────────────────────────────
client = Client()
ds = client.open_table("sft_dataset")
train_dataset = ds.pytorch()
print(f"Training dataset: {len(ds)} rows (streamed from Deeplake)")
# ── 2. Load model with Unsloth (4-bit, 2x faster) ──────────────
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_ID,
max_seq_length=MAX_SEQ_LENGTH,
load_in_4bit=True,
dtype=None, # auto-detect
)
print(f"Model loaded: {MODEL_ID} (4-bit)")
# ── 3. Configure LoRA adapters ──────────────────────────────────
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0.0,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
use_rslora=False,
use_gradient_checkpointing="unsloth", # 60% less VRAM
)
model.print_trainable_parameters()
# ── 4. Format training data ─────────────────────────────────────
# Deeplake streams rows as dicts. Convert to chat-format strings.
def formatting_func(example):
return (
f"<|im_start|>user\n{example['instruction']}<|im_end|>\n"
f"<|im_start|>assistant\n{example['response']}<|im_end|>"
)
# ── 5. Train ────────────────────────────────────────────────────
training_args = SFTConfig(
output_dir="/output/qwen3-32b-sft",
num_train_epochs=3,
per_device_train_batch_size=16,
gradient_accumulation_steps=1,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.06,
weight_decay=0.01,
optim="adamw_8bit",
bf16=True,
logging_steps=10,
save_steps=200,
save_total_limit=3,
max_seq_length=MAX_SEQ_LENGTH,
)
trainer = SFTTrainer(
model=model,
train_dataset=train_dataset,
formatting_func=formatting_func,
args=training_args,
tokenizer=tokenizer,
)
print("Starting training...")
trainer.train()
# ── 6. Save adapter weights ─────────────────────────────────────
trainer.save_model("/output/qwen3-32b-sft/final")
tokenizer.save_pretrained("/output/qwen3-32b-sft/final")
output_vol.commit()
print("Done. Adapter saved to /output/qwen3-32b-sft/final")
@app.local_entrypoint()
def main():
finetune.remote()
print("Fine-tuning job submitted to Modal")
Run it:
modal run finetune.py
This launches a single L40S on Modal, loads Qwen3-32B in 4-bit with Unsloth (2-5x faster than standard training), and streams training data directly from Deeplake. The LoRA adapter weights are saved to a Modal volume.
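Before burning GPU hours, the chat formatting is worth a sanity check on your laptop. This standalone copy of the script's `formatting_func` runs with no dependencies:

```python
def formatting_func(example):
    return (
        f"<|im_start|>user\n{example['instruction']}<|im_end|>\n"
        f"<|im_start|>assistant\n{example['response']}<|im_end|>"
    )

sample = {
    "instruction": "Explain transformers in one sentence.",
    "response": "Transformers map sequences to sequences using attention.",
}
text = formatting_func(sample)
print(text)
# Every formatted example should open a user turn and close the
# assistant turn -- a malformed template silently degrades training.
assert text.startswith("<|im_start|>user\n")
assert text.endswith("<|im_end|>")
```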
Step 3: Download the Adapter¶
Pull the fine-tuned adapter weights from the Modal volume:
modal volume get finetune-output qwen3-32b-sft/final ./my-adapter
Load the adapter locally for inference:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="./my-adapter",
max_seq_length=4096,
load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
inputs = tokenizer(
"<|im_start|>user\nExplain transformers in one sentence.<|im_end|>\n<|im_start|>assistant\n",
return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Why Streaming Matters¶
| Approach | Setup time | Disk needed | Works for 1 TB datasets? |
|---|---|---|---|
| Download dataset first | Minutes to hours | Full dataset size | No (disk limit) |
| Deeplake streaming | Instant | ~0 (stream on-demand) | Yes |
Deeplake's ds.pytorch() uses optimized C++ kernels to stream data from cloud storage directly into the training loop. The memory footprint stays constant regardless of dataset size, whether you have 10,000 rows or 10 million.
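The constant-memory property follows from iteration: a streaming dataloader is, conceptually, a generator that fetches one batch at a time and never materializes the full dataset. A plain-Python sketch of the idea (not Deeplake's actual implementation, which does this in C++ with prefetching):

```python
def stream_batches(fetch_rows, total_rows, batch_size):
    """Yield batches by fetching row ranges on demand.

    fetch_rows(start, stop) stands in for a remote range read;
    only one batch is ever held in memory at a time.
    """
    for start in range(0, total_rows, batch_size):
        yield fetch_rows(start, min(start + batch_size, total_rows))

# Simulated remote table: rows are produced lazily, never all at once.
def fetch_rows(start, stop):
    return [{"instruction": f"prompt {i}", "response": f"answer {i}"}
            for i in range(start, stop)]

sizes = [len(b) for b in stream_batches(fetch_rows, total_rows=10, batch_size=4)]
print(sizes)  # [4, 4, 2]
```

Peak memory here is one batch, independent of `total_rows`, which is why the same training loop works for 10,000 rows or 10 million.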
Tips¶
Iterate on a smaller model first
Test your pipeline with unsloth/Qwen3-4B before moving to the 32B. It trains much faster and fits on any GPU:
MODEL_ID = "unsloth/Qwen3-4B"
Scale to A100 for larger models or longer sequences
For full-precision LoRA or max-length sequences, switch to A100:
gpu="A100-80GB"  # in the @app.function decorator
Monitor training
Logs stream to your terminal in real time:
modal app logs deeplake-finetune
Or check the Modal dashboard for GPU utilization and logs.
What to try next¶
- GPU-Streaming Pipeline: more details on Deeplake's PyTorch streaming
- E2B Sandboxes: persistent storage for ephemeral sandboxes
- Retrieval to Training: use vector search to curate training sets