Modal¶
Modal is a serverless cloud platform that lets you run Python functions on GPU infrastructure without managing servers. Functions spin up in seconds with the exact GPU you need, and shut down automatically when done. You only pay for what you use.
Deeplake + Modal is a natural fit for fine-tuning: your training data lives in Deeplake, Modal provides the GPU, and Deeplake's PyTorch streaming dataloader feeds data directly from cloud storage into the training loop. No dataset download, no local disk, no data pipeline to maintain.
Objective¶
Fine-tune Qwen3-32B with QLoRA on a Modal L40S GPU using Unsloth for 2-5x faster training, streaming the dataset from Deeplake.
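Why a single L40S (48 GB) is enough for a 32B model is worth a quick back-of-the-envelope check. The figures below are rough illustrative estimates, not measured values:

```python
# Rough QLoRA memory estimate for a 32B-parameter model (illustrative).
params = 32e9

# 4-bit base weights: ~0.5 bytes per parameter.
base_weights_gb = params * 0.5 / 1e9  # ~16 GB

# LoRA adapters touch well under 1% of parameters; gradients and
# optimizer state exist only for those. Assume ~0.5% of params at
# ~10 bytes each (bf16 weight + grad + optimizer state).
adapter_overhead_gb = params * 0.005 * (2 + 4 + 4) / 1e9  # ~1.6 GB

print(f"base weights: ~{base_weights_gb:.0f} GB")
print(f"adapter + optimizer: ~{adapter_overhead_gb:.1f} GB")
# The rest of the 48 GB absorbs activations, which gradient
# checkpointing keeps manageable at 4k sequence length.
```

The takeaway: quantized weights dominate, and the trainable state is tiny, which is exactly why QLoRA fits where full fine-tuning cannot.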
Architecture¶
┌──────────────────────────┐
│ Your laptop │
│ modal run finetune.py │
└────────────┬─────────────┘
│
┌────────▼────────────────────────┐
│ Modal (L40S 48GB) │
│ │
│ Qwen3-32B (QLoRA, 4-bit) │
│ ▲ │
│ │ ds.pytorch() │
│ │ (streaming dataloader) │
│ │ │
│ Deeplake SDK ──── streams ────┼──► Deeplake Cloud
│ │ (training data)
└─────────────────────────────────┘
Data is streamed on-demand from Deeplake's object storage directly into GPU memory. The training machine never downloads the full dataset. The memory footprint stays constant whether the dataset is 1 GB or 1 TB.
Prerequisites¶
Store your Deeplake token as a Modal secret:
modal secret create deeplake-creds \
DEEPLAKE_API_KEY=YOUR_TOKEN \
DEEPLAKE_WORKSPACE=YOUR_WORKSPACE
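Inside a Modal function that attaches this secret, the keys appear as plain environment variables. A minimal sketch of reading them (the variable names match the secret created above; the helper itself is illustrative):

```python
import os

def deeplake_creds() -> tuple[str, str]:
    """Read the Deeplake credentials injected by the Modal secret."""
    token = os.environ["DEEPLAKE_API_KEY"]
    workspace = os.environ["DEEPLAKE_WORKSPACE"]
    return token, workspace
```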
Step 1: Prepare Training Data in Deeplake¶
Upload your instruction-tuning dataset to a Deeplake table. Each row has an instruction and a response:
from deeplake import Client
client = Client()
# Ingest the dataset
client.ingest("sft_dataset", {
"instruction": [
"Explain quantum computing in simple terms.",
"Write a Python function to merge two sorted lists.",
"Summarize the key ideas of the attention mechanism.",
# ... thousands more rows
],
"response": [
"Quantum computing uses qubits that can be 0 and 1 at the same time...",
"def merge_sorted(a, b):\n result = []\n i = j = 0\n while i < len(a) and j < len(b):\n ...",
"The attention mechanism allows a model to focus on relevant parts of the input...",
# ... matching completions
],
})
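Datasets often arrive as a list of records rather than parallel column lists. A small helper can pivot them into the column-oriented dict that `client.ingest` takes above (`records_to_columns` is illustrative, not part of the SDK):

```python
def records_to_columns(records):
    """Pivot [{'instruction': ..., 'response': ...}, ...] into
    {'instruction': [...], 'response': [...]} for ingestion."""
    columns = {}
    for row in records:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns

rows = [
    {"instruction": "Explain quantum computing in simple terms.",
     "response": "Quantum computing uses qubits..."},
    {"instruction": "Write a Python function to merge two sorted lists.",
     "response": "def merge_sorted(a, b): ..."},
]
cols = records_to_columns(rows)
# cols is now in the shape client.ingest("sft_dataset", cols) expects.
```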
Alternatively, create the table and insert rows via the REST API:
# Requires: export DEEPLAKE_API_KEY="..." (see quickstart)
# Requires: export DEEPLAKE_ORG_ID="your-org-id"
# Requires: export DEEPLAKE_WORKSPACE="your-workspace"
API_URL="https://api.deeplake.ai"
# Create the table
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "CREATE TABLE IF NOT EXISTS \"'"$DEEPLAKE_WORKSPACE"'\".\"sft_dataset\" (id SERIAL PRIMARY KEY, instruction TEXT, response TEXT) USING deeplake"
}'
# Insert training examples (batch)
curl -s -X POST "$API_URL/workspaces/$DEEPLAKE_WORKSPACE/tables/query" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $DEEPLAKE_API_KEY" \
-H "X-Activeloop-Org-Id: $DEEPLAKE_ORG_ID" \
-d '{
"query": "INSERT INTO \"'"$DEEPLAKE_WORKSPACE"'\".\"sft_dataset\" (instruction, response) VALUES ($1, $2), ($3, $4), ($5, $6)",
"params": ["Explain quantum computing in simple terms.", "Quantum computing uses qubits that can be 0 and 1 at the same time...", "Write a Python function to merge two sorted lists.", "def merge_sorted(a, b):\n result = []\n ...", "Summarize the key ideas of the attention mechanism.", "The attention mechanism allows a model to focus on relevant parts of the input..."]
}'
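Writing `($1, $2), ($3, $4), …` by hand does not scale past a few rows. Assuming the same numbered-placeholder convention as the query above, a helper can generate the VALUES clause and flat params list for any batch size (`build_insert_batch` is illustrative, not part of any SDK):

```python
def build_insert_batch(pairs):
    """Given [(instruction, response), ...], return the VALUES clause
    and the flat parameter list for a numbered-placeholder INSERT."""
    placeholders = ", ".join(
        f"(${2 * i + 1}, ${2 * i + 2})" for i in range(len(pairs))
    )
    params = [value for pair in pairs for value in pair]
    return placeholders, params

values, params = build_insert_batch([
    ("Explain quantum computing in simple terms.",
     "Quantum computing uses qubits..."),
    ("Write a Python function to merge two sorted lists.",
     "def merge_sorted(a, b): ..."),
])
print(values)   # ($1, $2), ($3, $4)
print(len(params))  # 4
```

Interpolate `values` into the query string and send `params` as the `"params"` array.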
Step 2: Fine-Tune on Modal¶
This is the complete fine-tuning script. It follows Modal's recommended Unsloth setup, replacing the dataset source with Deeplake streaming:
import modal
app = modal.App("deeplake-finetune")
# Container image: Unsloth handles torch, transformers, peft, trl, bitsandbytes
finetune_image = (
modal.Image.debian_slim(python_version="3.11")
.pip_install(
"unsloth[cu124-torch260]",
"deeplake",
)
)
# Persist adapter weights across runs
output_vol = modal.Volume.from_name("finetune-output", create_if_missing=True)
# Cache model weights so they're not re-downloaded every run
model_vol = modal.Volume.from_name("model-cache", create_if_missing=True)
MODEL_ID = "unsloth/Qwen3-32B"
MAX_SEQ_LENGTH = 4096
@app.function(
image=finetune_image,
gpu="L40S",
secrets=[modal.Secret.from_name("deeplake-creds")],
volumes={"/output": output_vol, "/model-cache": model_vol},
timeout=6 * 3600, # 6 hours max
)
def finetune():
import os
os.environ["HF_HOME"] = "/model-cache"
from unsloth import FastLanguageModel
from deeplake import Client
from trl import SFTTrainer, SFTConfig
# ── 1. Setup and stream the dataset ─────────────────────────────
client = Client()
ds = client.open_table("sft_dataset")
train_dataset = ds.pytorch()
print(f"Training dataset: {len(ds)} rows (streamed from Deeplake)")
# ── 2. Load model with Unsloth (4-bit, 2x faster) ──────────────
model, tokenizer = FastLanguageModel.from_pretrained(
model_name=MODEL_ID,
max_seq_length=MAX_SEQ_LENGTH,
load_in_4bit=True,
dtype=None, # auto-detect
)
print(f"Model loaded: {MODEL_ID} (4-bit)")
# ── 3. Configure LoRA adapters ──────────────────────────────────
model = FastLanguageModel.get_peft_model(
model,
r=16,
lora_alpha=16,
lora_dropout=0.0,
target_modules=[
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
use_rslora=False,
use_gradient_checkpointing="unsloth", # 60% less VRAM
)
model.print_trainable_parameters()
# ── 4. Format training data ─────────────────────────────────────
# Deeplake streams rows as dicts. Convert to chat-format strings.
def formatting_func(example):
return (
f"<|im_start|>user\n{example['instruction']}<|im_end|>\n"
f"<|im_start|>assistant\n{example['response']}<|im_end|>"
)
# ── 5. Train ────────────────────────────────────────────────────
training_args = SFTConfig(
output_dir="/output/qwen3-32b-sft",
num_train_epochs=3,
per_device_train_batch_size=16,
gradient_accumulation_steps=1,
learning_rate=2e-4,
lr_scheduler_type="cosine",
warmup_ratio=0.06,
weight_decay=0.01,
optim="adamw_8bit",
bf16=True,
logging_steps=10,
save_steps=200,
save_total_limit=3,
max_seq_length=MAX_SEQ_LENGTH,
)
trainer = SFTTrainer(
model=model,
train_dataset=train_dataset,
formatting_func=formatting_func,
args=training_args,
tokenizer=tokenizer,
)
print("Starting training...")
trainer.train()
# ── 6. Save adapter weights ─────────────────────────────────────
trainer.save_model("/output/qwen3-32b-sft/final")
tokenizer.save_pretrained("/output/qwen3-32b-sft/final")
output_vol.commit()
print("Done. Adapter saved to /output/qwen3-32b-sft/final")
@app.local_entrypoint()
def main():
finetune.remote()
print("Fine-tuning job submitted to Modal")
Run it:
modal run finetune.py
This launches a single L40S on Modal, loads Qwen3-32B in 4-bit with Unsloth (2-5x faster than standard training), and streams training data directly from Deeplake. The LoRA adapter weights are saved to a Modal volume.
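Before burning GPU hours, the chat formatting is worth a sanity check on your laptop. This standalone copy of the script's `formatting_func` runs with no dependencies:

```python
def formatting_func(example):
    return (
        f"<|im_start|>user\n{example['instruction']}<|im_end|>\n"
        f"<|im_start|>assistant\n{example['response']}<|im_end|>"
    )

sample = {
    "instruction": "Explain transformers in one sentence.",
    "response": "Transformers map sequences to sequences using attention.",
}
text = formatting_func(sample)
print(text)
# Every formatted example should open a user turn and close the
# assistant turn -- a malformed template silently degrades training.
assert text.startswith("<|im_start|>user\n")
assert text.endswith("<|im_end|>")
```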
Step 3: Download the Adapter¶
Pull the fine-tuned adapter weights from the Modal volume:
modal volume get finetune-output qwen3-32b-sft/final ./my-adapter
Load the adapter locally for inference:
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
model_name="./my-adapter",
max_seq_length=4096,
load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
inputs = tokenizer(
"<|im_start|>user\nExplain transformers in one sentence.<|im_end|>\n<|im_start|>assistant\n",
return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
Why Streaming Matters¶
| Approach | Setup time | Disk needed | Works for 1 TB datasets? |
|---|---|---|---|
| Download dataset first | Minutes to hours | Full dataset size | No (disk limit) |
| Deeplake streaming | Instant | ~0 (stream on-demand) | Yes |
Deeplake's ds.pytorch() uses optimized C++ kernels to stream data from cloud storage directly into the training loop. The memory footprint stays constant regardless of dataset size, whether you have 10,000 rows or 10 million.
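The constant-memory property follows from iteration: a streaming dataloader is, conceptually, a generator that fetches one batch at a time and never materializes the full dataset. A plain-Python sketch of the idea (not Deeplake's actual implementation, which does this in C++ with prefetching):

```python
def stream_batches(fetch_rows, total_rows, batch_size):
    """Yield batches by fetching row ranges on demand.

    fetch_rows(start, stop) stands in for a remote range read;
    only one batch is ever held in memory at a time.
    """
    for start in range(0, total_rows, batch_size):
        yield fetch_rows(start, min(start + batch_size, total_rows))

# Simulated remote table: rows are produced lazily, never all at once.
def fetch_rows(start, stop):
    return [{"instruction": f"prompt {i}", "response": f"answer {i}"}
            for i in range(start, stop)]

sizes = [len(b) for b in stream_batches(fetch_rows, total_rows=10, batch_size=4)]
print(sizes)  # [4, 4, 2]
```

Peak memory here is one batch, independent of `total_rows`, which is why the same training loop works for 10,000 rows or 10 million.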
Tips¶
Iterate on a smaller model first
Test your pipeline with unsloth/Qwen3-4B before moving to the 32B. It trains much faster and fits on any GPU:
MODEL_ID = "unsloth/Qwen3-4B"
Scale to A100 for larger models or longer sequences
For full-precision LoRA or max-length sequences, switch to A100:
gpu="A100-80GB"  # in the @app.function decorator
Monitor training
Logs stream to your terminal in real time:
modal app logs deeplake-finetune
Or check the Modal dashboard for GPU utilization and logs.
What to try next¶
- GPU-Streaming Pipeline: more details on Deeplake's PyTorch streaming
- E2B Sandboxes: persistent storage for ephemeral sandboxes
- Retrieval to Training: use vector search to curate training sets