Tags: Modal, Serverless ML, GPU, Python, Machine Learning, LLM, Inference, Fine-Tuning, Cloud Infrastructure, MLOps

Modal for Serverless ML — GPU Workloads, Custom Containers, and Scheduled Jobs

A practical guide to Modal for serverless ML infrastructure. It covers the event-driven execution model, where @app.function()-decorated Python functions run in ephemeral cloud containers that provision in seconds and scale to zero between invocations, eliminating per-GPU idle costs. Container images are defined with modal.Image.debian_slim() and Image.from_registry(), chaining pip_install(), apt_install(), and run_commands(); builds are reproducible and each layer is cached, so subsequent deploys skip unchanged layers. GPU functions are configured with the gpu= parameter, which accepts string shortcuts for A10G, A100, H100, L4, and T4, or modal.gpu.A100(size="80GB") and modal.gpu.H100() objects for explicit memory and count control. container_idle_timeout keeps containers warm between requests to avoid repeated model loading, and allow_concurrent_inputs lets a single container serve several requests at once to increase inference throughput. modal.Volume.from_name() provides persistent cross-function filesystem storage for model weights, datasets, and checkpoints, with commit() flushing writes to distributed storage. modal.Cls with @modal.enter() lifecycle hooks loads GPU models once per container rather than per request, and @modal.method() exposes class methods that reuse the loaded model across requests. The @modal.web_endpoint() decorator creates HTTPS API endpoints from any function, with Pydantic request and response models and automatic docs, while @modal.asgi_app() serves a full FastAPI application with middleware, Bearer token authentication, and multiple routes. modal.Cron() with standard five-field cron expressions and modal.Period() with fixed intervals schedule functions as part of the deployed app, without external cron infrastructure. modal.Secret.from_name() and Secret.from_dict() inject credentials and API keys into function environments without hardcoding secrets in code or images. Function.map() submits all inputs at once and runs them across parallel containers, with return_exceptions=True for fault tolerance; Function.starmap() expands argument tuples; and modal.Dict shares key-value state across parallel invocations. The guide closes with a 10-point production checklist covering @modal.enter() model loading strategy, package version pinning for image layer caching, volume.commit() durability semantics, timeout configuration for training jobs, concurrency_limit cost bounding, modal deploy versus modal serve usage, with_options() for staging overrides, modal.Mount for configuration file injection, modal.Dict for distributed coordination state, and structured logging with @modal.exit() Slack alert webhooks for production monitoring.

2026-07-04

Why Modal Over Persistent GPU Instances

The economics of GPU infrastructure follow a painful pattern. A training job runs for two hours per day; the GPU instance runs for 24. An inference endpoint handles bursts of traffic for four hours each afternoon; the instance sits idle for the other 20. At $3.06 per hour for a p3.2xlarge on-demand, 20 idle hours a day cost about $1,800 per month before you write a single line of ML code. The traditional answer (spot instances, autoscaling groups, container orchestration) requires significant infrastructure work that produces no model improvement.
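The idle-cost arithmetic is easy to check. The sketch below assumes a 30-day month and the $3.06 hourly rate quoted above; the function name is illustrative.

```python
def monthly_idle_cost(hourly_rate: float, busy_hours_per_day: float, days: int = 30) -> float:
    """Cost of the hours an always-on instance spends doing nothing."""
    idle_hours_per_day = 24 - busy_hours_per_day
    return round(hourly_rate * idle_hours_per_day * days, 2)


# Training box: busy 2 h/day. Inference box: busy 4 h/day.
training_idle = monthly_idle_cost(3.06, busy_hours_per_day=2)
inference_idle = monthly_idle_cost(3.06, busy_hours_per_day=4)
print(f"training: ${training_idle:,.2f}  inference: ${inference_idle:,.2f}")
```

Under these assumptions the idle bill lands between roughly $1,800 and $2,000 per month per instance.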

Modal is a cloud platform that runs Python functions in ephemeral containers that provision in seconds and scale to zero between invocations. You decorate a function with @app.function(gpu="A10G"), call it with .remote(), and Modal handles provisioning, container startup, GPU allocation, and teardown. You pay only for the seconds the GPU is actually running your code. There are no instances to manage, no Docker registries to maintain, no Kubernetes clusters to operate. The Ray distributed ML guide covers an alternative approach for teams that need persistent multi-node clusters with dynamic resource allocation across heterogeneous hardware, which is a better fit when jobs run continuously rather than in bursts. Modal's strength is the opposite end of the spectrum: irregular, bursty workloads where idle cost dominates.

Zero Infrastructure

No clusters, no Kubernetes, no Docker registries. Define your environment in Python and Modal builds and caches the container image automatically on first run.

Scale to Zero

Containers start in seconds and terminate when idle. GPU compute is billed per second of actual execution — not per hour of instance uptime.

Python-Native

Functions, classes, web endpoints, scheduled jobs, and secrets are all configured via Python decorators. No YAML manifests, no Dockerfiles required.

Installation and Your First Modal Function

Modal requires Python 3.8+ and authenticates via a token stored in your home directory. The CLI generates the token by opening a browser window to the Modal dashboard — no manual credential copying.

pip install modal

# Authenticate — opens a browser to generate an API token
modal token new

# Verify installation
modal --version

A Modal application is a Python file containing a modal.App instance and decorated functions. The @app.local_entrypoint() decorator marks the function that runs when you execute modal run from the CLI — it runs locally and calls remote functions via .remote().

# hello_modal.py
import modal

app = modal.App("hello-modal")

@app.function()
def square(x: int) -> int:
    import time
    time.sleep(0.5)  # simulate work
    return x * x

@app.local_entrypoint()
def main():
    # .remote() runs the function in a Modal cloud container
    result = square.remote(7)
    print(f"7² = {result}")  # 49

    # .map() runs the function in parallel for each input
    results = list(square.map([1, 2, 3, 4, 5]))
    print(results)  # [1, 4, 9, 16, 25]

# Run once as an ephemeral app (executes the local entrypoint, then exits):
modal run hello_modal.py

# Deploy to Modal's cloud for persistent serving:
modal deploy hello_modal.py

# Stream logs from a deployed app:
modal logs hello-modal

Custom Container Images — Building Reproducible ML Environments

Modal builds container images from Python using a fluent builder API. Each builder call — pip_install(), apt_install(), run_commands() — corresponds to a cached layer. If you add a new package to an existing image definition, Modal only re-runs layers from the change point forward; earlier layers are served from cache. This makes iterative ML environment development fast: adding a new pip package to a 10-minute CUDA image takes 30 seconds on the second build.
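One way to picture the "rebuild from the change point forward" behavior is a chained hash over layer definitions. This is a toy model for intuition only, not Modal's actual cache implementation; the layer strings and function names are made up for the example.

```python
import hashlib


def layer_keys(layers: list[str]) -> list[str]:
    """Chain-hash layer definitions: each key depends on every earlier layer."""
    keys, parent = [], ""
    for definition in layers:
        parent = hashlib.sha256((parent + "\n" + definition).encode()).hexdigest()
        keys.append(parent)
    return keys


def layers_to_rebuild(old: list[str], new: list[str]) -> list[str]:
    """Return the layers in `new` whose chained key is missing from the old cache."""
    cached = set(layer_keys(old))
    return [layer for layer, key in zip(new, layer_keys(new)) if key not in cached]


v1 = ["debian_slim(3.11)", "pip_install(torch==2.3.0)", "apt_install(libgomp1)"]
v2 = v1 + ["pip_install(peft==0.11.1)"]                                          # append a layer
v3 = ["debian_slim(3.11)", "pip_install(torch==2.3.1)", "apt_install(libgomp1)"]  # edit layer 2

print(layers_to_rebuild(v1, v2))  # only the appended layer
print(layers_to_rebuild(v1, v3))  # the edited layer and everything after it
```

The practical consequence is ordering: put slow, stable layers first and frequently edited ones last.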

import modal

# Base Debian image with Python and ML packages
ml_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.3.0",
        "transformers==4.41.0",
        "accelerate==0.30.0",
        "bitsandbytes==0.43.1",
        "sentencepiece==0.2.0",
        "protobuf==4.25.3",
    )
    .apt_install("libgomp1")  # OpenMP for quantized inference
)

# Build from a pre-existing Docker Hub image (CUDA base)
cuda_image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
        add_python="3.11",
    )
    .pip_install(
        "torch==2.3.0+cu121",
        "triton==2.3.0",
        extra_index_url="https://download.pytorch.org/whl/cu121",
    )
)

# Build from a local Dockerfile
dockerfile_image = modal.Image.from_dockerfile(
    path="./Dockerfile.ml",
    context_mount=modal.Mount.from_local_dir(
        "./build_context",
        remote_path="/build_context",
    ),
)

# Install a package that requires a build step
vllm_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("vllm==0.5.0", extra_options="--no-build-isolation")
    .run_commands(
        # Compile flash-attention separately to cache it
        "pip install flash-attn==2.5.9 --no-build-isolation"
    )
)

Note

Pin every package version in pip_install() calls. Modal caches image layers based on the exact arguments, so an unpinned dependency stays frozen at whatever version was current when the layer was first built, then silently jumps to the latest release whenever that layer is rebuilt (for example, after an earlier layer changes). Pin at the patch level (transformers==4.41.0, not transformers==4.41) to avoid silent version drift between deploys. Use pip freeze in a local venv after testing to generate pinned versions before committing your image definition.
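A small pre-commit check can catch unpinned requirements before they reach an image definition. This is a sketch with an intentionally simple regular expression (name, optional extras, then ==MAJOR.MINOR.PATCH with an optional local suffix such as +cu121); it is not a full PEP 508 parser, and the helper name is hypothetical.

```python
import re

# name, optional [extras], then an exact ==X.Y.Z pin with an optional suffix
PINNED = re.compile(r"^[A-Za-z0-9_.\-]+(\[[^\]]+\])?==\d+\.\d+\.\d+\S*$")


def unpinned(requirements: list[str]) -> list[str]:
    """Return the requirement strings that are not pinned to a patch version."""
    return [req for req in requirements if not PINNED.match(req)]


reqs = [
    "torch==2.3.0",
    "transformers==4.41",   # minor-level pin only
    "accelerate",           # no pin at all
    "peft>=0.11.1",         # lower bound, not a pin
    "torch==2.3.0+cu121",
]
print(unpinned(reqs))
```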

GPU Functions — Inference and Fine-Tuning

GPU allocation is a single parameter. The gpu= parameter accepts string shortcuts for common tiers or explicit GPU objects for memory and count control. Modal supports T4, L4, A10G, A100 (in 40 GB and 80 GB variants), and H100. The MLOps CI/CD guide covers how to wrap Modal deployments in automated evaluation pipelines, running regression tests against the deployed endpoint before routing production traffic to a new model version. The two layers complement each other: Modal handles infrastructure provisioning, and the CI pipeline handles promotion gating.
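When choosing a tier, a rough first filter is whether the model weights fit in VRAM. The sketch below uses the published memory sizes of each GPU and an assumed 20% headroom for activations and KV cache; the headroom factor, dictionary keys, and function names are illustrative assumptions, not Modal guidance.

```python
# VRAM per GPU type in GB (vendor-published capacities).
GPU_VRAM_GB = {"T4": 16, "L4": 24, "A10G": 24, "A100-40GB": 40, "A100-80GB": 80, "H100": 80}


def weight_memory_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone; 2 bytes per parameter for fp16/bf16."""
    return params_billions * bytes_per_param


def gpus_that_fit(params_billions: float, bytes_per_param: int = 2, headroom: float = 1.2) -> list[str]:
    """GPU types whose VRAM covers the weights plus the assumed headroom."""
    needed = weight_memory_gb(params_billions, bytes_per_param) * headroom
    return [name for name, vram in GPU_VRAM_GB.items() if vram >= needed]


print(gpus_that_fit(7))                     # 7B model in fp16
print(gpus_that_fit(7, bytes_per_param=1))  # same model quantized to 8-bit
print(gpus_that_fit(70))                    # 70B in fp16 does not fit on one GPU
```

Training needs considerably more than this estimate because of gradients and optimizer state, which is why the fine-tuning example below asks for an 80 GB card.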

app = modal.App("gpu-inference")

inference_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.3.0",
        "transformers==4.41.0",
        "accelerate==0.30.0",
    )
)

# Simple GPU function — provisions A10G, runs function, terminates
@app.function(
    image=inference_image,
    gpu="A10G",          # 24 GB VRAM, billed per second
    timeout=300,         # seconds before Modal kills the container
    retries=2,           # retry on transient failures
)
def run_inference(prompt: str) -> str:
    from transformers import pipeline
    import torch

    pipe = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.2",
        torch_dtype=torch.float16,
        device_map="cuda:0",
    )
    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
    )
    return outputs[0]["generated_text"]


# Fine-tuning function — needs larger GPU and more time
training_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.3.0",
        "transformers==4.41.0",
        "accelerate==0.30.0",
        "bitsandbytes==0.43.1",
        "peft==0.11.1",
        "datasets==2.20.0",
        "trl==0.9.4",
    )
)

@app.function(
    image=training_image,
    gpu=modal.gpu.A100(size="80GB"),  # 80 GB A100 for large model fine-tuning
    timeout=7200,                     # 2 hours for training run
    secrets=[modal.Secret.from_name("huggingface-token")],
)
def fine_tune_llm(dataset_name: str, output_path: str) -> dict:
    import os
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from peft import LoraConfig, get_peft_model
    from datasets import load_dataset
    from trl import SFTTrainer

    token = os.environ["HF_TOKEN"]

    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",
        token=token,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )

    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()

    dataset = load_dataset(dataset_name, split="train")

    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            output_dir=output_path,
            num_train_epochs=3,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            learning_rate=2e-4,
            bf16=True,  # match the bfloat16 dtype the model was loaded with
            logging_steps=10,
            save_steps=100,
        ),
    )
    trainer.train()

    return {"status": "complete", "output_path": output_path}

Persistent Storage with Modal Volumes

Modal functions are stateless by design: the container filesystem is ephemeral. modal.Volume provides a persistent network filesystem that mounts at a specified path inside the container. Volumes are backed by Modal's distributed storage and are accessible from any function in the app. The canonical pattern is to download model weights once into a Volume in a dedicated preparation function, then mount the same Volume in inference functions to avoid re-downloading multi-gigabyte models on every cold start.

# Create a named volume — persists across function runs
model_volume = modal.Volume.from_name(
    "llm-model-weights",
    create_if_missing=True,
)

@app.function(
    image=inference_image,
    volumes={"/models": model_volume},
    timeout=3600,
    secrets=[modal.Secret.from_name("huggingface-token")],
)
def download_model_weights(model_id: str):
    """Run once to populate the volume — not called on every inference."""
    import os
    import subprocess

    subprocess.run(
        [
            "huggingface-cli", "download",
            model_id,
            "--local-dir", f"/models/{model_id.replace('/', '--')}",
            "--token", os.environ["HF_TOKEN"],
        ],
        check=True,
    )
    # IMPORTANT: commit() makes writes visible to other containers
    model_volume.commit()
    print(f"Downloaded {model_id} to volume")


@app.function(
    image=inference_image,
    gpu="A10G",
    volumes={"/models": model_volume},  # same volume; inference only reads from it
    timeout=120,
)
def infer_from_volume(prompt: str) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    model_path = "/models/mistralai--Mistral-7B-Instruct-v0.2"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="cuda:0",
    )

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


@app.local_entrypoint()
def prepare():
    # Run this once before deploying the inference function
    download_model_weights.remote("mistralai/Mistral-7B-Instruct-v0.2")

Note

volume.commit() is required after any write operation to flush changes from the container's local overlay filesystem to Modal's distributed storage. Writes that are not committed are discarded when the container terminates. A container sees the Volume as of the last commit at the time it started; a container that is already running picks up later commits only after calling volume.reload(). For high-frequency write workloads, batch writes and commit less frequently rather than committing after every file write, since commit() incurs a network round-trip.
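The visibility rules are easier to reason about with a toy model. The classes below are an in-memory illustration only, written under the assumption that each container works on a snapshot plus a local overlay; they are not Modal's implementation, and ToyVolume and ToyMount are made-up names. Modal's Volume does expose commit() and reload() methods with the same intent.

```python
class ToyVolume:
    """Shared storage: holds only committed files."""

    def __init__(self):
        self.committed: dict[str, bytes] = {}


class ToyMount:
    """One container's view: a snapshot of the volume plus uncommitted local writes."""

    def __init__(self, volume: ToyVolume):
        self.volume = volume
        self.snapshot = dict(volume.committed)
        self.pending: dict[str, bytes] = {}

    def write(self, path: str, data: bytes) -> None:
        self.pending[path] = data  # stays local until commit()

    def read(self, path: str):
        return self.pending.get(path, self.snapshot.get(path))

    def commit(self) -> None:
        self.volume.committed.update(self.pending)
        self.pending.clear()
        self.snapshot = dict(self.volume.committed)

    def reload(self) -> None:
        self.snapshot = dict(self.volume.committed)


volume = ToyVolume()
writer, reader = ToyMount(volume), ToyMount(volume)

writer.write("/models/weights.bin", b"v1")
before_commit = reader.read("/models/weights.bin")  # None: write is still local to the writer
writer.commit()
before_reload = reader.read("/models/weights.bin")  # None: reader holds an older snapshot
reader.reload()
after_reload = reader.read("/models/weights.bin")   # b"v1"
```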

Class-Based Functions with Lifecycle Hooks

The biggest latency cost in LLM inference is model loading: reading multi-gigabyte weights from disk and moving them to GPU memory. With a plain @app.function() that loads the model inside the function body, as in the examples above, that load runs on every call. For inference serving, this is unacceptable: a 7B model takes 15–30 seconds to load, and every request pays it.

modal.Cls solves this with lifecycle hooks. The @modal.enter() decorator marks a method that runs once per container initialization, before the first request is served. Model weights load in @modal.enter(), then the same loaded model is reused across all requests handled by that container. Combined with container_idle_timeout to keep containers alive between requests and allow_concurrent_inputs to let one container serve several requests at once, this pattern gives inference latency comparable to a persistent server while still scaling to zero during idle periods.
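To get a feel for how the idle timeout trades cost against cold starts, it helps to replay a request log against different timeout values. The sketch below is a deliberate simplification: it assumes a single container, ignores request duration, and the arrival times are invented for the example.

```python
def count_cold_starts(arrival_times: list[float], idle_timeout: float) -> int:
    """Count requests that find no warm container.

    arrival_times: request times in seconds, sorted ascending.
    A request is cold if it is the first one, or if the gap since the
    previous request exceeds idle_timeout (the container has scaled down).
    """
    cold = 0
    last_seen = None
    for t in arrival_times:
        if last_seen is None or t - last_seen > idle_timeout:
            cold += 1
        last_seen = t
    return cold


arrivals = [0, 30, 200, 900, 950, 2000]  # seconds; hypothetical traffic
for timeout in (60, 300, 1200):
    print(timeout, count_cold_starts(arrivals, timeout))
```

With this trace, raising the timeout from 60 to 300 seconds removes one cold start; whether that is worth the extra warm GPU minutes depends on how much each cold start costs your users.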

from pydantic import BaseModel

class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7

@app.cls(
    image=inference_image,
    gpu="A10G",
    volumes={"/models": model_volume},
    container_idle_timeout=300,    # keep container warm for 5 minutes after last request
    allow_concurrent_inputs=8,     # serve up to 8 concurrent requests per container
    concurrency_limit=5,           # cap at 5 parallel containers for cost control
)
class LlamaInference:

    @modal.enter()
    def load_model(self):
        """Runs once when the container starts — model stays loaded for all requests."""
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        model_path = "/models/mistralai--Mistral-7B-Instruct-v0.2"
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="cuda:0",
        )
        self.model.eval()
        print("Model loaded and ready")

    @modal.method()
    def generate(self, request: GenerateRequest) -> str:
        import torch

        inputs = self.tokenizer(
            request.prompt,
            return_tensors="pt",
            truncation=True,
            max_length=2048,
        ).to("cuda")

        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                do_sample=request.temperature > 0,
                pad_token_id=self.tokenizer.eos_token_id,
            )

        generated = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(generated, skip_special_tokens=True)

    @modal.exit()
    def cleanup(self):
        """Optional: runs when container is shutting down."""
        import torch
        del self.model
        torch.cuda.empty_cache()


# Calling from a local entrypoint or another Modal function
@app.local_entrypoint()
def test_inference():
    model = LlamaInference()
    result = model.generate.remote(
        GenerateRequest(prompt="Explain transformers in one paragraph.")
    )
    print(result)

Web Endpoints and API Serving

Modal can expose any function as an HTTPS endpoint. The @modal.web_endpoint() decorator creates a simple JSON endpoint from a regular function with automatic request/response serialization via Pydantic. For more complex APIs with middleware, authentication, and multiple routes, @modal.asgi_app() returns a FastAPI application that Modal serves as a full ASGI server.

from pydantic import BaseModel
from typing import Optional

class EmbedRequest(BaseModel):
    texts: list[str]
    model: str = "BAAI/bge-base-en-v1.5"
    normalize: bool = True

class EmbedResponse(BaseModel):
    embeddings: list[list[float]]
    model: str
    dimensions: int

embedding_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("sentence-transformers==3.0.1", "torch==2.3.0")
)

@app.cls(
    image=embedding_image,
    gpu="T4",
    container_idle_timeout=180,
    allow_concurrent_inputs=16,
)
class EmbeddingService:

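    @modal.enter()
    def load(self):
        # Preload the default model so the first request does not pay the load cost
        from sentence_transformers import SentenceTransformer
        default_model = "BAAI/bge-base-en-v1.5"
        self.models: dict = {default_model: SentenceTransformer(default_model)}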

    def _get_model(self, model_name: str):
        if model_name not in self.models:
            from sentence_transformers import SentenceTransformer
            self.models[model_name] = SentenceTransformer(model_name)
        return self.models[model_name]

    @modal.web_endpoint(method="POST", docs=True)
    def embed(self, request: EmbedRequest) -> EmbedResponse:
        model = self._get_model(request.model)
        embeddings = model.encode(
            request.texts,
            normalize_embeddings=request.normalize,
            batch_size=64,
        )
        return EmbedResponse(
            embeddings=embeddings.tolist(),
            model=request.model,
            dimensions=embeddings.shape[1],
        )

# Full FastAPI app with auth and multiple routes
from fastapi import FastAPI, HTTPException, Depends, Header
from fastapi.middleware.cors import CORSMiddleware
import os

def create_app() -> FastAPI:
    web_app = FastAPI(title="ML Inference API", version="1.0.0")

    web_app.add_middleware(
        CORSMiddleware,
        allow_origins=["https://yourdomain.com"],
        allow_methods=["POST"],
        allow_headers=["Authorization", "Content-Type"],
    )

    async def verify_token(authorization: str = Header(...)):
        expected = f"Bearer {os.environ['API_KEY']}"
        if authorization != expected:
            raise HTTPException(status_code=401, detail="Invalid token")

    @web_app.post("/v1/completions", dependencies=[Depends(verify_token)])
    async def completions(request: GenerateRequest):
        # Call internal Modal function — runs in same container
        return {"text": "..."}

    @web_app.get("/health")
    async def health():
        return {"status": "ok"}

    return web_app


@app.function(
    image=inference_image,
    gpu="A10G",
    secrets=[modal.Secret.from_name("api-secrets")],
    container_idle_timeout=300,
    allow_concurrent_inputs=10,
)
@modal.asgi_app()
def fastapi_inference():
    return create_app()

# Deploying and accessing the endpoint
modal deploy inference_api.py

# Modal outputs the HTTPS URL after deployment:
# ✓ Created web endpoint https://yourorg--app-name-fastapi-inference.modal.run

# Test the endpoint
curl -X POST https://yourorg--app-name-fastapi-inference.modal.run/v1/completions \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarize this document:", "max_tokens": 128}'
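The verify_token dependency above compares the header with a plain string inequality. A common hardening step is a constant-time comparison, which avoids leaking how many leading characters matched. The helper below is a sketch of that idea using only the standard library; its name and signature are illustrative, and it would be called from inside the dependency.

```python
import hmac


def bearer_token_is_valid(authorization_header: str, expected_key: str) -> bool:
    """Return True only for 'Bearer <expected_key>', compared in constant time."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme != "Bearer" or not token:
        return False
    return hmac.compare_digest(token.encode(), expected_key.encode())


print(bearer_token_is_valid("Bearer sk_test_123", "sk_test_123"))
print(bearer_token_is_valid("Bearer wrong", "sk_test_123"))
```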

Scheduled Jobs and Cron

Modal's scheduling system replaces external cron infrastructure for ML workflows. A schedule=modal.Cron("0 2 * * *") parameter fires the function daily at 2 AM UTC with the same GPU and container configuration as on-demand calls. modal.Period(hours=6) fires at a fixed interval. Both are managed as part of the deployed app, so no separate cron server, Lambda trigger, or Cloud Scheduler configuration is required. The ML pipeline MLflow guide covers how to track scheduled training runs by logging parameters, metrics, and artifacts to the MLflow experiment registry from within a Modal function, for reproducibility and experiment comparison. Modal handles the compute provisioning while MLflow handles the experiment ledger.
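For readers less familiar with cron syntax, the five fields are minute, hour, day of month, month, and day of week. The sketch below labels the fields of an expression and computes the next firing of a simple daily schedule in UTC. It handles only the "fixed minute and hour, every day" case and is not a general cron parser; the function names are illustrative.

```python
from datetime import datetime, timedelta, timezone

CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression: str) -> dict[str, str]:
    """Label the five whitespace-separated fields of a cron expression."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(f"expected 5 fields, got {len(parts)}")
    return dict(zip(CRON_FIELDS, parts))


def next_daily_run(now: datetime, hour: int, minute: int = 0) -> datetime:
    """Next firing of a daily schedule such as '0 2 * * *'."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


fields = describe_cron("0 2 * * *")
now = datetime(2026, 7, 4, 3, 15, tzinfo=timezone.utc)
print(fields)
print(next_daily_run(now, hour=int(fields["hour"]), minute=int(fields["minute"])))
```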

import modal

app = modal.App("ml-scheduled-jobs")

training_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.3.0",
        "transformers==4.41.0",
        "accelerate==0.30.0",
        "mlflow==2.14.1",
        "datasets==2.20.0",
        "peft==0.11.1",
        "trl==0.9.4",
    )
)

model_volume = modal.Volume.from_name("ml-model-weights", create_if_missing=True)
checkpoint_volume = modal.Volume.from_name("training-checkpoints", create_if_missing=True)

# Nightly fine-tuning: runs daily at 2 AM UTC on an A100
@app.function(
    image=training_image,
    gpu=modal.gpu.A100(size="80GB"),
    volumes={"/models": model_volume, "/checkpoints": checkpoint_volume},
    schedule=modal.Cron("0 2 * * *"),
    timeout=10800,  # 3 hours max
    secrets=[
        modal.Secret.from_name("huggingface-token"),
        modal.Secret.from_name("mlflow-credentials"),
    ],
)
def nightly_fine_tuning():
    import os
    import mlflow
    from datasets import load_dataset

    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])

    with mlflow.start_run(run_name="nightly-finetune"):
        dataset = load_dataset("your-org/daily-feedback", split="train")
        mlflow.log_param("dataset_size", len(dataset))

        # ... training logic that writes /checkpoints/model_weights.bin ...

        mlflow.log_metric("train_loss", 0.42)  # placeholder; log the real loss here
        mlflow.log_artifact("/checkpoints/model_weights.bin")

    # Persist the checkpoint: commit the volume that is mounted at /checkpoints
    checkpoint_volume.commit()


# Hourly embedding refresh — runs every hour on a T4
@app.function(
    image=(
        modal.Image.debian_slim(python_version="3.11")
        .pip_install("sentence-transformers==3.0.1", "torch==2.3.0", "psycopg2-binary==2.9.9")
    ),
    gpu="T4",
    schedule=modal.Period(hours=1),
    timeout=1800,
    secrets=[modal.Secret.from_name("database-credentials")],
)
def refresh_product_embeddings():
    import os
    import psycopg2
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-base-en-v1.5")

    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    cur = conn.cursor()

    # Fetch products updated since last run
    cur.execute("""
        SELECT id, title || ' ' || description AS text
        FROM products
        WHERE updated_at > NOW() - INTERVAL '2 hours'
        AND embedding_refreshed_at < updated_at
    """)
    rows = cur.fetchall()

    if not rows:
        print("No products to re-embed")
        return

    ids = [r[0] for r in rows]
    texts = [r[1] for r in rows]
    embeddings = model.encode(texts, normalize_embeddings=True)

    for pid, emb in zip(ids, embeddings):
        cur.execute(
            "UPDATE products SET embedding = %s, embedding_refreshed_at = NOW() WHERE id = %s",
            (emb.tolist(), pid),
        )
    conn.commit()
    print(f"Re-embedded {len(ids)} products")

Note

Scheduled functions deployed with modal deploy run even when no local process is running. The scheduler does not make up for failed runs: if a nightly job errors, that run is lost and nothing happens until the next scheduled time, unless the function sets retries. Build idempotent scheduled functions that can be re-run manually via modal run without duplicating side effects, and add alerting, for example a try/except in the function body that posts to a Slack webhook on failure or, for class-based functions, a @modal.exit() hook that reports on shutdown.
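One simple way to make a daily job idempotent is to key each run by its date and record the key only after the work succeeds. The sketch below keeps the completed keys in an in-memory set purely for illustration; in a real deployment that record would live in something persistent (a database table or a modal.Dict, for instance). The function and variable names are hypothetical.

```python
from datetime import date
from typing import Callable


def run_once_per_day(job: Callable[[date], None], completed: set[str], today: date) -> bool:
    """Run job(today) unless today's key is already recorded. Returns True if it ran."""
    key = today.isoformat()
    if key in completed:
        return False
    job(today)
    completed.add(key)  # recorded only after success, so a failed run can be retried
    return True


processed: list[date] = []
completed_keys: set[str] = set()

first = run_once_per_day(processed.append, completed_keys, date(2026, 7, 4))
second = run_once_per_day(processed.append, completed_keys, date(2026, 7, 4))  # manual re-run
print(first, second, processed)
```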

Secrets and Environment Configuration

Modal secrets are key-value stores managed through the Modal dashboard or CLI. A secret is referenced by name and injected into the container environment at runtime. Functions declare which secrets they need in the secrets= parameter — Modal resolves and injects them before the container starts. Secrets can be rotated in the Modal dashboard without redeploying the function, since they are resolved at container startup time rather than at build time.
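Because secrets arrive as environment variables, a misnamed or missing secret usually surfaces as a KeyError deep inside the function. A small guard at the top of the function fails fast with a clearer message. This is a generic sketch, not a Modal API; the helper name is made up, and the env parameter exists only to make the function easy to test.

```python
import os
from typing import Mapping, Optional


def require_env(names: list[str], env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Return the requested variables, or raise one error listing every missing name."""
    source = os.environ if env is None else env
    missing = [name for name in names if not source.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: source[name] for name in names}


# Inside a Modal function you would call require_env(["HF_TOKEN", "WANDB_API_KEY"]).
fake_env = {"HF_TOKEN": "hf_example", "WANDB_API_KEY": ""}
try:
    require_env(["HF_TOKEN", "WANDB_API_KEY"], env=fake_env)
except RuntimeError as err:
    message = str(err)
    print(message)
```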

# Create secrets via CLI
modal secret create huggingface-token HF_TOKEN=hf_xxxxxxxxxxxx
modal secret create wandb-credentials WANDB_API_KEY=xxxxxxxxxx WANDB_PROJECT=my-project
modal secret create database-credentials DATABASE_URL=postgresql://user:pass@host:5432/db
modal secret create api-secrets API_KEY=sk_xxxxxxxxxxxxxxxx

# Or create from a .env file
modal secret create ml-secrets --from-dotenv .env.production

# Reference secrets in function definitions
@app.function(
    image=training_image,
    gpu="A100",
    secrets=[
        modal.Secret.from_name("huggingface-token"),
        modal.Secret.from_name("wandb-credentials"),
        modal.Secret.from_name("database-credentials"),
    ],
)
def train_with_tracking():
    import os
    import wandb

    # Secrets are available as environment variables
    hf_token = os.environ["HF_TOKEN"]
    wandb.login(key=os.environ["WANDB_API_KEY"])

    # huggingface_hub will automatically pick up HF_TOKEN
    from huggingface_hub import login
    login(token=hf_token)

    # ... rest of training logic ...


# Build a secret programmatically (for dynamic credentials).
# Secret.from_dict() only constructs a Secret object in the calling process;
# nothing is saved to the dashboard. It takes effect when the returned object
# is passed in the secrets=[...] list of the function that needs it.
def make_temp_aws_secret() -> modal.Secret:
    import boto3

    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789:role/modal-access",
        RoleSessionName="modal-session",
    )["Credentials"]

    return modal.Secret.from_dict({
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    })

Parallel Map for Batch Workloads

Function.map() submits all inputs to Modal simultaneously and runs them in parallel across as many containers as Modal can provision. This is the primary pattern for dataset embedding, batch evaluation, and offline inference — workloads that process a large corpus and do not have sequential dependencies. Each item in the iterable maps to one function call; Modal schedules them across containers up to the function's concurrency_limit.

app = modal.App("batch-embedding")

embed_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("sentence-transformers==3.0.1", "torch==2.3.0")
)

@app.function(
    image=embed_image,
    gpu="T4",
    concurrency_limit=20,           # run up to 20 GPU containers in parallel
    allow_concurrent_inputs=4,      # each container handles 4 concurrent requests
)
def embed_batch(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    import torch

    model = SentenceTransformer(
        "BAAI/bge-large-en-v1.5",
        device="cuda" if torch.cuda.is_available() else "cpu",
    )
    return model.encode(texts, normalize_embeddings=True).tolist()


@app.local_entrypoint()
def embed_corpus():
    import json

    # Load 50,000 documents
    with open("documents.jsonl") as f:
        documents = [json.loads(line)["text"] for line in f]

    # Split into batches of 128 texts
    batch_size = 128
    batches = [
        documents[i : i + batch_size]
        for i in range(0, len(documents), batch_size)
    ]
    print(f"Processing {len(documents)} documents in {len(batches)} batches")

    # map() submits all batches concurrently — Modal provisions containers as needed
    # return_exceptions=True prevents one failed batch from killing the whole run
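    # zip() pairs each result with its input batch (map() yields results in input order)
    all_embeddings = []
    for batch, result in zip(batches, embed_batch.map(batches, return_exceptions=True)):
        if isinstance(result, Exception):
            print(f"Batch failed: {result}")
            # One placeholder per text; the final batch may be smaller than batch_size
            all_embeddings.extend([[] for _ in batch])
        else:
            all_embeddings.extend(result)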

    print(f"Embedded {len(all_embeddings)} documents")

    # starmap() for functions with multiple arguments
    # text_model_pairs = [("text1", "model-a"), ("text2", "model-b"), ...]
    # results = list(embed_with_model.starmap(text_model_pairs))
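It can help to size a batch job before launching it. The sketch below works out the batch count, the size of the final batch, and how many "waves" of work the container limits imply, using the numbers from the example above (50,000 documents, batches of 128, 20 containers, 4 concurrent inputs each). It assumes every concurrency slot is filled evenly, which real scheduling will only approximate; the function name is illustrative.

```python
import math


def batch_plan(num_items: int, batch_size: int, max_containers: int, inputs_per_container: int = 1) -> dict:
    """Rough sizing for a map() job under container and concurrency limits."""
    num_batches = math.ceil(num_items / batch_size)
    last_batch_size = num_items - (num_batches - 1) * batch_size
    slots = max_containers * inputs_per_container
    return {
        "num_batches": num_batches,
        "last_batch_size": last_batch_size,
        "parallel_slots": slots,
        "waves": math.ceil(num_batches / slots),
    }


plan = batch_plan(50_000, batch_size=128, max_containers=20, inputs_per_container=4)
print(plan)
```

Note that the final batch holds 80 texts rather than 128, which matters whenever results are padded or re-aligned per batch.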

Production Checklist

1. Use @modal.enter() for all model loading; never load models inside @modal.method(). Loading a 7B model inside the method body means every request pays the full 15–30 second load penalty. With @modal.enter(), the model loads once per container startup and is reused across all requests that container serves. Set container_idle_timeout to a value that reflects your traffic pattern: 60s for low-traffic endpoints, 300s for moderate traffic, up to 3600s for always-warm serving.

2. Pin every package version in image definitions. Modal images are content-addressed: the same pip_install() call with the same version strings always resolves to the same cached layer. Unpinned dependencies like transformers or torch can resolve to different versions on different build dates, so a rebuilt layer may behave subtly differently from the one it replaces. Use pip freeze in a local venv that matches your Modal image's Python version to generate pinned requirements before committing the image definition.

3. Always call volume.commit() after writing files to a Volume. Writes to a mounted Volume go to a container-local overlay filesystem and are not visible to other containers until commit() is called. A training job that writes a checkpoint but exits or crashes before commit() leaves no persistent output, even if its training logs looked healthy. Add commit() calls at logical checkpoints during training, not just at the end, so a crashed run preserves intermediate progress.

4. Set timeout explicitly for long-running functions. Modal's default timeout is 300 seconds, which is enough for inference but not for training. A fine-tuning run that takes 4 hours (14400 seconds) will be killed at 5 minutes if you forget to set timeout. Set timeout to 150% of your expected worst-case runtime, not average runtime; for a 4-hour worst case that is timeout=21600. Functions approaching the timeout limit can be designed to checkpoint state to a Volume and restart from the last checkpoint rather than running the full job in one invocation.

5

Use concurrency_limit to bound costs for GPU workloads. Without a limit, a sudden traffic spike can provision hundreds of GPU containers simultaneously, each incurring per-second GPU billing. Set concurrency_limit to a value you can afford: for a web endpoint on A10G at $1.10/hr, 10 concurrent containers cost at most $11/hr during peak, which keeps spend predictable and within budget. Excess requests queue rather than fail, because Modal's scheduler holds them until a container becomes available.
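The cost ceiling above can be worked in both directions: from a limit to a worst-case hourly bill, and from a budget to the largest limit it supports. The helper names below are hypothetical, and prices are kept in integer cents to avoid floating-point rounding; the $1.10/hr rate is the figure quoted above, not a value looked up from Modal's pricing.

```python
def max_hourly_cost_cents(rate_cents_per_hour: int, container_limit: int) -> int:
    """Worst-case hourly spend when every allowed container is running."""
    return rate_cents_per_hour * container_limit


def affordable_limit(rate_cents_per_hour: int, budget_cents_per_hour: int) -> int:
    """Largest container limit whose worst-case spend fits the hourly budget."""
    if rate_cents_per_hour <= 0:
        raise ValueError("rate_cents_per_hour must be positive")
    return budget_cents_per_hour // rate_cents_per_hour
```

At 110 cents per hour, a limit of 10 gives a ceiling of 1100 cents ($11/hr), and a $5/hr budget supports a limit of 4.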

6

Use modal serve for development iteration, modal deploy for production. modal serve starts the app with a file watcher that rebuilds and redeploys on code changes — ideal for rapid iteration on inference logic. Avoid serving production traffic through modal serve; it terminates when you close the terminal. modal deploy creates a persistent deployment that survives terminal closure, restarts on failure, and runs scheduled jobs.

7

Separate environment-specific configuration with Function.with_options(). Rather than maintaining separate app files for staging and production, define functions with production defaults and override with .with_options() for staging: staging_fn = production_fn.with_options(gpu='T4', concurrency_limit=2). This avoids drift between staging and production function definitions while still allowing resource tier differences. Modal documents with_options() on modal.Cls, so confirm that it is available on plain functions in your installed version before relying on this pattern.

8

Mount local files into containers with modal.Mount rather than hardcoding paths. Configuration files, prompt templates, and tokenizer vocabularies that live in your repository can be mounted into the container without rebuilding the image: @app.function(mounts=[modal.Mount.from_local_file('./prompts.yaml', remote_path='/app/prompts.yaml')]). This separates environment (image) from configuration (mount), enabling prompt updates without image rebuilds.

9

Use modal.Dict for distributed state when functions need to coordinate. modal.Dict is a distributed key-value store accessible from any function in the app. It is useful for deduplication state in parallel map() workloads, rate limiting counters, and progress tracking across concurrent containers. Unlike a database, it requires no connection management and is accessible with a single line: results_dict = modal.Dict.from_name('job-results', create_if_missing=True).

10

Monitor deployed endpoints with modal logs and structured logging. modal logs --follow <app-name> streams live logs from all containers to your terminal during development. In production, add structured JSON logging to functions so log aggregation systems can parse fields: json.dumps({'event': 'request', 'latency_ms': elapsed, 'tokens': count, 'model': model_name}). Set up Slack alerts by calling a webhook from a @modal.exit() handler. Because that handler runs whenever a container shuts down, not only after a failure, have it check whether an unhandled error was recorded before posting.
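One way to get one-JSON-object-per-line output is a custom formatter on the standard logging module. This is a minimal sketch, assuming nothing about Modal itself: JsonFormatter, make_logger, and the convention of passing extra={"fields": {...}} are choices made for this example, and the field values in the usage section are placeholders.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            "ts": round(record.created, 3),
        }
        # Fields passed via extra={"fields": {...}} are merged at top level.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


def make_logger(name: str = "inference") -> logging.Logger:
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger


if __name__ == "__main__":
    log = make_logger()
    # Placeholder values; in a real function these come from the request.
    log.info("request", extra={"fields": {"latency_ms": 42, "tokens": 128, "model": "demo"}})
```

Keeping the structured fields in a single dictionary means the call site stays close to the json.dumps() form shown above while the formatter handles timestamps and levels consistently.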

Do your GPU instances sit at 5% utilization between training jobs because you keep them alive to avoid cold-start latency? Does your ML team deploy model updates by SSHing into servers and modifying files in place, with no version control or reproducibility? Does your inference endpoint cost $3,000 per month even though peak traffic lasts only 4 hours per day?

We design and implement Modal-based ML infrastructure end to end. That covers image definition with a pip_install() layer-caching strategy for fast iterative deploys; gpu= tier selection across T4, A10G, A100, and H100 for cost-to-performance optimization; modal.Cls architecture with @modal.enter() model loading and allow_concurrent_inputs batching for warm-container inference serving; modal.Volume setup for persistent model weight storage with commit() durability and cross-function access patterns; @modal.web_endpoint() and @modal.asgi_app() FastAPI serving with Bearer token authentication and CORS configuration; modal.Cron and modal.Period scheduled jobs for nightly fine-tuning and periodic embedding refresh runs; modal.Secret wiring for HuggingFace, WandB, and database credentials with dashboard rotation without redeployment; Function.map() parallel processing pipelines for dataset embedding and batch offline inference with return_exceptions fault tolerance; container_idle_timeout and concurrency_limit tuning for throughput versus cost trade-offs; and modal logs structured logging with @modal.exit() Slack webhook integration for production monitoring and failure alerting. Let's talk.


DataSOps Consulting

Need help implementing this in production?

We build and operate data pipelines, AI systems, and observability stacks for engineering teams. Reach out for a free 30-minute architecture review.