Why Modal Over Persistent GPU Instances
The economics of GPU infrastructure follow a painful pattern. A training job runs for two hours per day; the GPU instance runs for 24. An inference endpoint handles bursts of traffic for four hours each afternoon; the instance sits idle for the other 20. At $3.06 per hour for an on-demand p3.2xlarge, 20 to 22 idle hours a day cost roughly $1,800 to $2,000 per month before you write a single line of ML code. The traditional answers (spot instances, autoscaling groups, container orchestration) require significant infrastructure work that produces no model improvement.
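The idle-cost arithmetic can be checked in a few lines. This is a minimal sketch that assumes a 30-day month and the hourly rate quoted above; the function and variable names are illustrative only.

```python
HOURLY_RATE_USD = 3.06   # on-demand p3.2xlarge rate quoted in the text
DAYS_PER_MONTH = 30      # assumption: a 30-day billing month


def monthly_idle_cost(busy_hours_per_day: float) -> float:
    """Cost of the hours an always-on instance spends idle each month."""
    idle_hours = 24 - busy_hours_per_day
    return round(idle_hours * DAYS_PER_MONTH * HOURLY_RATE_USD, 2)


training_idle = monthly_idle_cost(2)    # training job busy 2 hours per day
inference_idle = monthly_idle_cost(4)   # endpoint busy 4 hours per day
print(f"training: ${training_idle:,.2f}  inference: ${inference_idle:,.2f}")
```

Under these assumptions the idle bill is driven almost entirely by how few hours the GPU is actually busy, which is the case per-second billing is meant to address.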
Modal is a cloud platform that runs Python functions in ephemeral containers that provision in seconds and scale to zero between invocations. You decorate a function with @app.function(gpu="A10G"), call it with .remote(), and Modal handles provisioning, container startup, GPU allocation, and teardown. You pay only for the seconds the GPU is actually running your code. There are no instances to manage, no Docker registries to maintain, no Kubernetes clusters to operate. The Ray distributed ML guide covers an alternative approach for teams that need persistent multi-node clusters with dynamic resource allocation across heterogeneous hardware, which is a better fit when jobs run continuously rather than in bursts. Modal's strength is the opposite end of the spectrum: irregular, bursty workloads where idle cost dominates.
Zero Infrastructure
No clusters, no Kubernetes, no Docker registries. Define your environment in Python and Modal builds and caches the container image automatically on first run.
Scale to Zero
Containers start in seconds and terminate when idle. GPU compute is billed per second of actual execution — not per hour of instance uptime.
Python-Native
Functions, classes, web endpoints, scheduled jobs, and secrets are all configured via Python decorators. No YAML manifests, no Dockerfiles required.
Installation and Your First Modal Function
Modal requires Python 3.8+ and authenticates via a token stored in your home directory. The CLI generates the token by opening a browser window to the Modal dashboard — no manual credential copying.
pip install modal
# Authenticate — opens a browser to generate an API token
modal token new
# Verify installation
modal --version
A Modal application is a Python file containing a modal.App instance and decorated functions. The @app.local_entrypoint() decorator marks the function that runs when you execute modal run from the CLI: it runs locally and calls remote functions via .remote().
# hello_modal.py
import modal

app = modal.App("hello-modal")


@app.function()
def square(x: int) -> int:
    import time

    time.sleep(0.5)  # simulate work
    return x * x


@app.local_entrypoint()
def main():
    # .remote() runs the function in a Modal cloud container
    result = square.remote(7)
    print(f"7² = {result}")  # 49
    # .map() runs the function in parallel for each input
    results = list(square.map([1, 2, 3, 4, 5]))
    print(results)  # [1, 4, 9, 16, 25]

# Run once as an ephemeral app (for hot reload on file changes, use modal serve):
modal run hello_modal.py
# Deploy to Modal's cloud for persistent serving:
modal deploy hello_modal.py
# Stream logs from a deployed app:
modal logs hello-modal
Custom Container Images: Building Reproducible ML Environments
Modal builds container images from Python using a fluent builder API. Each builder call (pip_install(), apt_install(), run_commands()) corresponds to a cached layer. If you change an image definition, Modal only re-runs layers from the change point forward; earlier layers are served from cache. This makes iterative ML environment development fast: appending a new pip_install() call to an image whose CUDA layers took 10 minutes to build can finish in around 30 seconds, because only the new layer is built.
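The "rebuild from the change point forward" rule can be pictured with a toy model. The sketch below is a simplification in plain Python, not Modal's actual cache implementation; the layer strings and the function name are made up for illustration.

```python
def layers_to_rebuild(cached: list[str], requested: list[str]) -> list[str]:
    """Return the layers that must be rebuilt, assuming a layer is reusable
    only if it and every layer before it are unchanged."""
    for index, layer in enumerate(requested):
        if index >= len(cached) or cached[index] != layer:
            return requested[index:]
    return []  # every requested layer is already cached


previous_build = ["debian_slim py3.11", "pip_install torch==2.3.0"]
# Appending a new call leaves the earlier layers untouched.
appended = previous_build + ["pip_install peft==0.11.1"]
# Editing an early call invalidates that layer and everything after it.
edited = ["debian_slim py3.11", "pip_install torch==2.3.1", "pip_install peft==0.11.1"]

print(layers_to_rebuild(previous_build, appended))
print(layers_to_rebuild(previous_build, edited))
```

The practical consequence is ordering: slow, stable steps belong early in the chain and frequently changing packages belong at the end.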
import modal

# Base Debian image with Python and ML packages
ml_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.3.0",
        "transformers==4.41.0",
        "accelerate==0.30.0",
        "bitsandbytes==0.43.1",
        "sentencepiece==0.2.0",
        "protobuf==4.25.3",
    )
    .apt_install("libgomp1")  # OpenMP for quantized inference
)
# Build from a pre-existing Docker Hub image (CUDA base)
cuda_image = (
    modal.Image.from_registry(
        "nvidia/cuda:12.1.0-cudnn8-devel-ubuntu22.04",
        add_python="3.11",
    )
    .pip_install(
        "torch==2.3.0+cu121",
        "triton==2.3.0",
        # Index options are keyword arguments, not package names
        extra_index_url="https://download.pytorch.org/whl/cu121",
    )
)
# Build from a local Dockerfile
dockerfile_image = modal.Image.from_dockerfile(
    path="./Dockerfile.ml",
    context_mount=modal.Mount.from_local_dir(
        "./build_context",
        remote_path="/build_context",
    ),
)
# Install a package that requires a build step
vllm_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("vllm==0.5.0", extra_options="--no-build-isolation")
    .run_commands(
        # Compile flash-attention in its own layer so the result is cached
        "pip install flash-attn==2.5.9 --no-build-isolation"
    )
)
Note
Pin package versions in pip_install() calls. Modal caches image layers based on the exact arguments, so an unpinned dependency keeps whatever version was current when the layer was first built and then changes silently whenever the layer is rebuilt. Pin exact versions (transformers==4.41.0, not transformers>=4.41 or transformers==4.41.*) to avoid version drift between deploys. Use pip freeze in a local venv after testing to generate pinned versions before committing your image definition.
GPU Functions: Inference and Fine-Tuning
GPU allocation is a single parameter. The gpu= parameter accepts string shortcuts for common tiers or explicit GPU objects for memory and count control. Modal supports T4, L4, A10G, A100, and H100, with the A100 offered in 40 GB and 80 GB memory variants. The MLOps CI/CD guide covers how to wrap Modal deployments in automated evaluation pipelines that run regression tests against the deployed endpoint before routing production traffic to a new model version. The two layers complement each other: Modal handles infrastructure provisioning, the CI pipeline handles promotion gating.
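Choosing a tier usually starts with a rough memory estimate. The sketch below is a back-of-the-envelope helper, not part of Modal: the VRAM figures are commonly published values treated as assumptions, the 20% headroom factor is a guess, and the dictionary keys are labels for this example only.

```python
# VRAM per tier in GB. Commonly published figures, treated here as assumptions.
GPU_VRAM_GB = {
    "T4": 16,
    "L4": 24,
    "A10G": 24,
    "A100-40GB": 40,
    "A100-80GB": 80,
    "H100": 80,
}


def weights_gb(params_billions: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone: 2 bytes per parameter for fp16/bf16."""
    return params_billions * bytes_per_param


def smallest_fitting_gpu(params_billions: float, bytes_per_param: int = 2,
                         headroom: float = 1.2):
    """Smallest tier whose VRAM covers weights plus an assumed 20% headroom."""
    needed = weights_gb(params_billions, bytes_per_param) * headroom
    # Sort by VRAM, then by name; the alphabetical tie-break is arbitrary.
    fitting = sorted(
        (vram, name) for name, vram in GPU_VRAM_GB.items() if vram >= needed
    )
    return fitting[0][1] if fitting else None


choice_7b = smallest_fitting_gpu(7)     # about 14 GB of fp16 weights
choice_70b = smallest_fitting_gpu(70)   # about 140 GB: no single tier here fits
print(choice_7b, choice_70b)
```

This estimate covers inference weights only. Fine-tuning also needs room for gradients, optimizer state, and activations, so it typically calls for a larger tier than the same model needs for serving.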
import modal

app = modal.App("gpu-inference")

inference_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.3.0",
        "transformers==4.41.0",
        "accelerate==0.30.0",
    )
)


# Simple GPU function: provisions an A10G, runs the function, terminates
@app.function(
    image=inference_image,
    gpu="A10G",  # 24 GB VRAM, billed per second
    timeout=300,  # seconds before Modal kills the container
    retries=2,  # retry on transient failures
)
def run_inference(prompt: str) -> str:
    from transformers import pipeline
    import torch

    pipe = pipeline(
        "text-generation",
        model="mistralai/Mistral-7B-Instruct-v0.2",
        torch_dtype=torch.float16,
        device_map="cuda:0",
    )
    outputs = pipe(
        prompt,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
    )
    return outputs[0]["generated_text"]
# Fine-tuning function: needs a larger GPU and more time
training_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.3.0",
        "transformers==4.41.0",
        "accelerate==0.30.0",
        "bitsandbytes==0.43.1",
        "peft==0.11.1",
        "datasets==2.20.0",
        "trl==0.9.4",
    )
)


@app.function(
    image=training_image,
    gpu=modal.gpu.A100(size="80GB"),  # 80 GB A100 for large model fine-tuning
    timeout=7200,  # 2 hours for training run
    secrets=[modal.Secret.from_name("huggingface-token")],
)
def fine_tune_llm(dataset_name: str, output_path: str) -> dict:
    import os
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
    from peft import LoraConfig, get_peft_model
    from datasets import load_dataset
    from trl import SFTTrainer

    token = os.environ["HF_TOKEN"]
    model = AutoModelForCausalLM.from_pretrained(
        "meta-llama/Meta-Llama-3-8B",
        token=token,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    lora_config = LoraConfig(
        r=16,
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()
    dataset = load_dataset(dataset_name, split="train")
    trainer = SFTTrainer(
        model=model,
        train_dataset=dataset,
        args=TrainingArguments(
            output_dir=output_path,
            num_train_epochs=3,
            per_device_train_batch_size=4,
            gradient_accumulation_steps=4,
            learning_rate=2e-4,
            bf16=True,  # match the bfloat16 weights loaded above
            logging_steps=10,
            save_steps=100,
        ),
    )
    trainer.train()
    return {"status": "complete", "output_path": output_path}
Persistent Storage with Modal Volumes
Modal functions are stateless by design: the container filesystem is ephemeral. modal.Volume provides a persistent network filesystem that mounts at a specified path inside the container. Volumes are backed by Modal's distributed storage and are accessible from any function in the app. The canonical pattern is to download model weights once into a Volume in a dedicated preparation function, then mount the same Volume in inference functions to avoid re-downloading multi-gigabyte models on every cold start.
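The "download once, reuse everywhere" pattern boils down to an idempotent check before the download. The following sketch shows that check against a local temporary directory standing in for the mount path; ensure_weights, the marker file name, and fake_download are hypothetical names for illustration and are not Modal APIs.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_weights(volume_root: Path, model_id: str,
                   download: Callable[[Path], None]) -> Path:
    """Download into volume_root only if a completion marker is absent."""
    target = volume_root / model_id.replace("/", "--")
    marker = target / ".download_complete"
    if not marker.exists():
        target.mkdir(parents=True, exist_ok=True)
        download(target)
        marker.touch()  # written last, so a partial download is retried
    return target


calls = []


def fake_download(target: Path) -> None:
    # Stand-in for the real download step.
    calls.append(target)
    (target / "model.safetensors").write_bytes(b"weights")


with tempfile.TemporaryDirectory() as root:
    model_id = "mistralai/Mistral-7B-Instruct-v0.2"
    first = ensure_weights(Path(root), model_id, fake_download)
    second = ensure_weights(Path(root), model_id, fake_download)
    download_count = len(calls)
    dir_name = first.name
    same_path = first == second

print(download_count, dir_name)
```

Writing the marker only after the download finishes means an interrupted run leaves no marker behind, so the next invocation tries again instead of serving a half-populated directory.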
# Create a named volume that persists across function runs
model_volume = modal.Volume.from_name(
    "llm-model-weights",
    create_if_missing=True,
)


@app.function(
    image=inference_image,
    volumes={"/models": model_volume},
    timeout=3600,
    secrets=[modal.Secret.from_name("huggingface-token")],
)
def download_model_weights(model_id: str):
    """Run once to populate the volume; not called on every inference."""
    import os
    import subprocess

    subprocess.run(
        [
            "huggingface-cli", "download",
            model_id,
            "--local-dir", f"/models/{model_id.replace('/', '--')}",
            "--token", os.environ["HF_TOKEN"],
        ],
        check=True,
    )
    # IMPORTANT: commit() makes writes visible to other containers
    model_volume.commit()
    print(f"Downloaded {model_id} to volume")


@app.function(
    image=inference_image,
    gpu="A10G",
    volumes={"/models": model_volume},  # same volume; inference only reads from it
    timeout=120,
)
def infer_from_volume(prompt: str) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer
    import torch

    model_path = "/models/mistralai--Mistral-7B-Instruct-v0.2"
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModelForCausalLM.from_pretrained(
        model_path,
        torch_dtype=torch.float16,
        device_map="cuda:0",
    )
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)


@app.local_entrypoint()
def prepare():
    # Run this once before deploying the inference function
    download_model_weights.remote("mistralai/Mistral-7B-Instruct-v0.2")
Note
volume.commit() is required after any write operation to flush changes from the container's local overlay filesystem to Modal's distributed storage. Writes that are not committed are discarded when the container terminates. A container sees the Volume as it was last committed when the container mounted it; call volume.reload() in a long-running container to pick up commits made afterwards. For high-frequency write workloads, batch writes and commit less often rather than committing after every file write, since commit() incurs a network round-trip.
Class-Based Functions with Lifecycle Hooks
The biggest latency cost in LLM inference is model loading: reading multi-gigabyte weights from disk and moving them to GPU memory. When the model is loaded inside the body of an @app.function(), as in the examples above, loading runs on every invocation, warm container or not. For inference serving, this is unacceptable: a 7B model takes 15–30 seconds to load, and that latency is added to every request.
modal.Cls solves this with lifecycle hooks. The @modal.enter() decorator marks a method that runs once per container initialization, before the first request is served. Model weights load in @modal.enter(), then the same loaded model is reused across all requests handled by that container. Combined with container_idle_timeout to keep containers alive between requests and allow_concurrent_inputs to let one container serve several requests at once, this pattern gives inference latency comparable to a persistent server while still scaling to zero during idle periods.
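The load-once idea is independent of Modal and can be shown with a plain-Python stand-in. ToyContainer below is a hypothetical analogy for one container's lifecycle, not Modal's runtime, and expensive_load is a placeholder for real weight loading.

```python
class ToyContainer:
    """Plain-Python stand-in for one container; not Modal's runtime."""

    def __init__(self, load_model):
        self._load_model = load_model
        self.model = None
        self.load_count = 0

    def enter(self):
        # Analogue of an @modal.enter() hook: runs once at container start.
        self.model = self._load_model()
        self.load_count += 1

    def handle(self, prompt: str) -> str:
        # Analogue of a @modal.method(): reuses the already-loaded model.
        return self.model(prompt)


def expensive_load():
    # Stand-in for reading weights from disk and moving them to the GPU.
    return lambda prompt: prompt.upper()


container = ToyContainer(expensive_load)
container.enter()
replies = [container.handle(p) for p in ["a", "b", "c"]]
print(replies, container.load_count)
```

Three requests are served with a single load. If the load instead lived inside handle(), the count would grow with every request, which is the cost the lifecycle hook avoids.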
from pydantic import BaseModel


class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 256
    temperature: float = 0.7


@app.cls(
    image=inference_image,
    gpu="A10G",
    volumes={"/models": model_volume},
    container_idle_timeout=300,  # keep container warm for 5 minutes after last request
    allow_concurrent_inputs=8,  # serve up to 8 concurrent requests per container
    concurrency_limit=5,  # cap at 5 parallel containers for cost control
)
class LlamaInference:
    @modal.enter()
    def load_model(self):
        """Runs once when the container starts; the model stays loaded for all requests."""
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        model_path = "/models/mistralai--Mistral-7B-Instruct-v0.2"
        self.tokenizer = AutoTokenizer.from_pretrained(model_path)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map="cuda:0",
        )
        self.model.eval()
        print("Model loaded and ready")

    @modal.method()
    def generate(self, request: GenerateRequest) -> str:
        import torch

        inputs = self.tokenizer(
            request.prompt,
            return_tensors="pt",
            truncation=True,
            max_length=2048,
        ).to("cuda")
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=request.max_tokens,
                temperature=request.temperature,
                do_sample=request.temperature > 0,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        generated = outputs[0][inputs["input_ids"].shape[1]:]
        return self.tokenizer.decode(generated, skip_special_tokens=True)

    @modal.exit()
    def cleanup(self):
        """Optional: runs when container is shutting down."""
        import torch

        del self.model
        torch.cuda.empty_cache()


# Calling from a local entrypoint or another Modal function
@app.local_entrypoint()
def test_inference():
    model = LlamaInference()
    result = model.generate.remote(
        GenerateRequest(prompt="Explain transformers in one paragraph.")
    )
    print(result)
Web Endpoints and API Serving
Modal can expose any function as an HTTPS endpoint. The @modal.web_endpoint() decorator creates a simple JSON endpoint from a regular function with automatic request/response serialization via Pydantic. For more complex APIs with middleware, authentication, and multiple routes, @modal.asgi_app() returns a FastAPI application that Modal serves as a full ASGI server.
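For the authentication piece, the header check itself is small enough to isolate and test without a web framework. This is a minimal sketch using only the standard library; is_authorized is a hypothetical helper name, and the constant-time comparison is a suggested hardening rather than something the platform requires.

```python
import hmac


def is_authorized(authorization_header: str, expected_key: str) -> bool:
    """Validate an 'Authorization: Bearer <token>' header value."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme != "Bearer" or not token:
        return False
    # compare_digest avoids leaking how many leading characters matched.
    return hmac.compare_digest(token.encode(), expected_key.encode())


print(is_authorized("Bearer s3cret", "s3cret"))
print(is_authorized("Bearer wrong", "s3cret"))
```

A function like this can be called from a FastAPI dependency, which keeps the comparison logic unit-testable on its own.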
from pydantic import BaseModel
from typing import Optional


class EmbedRequest(BaseModel):
    texts: list[str]
    model: str = "BAAI/bge-base-en-v1.5"
    normalize: bool = True


class EmbedResponse(BaseModel):
    embeddings: list[list[float]]
    model: str
    dimensions: int


embedding_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("sentence-transformers==3.0.1", "torch==2.3.0")
)


@app.cls(
    image=embedding_image,
    gpu="T4",
    container_idle_timeout=180,
    allow_concurrent_inputs=16,
)
class EmbeddingService:
    @modal.enter()
    def load(self):
        from sentence_transformers import SentenceTransformer

        self.models: dict = {}

    def _get_model(self, model_name: str):
        if model_name not in self.models:
            from sentence_transformers import SentenceTransformer

            self.models[model_name] = SentenceTransformer(model_name)
        return self.models[model_name]

    @modal.web_endpoint(method="POST", docs=True)
    def embed(self, request: EmbedRequest) -> EmbedResponse:
        model = self._get_model(request.model)
        embeddings = model.encode(
            request.texts,
            normalize_embeddings=request.normalize,
            batch_size=64,
        )
        return EmbedResponse(
            embeddings=embeddings.tolist(),
            model=request.model,
            dimensions=embeddings.shape[1],
        )

# Full FastAPI app with auth and multiple routes
from fastapi import FastAPI, HTTPException, Depends, Header
from fastapi.middleware.cors import CORSMiddleware
import os


def create_app() -> FastAPI:
    web_app = FastAPI(title="ML Inference API", version="1.0.0")
    web_app.add_middleware(
        CORSMiddleware,
        allow_origins=["https://yourdomain.com"],
        allow_methods=["POST"],
        allow_headers=["Authorization", "Content-Type"],
    )

    async def verify_token(authorization: str = Header(...)):
        expected = f"Bearer {os.environ['API_KEY']}"
        if authorization != expected:
            raise HTTPException(status_code=401, detail="Invalid token")

    @web_app.post("/v1/completions", dependencies=[Depends(verify_token)])
    async def completions(request: GenerateRequest):
        # Call internal Modal function (runs in same container)
        return {"text": "..."}

    @web_app.get("/health")
    async def health():
        return {"status": "ok"}

    return web_app


@app.function(
    image=inference_image,
    gpu="A10G",
    secrets=[modal.Secret.from_name("api-secrets")],
    container_idle_timeout=300,
    allow_concurrent_inputs=10,
)
@modal.asgi_app()
def fastapi_inference():
    return create_app()

# Deploying and accessing the endpoint
modal deploy inference_api.py
# Modal outputs the HTTPS URL after deployment:
# ✓ Created web endpoint https://yourorg--app-name-fastapi-inference.modal.run
# Test the endpoint
curl -X POST https://yourorg--app-name-fastapi-inference.modal.run/v1/completions \
  -H "Authorization: Bearer ${API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Summarize this document:", "max_tokens": 128}'
Scheduled Jobs and Cron
Modal's scheduling system replaces external cron infrastructure for ML workflows. A schedule=modal.Cron("0 2 * * *") parameter fires the function daily at 2 AM UTC with the same GPU and container configuration as on-demand calls. modal.Period(hours=6) fires at a fixed interval. Both are managed as part of the deployed app, so no separate cron server, Lambda trigger, or Cloud Scheduler configuration is required. The ML pipeline MLflow guide covers how to track scheduled training runs by logging parameters, metrics, and artifacts to the MLflow experiment registry from within a Modal function, for reproducibility and experiment comparison. Modal handles the compute provisioning while MLflow handles the experiment ledger.
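The difference between the two schedule types is easiest to see by computing firing times. The sketch below uses only the standard library and handles just the fixed daily case of the cron expression quoted above, not general cron syntax; the start time used for the interval schedule is an assumption, and both helper names are illustrative.

```python
from datetime import datetime, timedelta, timezone


def next_daily_run(now: datetime, hour: int = 2, minute: int = 0) -> datetime:
    """Next firing of a daily schedule such as cron '0 2 * * *' (UTC)."""
    candidate = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)  # today's slot has already passed
    return candidate


def period_runs(start: datetime, hours: int, count: int) -> list[datetime]:
    """First `count` firings of a fixed interval, measured from `start`."""
    return [start + timedelta(hours=hours * i) for i in range(1, count + 1)]


now = datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc)
print(next_daily_run(now))
print(period_runs(now, hours=6, count=2))
```

A cron schedule is anchored to the wall clock, while an interval schedule is anchored to whenever it started, so the two drift apart unless the interval happens to begin on a cron boundary.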
import modal

app = modal.App("ml-scheduled-jobs")

training_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install(
        "torch==2.3.0",
        "transformers==4.41.0",
        "accelerate==0.30.0",
        "mlflow==2.14.1",
        "datasets==2.20.0",
        "peft==0.11.1",
        "trl==0.9.4",
    )
)

model_volume = modal.Volume.from_name("ml-model-weights", create_if_missing=True)
checkpoint_volume = modal.Volume.from_name("training-checkpoints", create_if_missing=True)


# Nightly fine-tuning: runs daily at 2 AM UTC on an A100
@app.function(
    image=training_image,
    gpu=modal.gpu.A100(size="80GB"),
    volumes={"/models": model_volume, "/checkpoints": checkpoint_volume},
    schedule=modal.Cron("0 2 * * *"),
    timeout=10800,  # 3 hours max
    secrets=[
        modal.Secret.from_name("huggingface-token"),
        modal.Secret.from_name("mlflow-credentials"),
    ],
)
def nightly_fine_tuning():
    import os
    import mlflow
    from datasets import load_dataset

    mlflow.set_tracking_uri(os.environ["MLFLOW_TRACKING_URI"])
    with mlflow.start_run(run_name="nightly-finetune"):
        dataset = load_dataset("your-org/daily-feedback", split="train")
        mlflow.log_param("dataset_size", len(dataset))
        # ... training logic ...
        mlflow.log_metric("train_loss", 0.42)
        mlflow.log_artifact("/checkpoints/model_weights.bin")
    # Persist the checkpoint: commit the volume mounted at /checkpoints
    checkpoint_volume.commit()


# Hourly embedding refresh: runs every hour on a T4
@app.function(
    image=(
        modal.Image.debian_slim(python_version="3.11")
        .pip_install("sentence-transformers==3.0.1", "torch==2.3.0", "psycopg2-binary==2.9.9")
    ),
    gpu="T4",
    schedule=modal.Period(hours=1),
    timeout=1800,
    secrets=[modal.Secret.from_name("database-credentials")],
)
def refresh_product_embeddings():
    import os
    import psycopg2
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-base-en-v1.5")
    conn = psycopg2.connect(os.environ["DATABASE_URL"])
    cur = conn.cursor()
    # Fetch products updated since last run
    cur.execute("""
        SELECT id, title || ' ' || description AS text
        FROM products
        WHERE updated_at > NOW() - INTERVAL '2 hours'
        AND embedding_refreshed_at < updated_at
    """)
    rows = cur.fetchall()
    if not rows:
        print("No products to re-embed")
        return
    ids = [r[0] for r in rows]
    texts = [r[1] for r in rows]
    embeddings = model.encode(texts, normalize_embeddings=True)
    for pid, emb in zip(ids, embeddings):
        cur.execute(
            "UPDATE products SET embedding = %s, embedding_refreshed_at = NOW() WHERE id = %s",
            (emb.tolist(), pid),
        )
    conn.commit()
    print(f"Re-embedded {len(ids)} products")
Note
Scheduled functions deployed with modal deploy run even when no local process is running. Modal's scheduler does not backfill missed runs: if a nightly job errors, that run is not retried automatically, and the function simply fires again at its next scheduled time. Build idempotent scheduled functions that can be re-run manually via modal run without side effects, and add alerting via webhook calls in the function body or a @modal.exit() handler that posts to Slack on failure.
Secrets and Environment Configuration
Modal secrets are key-value stores managed through the Modal dashboard or CLI. A secret is referenced by name and injected into the container environment at runtime. Functions declare which secrets they need in the secrets= parameter — Modal resolves and injects them before the container starts. Secrets can be rotated in the Modal dashboard without redeploying the function, since they are resolved at container startup time rather than at build time.
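Because secrets arrive as ordinary environment variables, a function can validate them up front and fail with a clear message instead of a KeyError deep inside training code. This is a minimal sketch; require_env is a hypothetical helper, and the value set here only simulates what an injected secret would provide.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the named environment variables, or raise listing the missing ones."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: os.environ[name] for name in names}


# Stand-in for a value that a secret would inject into the container.
os.environ["HF_TOKEN"] = "hf_example"
config = require_env("HF_TOKEN")
print(sorted(config))
```

Calling a check like this at the top of a function body surfaces a forgotten secrets= entry at startup rather than partway through a long job.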
# Create secrets via CLI
modal secret create huggingface-token HF_TOKEN=hf_xxxxxxxxxxxx
modal secret create wandb-credentials WANDB_API_KEY=xxxxxxxxxx WANDB_PROJECT=my-project
modal secret create database-credentials DATABASE_URL=postgresql://user:pass@host:5432/db
modal secret create api-secrets API_KEY=sk_xxxxxxxxxxxxxxxx
# Or create from a .env file
modal secret create ml-secrets --from-dotenv .env.production
# Reference secrets in function definitions
@app.function(
    image=training_image,
    gpu="A100",
    secrets=[
        modal.Secret.from_name("huggingface-token"),
        modal.Secret.from_name("wandb-credentials"),
        modal.Secret.from_name("database-credentials"),
    ],
)
def train_with_tracking():
    import os
    import wandb

    # Secrets are available as environment variables
    hf_token = os.environ["HF_TOKEN"]
    wandb.login(key=os.environ["WANDB_API_KEY"])
    # huggingface_hub will automatically pick up HF_TOKEN
    from huggingface_hub import login

    login(token=hf_token)
    # ... rest of training logic ...


# Build a secret programmatically (for dynamic credentials)
@app.function()
def generate_temp_credentials():
    import boto3

    sts = boto3.client("sts")
    creds = sts.assume_role(
        RoleArn="arn:aws:iam::123456789:role/modal-access",
        RoleSessionName="modal-session",
    )["Credentials"]
    # from_dict() builds an in-memory Secret object. It is not saved under a
    # name, so pass it in a function's secrets=[...] for it to take effect.
    temp_secret = modal.Secret.from_dict({
        "AWS_ACCESS_KEY_ID": creds["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": creds["SecretAccessKey"],
        "AWS_SESSION_TOKEN": creds["SessionToken"],
    })
Parallel Map for Batch Workloads
Function.map() submits all inputs to Modal simultaneously and runs them in parallel across as many containers as Modal can provision. This is the primary pattern for dataset embedding, batch evaluation, and offline inference — workloads that process a large corpus and do not have sequential dependencies. Each item in the iterable maps to one function call; Modal schedules them across containers up to the function's concurrency_limit.
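When batches can fail independently, the bookkeeping that keeps outputs aligned with inputs is worth isolating. The sketch below is plain Python with simulated results; chunk and merge_results are hypothetical helpers, and it assumes results come back in the same order as the submitted batches.

```python
def chunk(items: list, size: int) -> list[list]:
    """Split items into consecutive batches of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def merge_results(batches, results, placeholder=None):
    """Flatten per-batch results; a failed batch yields one placeholder per input."""
    merged = []
    for batch, result in zip(batches, results):
        if isinstance(result, Exception):
            merged.extend([placeholder] * len(batch))
        else:
            merged.extend(result)
    return merged


docs = [f"doc-{i}" for i in range(5)]
batches = chunk(docs, 2)  # the last batch holds a single document
# Simulated outputs: the final, shorter batch failed.
results = [["e0", "e1"], ["e2", "e3"], RuntimeError("boom")]
merged = merge_results(batches, results)
print(merged)
```

Sizing the placeholder run by len(batch) rather than by the nominal batch size keeps the output the same length as the input even when the final batch is short.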
import modal

app = modal.App("batch-embedding")

embed_image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("sentence-transformers==3.0.1", "torch==2.3.0")
)


@app.function(
    image=embed_image,
    gpu="T4",
    concurrency_limit=20,  # run up to 20 GPU containers in parallel
    allow_concurrent_inputs=4,  # each container handles 4 concurrent requests
)
def embed_batch(texts: list[str]) -> list[list[float]]:
    from sentence_transformers import SentenceTransformer
    import torch

    model = SentenceTransformer(
        "BAAI/bge-large-en-v1.5",
        device="cuda" if torch.cuda.is_available() else "cpu",
    )
    return model.encode(texts, normalize_embeddings=True).tolist()


@app.local_entrypoint()
def embed_corpus():
    import json

    # Load 50,000 documents
    with open("documents.jsonl") as f:
        documents = [json.loads(line)["text"] for line in f]
    # Split into batches of 128 texts
    batch_size = 128
    batches = [
        documents[i : i + batch_size]
        for i in range(0, len(documents), batch_size)
    ]
    print(f"Processing {len(documents)} documents in {len(batches)} batches")
    # map() submits all batches concurrently; Modal provisions containers as needed
    # return_exceptions=True prevents one failed batch from killing the whole run
    all_embeddings = []
    results = embed_batch.map(batches, return_exceptions=True)
    for batch, result in zip(batches, results):
        if isinstance(result, Exception):
            print(f"Batch failed: {result}")
            # One placeholder per input; the final batch may be shorter than batch_size
            all_embeddings.extend([[]] * len(batch))
        else:
            all_embeddings.extend(result)
    print(f"Embedded {len(all_embeddings)} documents")


# starmap() for functions with multiple arguments
# text_model_pairs = [("text1", "model-a"), ("text2", "model-b"), ...]
# results = list(embed_with_model.starmap(text_model_pairs))
Production Checklist
Use @modal.enter() for all model loading — never load models inside @modal.method(). Loading a 7B model inside the method body means every request in a fresh container pays the full 15–30 second load penalty. With @modal.enter(), the model loads once per container startup and is reused across all requests that container serves. Set container_idle_timeout to a value that reflects your traffic pattern: 60s for low-traffic endpoints, 300s for moderate traffic, up to 3600s for always-warm serving.
Pin every package version in image definitions. Modal images are content-addressed: the same pip_install() call with the same version strings always resolves to the same cached layer. Unpinned dependencies like transformers or torch can resolve to different versions on different build dates, so a rebuilt layer can introduce subtle behavioral differences between deploys. Use pip freeze in a local venv that matches your Modal image's Python version to generate pinned requirements before committing the image definition.
Always call volume.commit() after writing files to a Volume. Writes to a mounted Volume go to a container-local overlay filesystem and are not visible to other containers until commit() is called. A training job that writes a checkpoint but crashes before commit() will appear to succeed while producing no persistent output. Add commit() calls at logical checkpoints during training, not just at the end, so a crashed run preserves intermediate progress.
Set timeout explicitly for long-running functions. Modal's default timeout is 300 seconds — enough for inference, not for training. A fine-tuning run that takes 4 hours will be killed at 5 minutes if you forget to set timeout=14400. Set timeout to 150% of your expected worst-case runtime, not average runtime. Functions approaching the timeout limit can be designed to checkpoint state to a Volume and restart from the last checkpoint rather than running the full job in one invocation.
Use concurrency_limit to bound costs for GPU workloads. Without a limit, a sudden traffic spike can provision hundreds of GPU containers simultaneously, each incurring per-second GPU billing. Set concurrency_limit to a value you can afford: for a web endpoint on A10G at $1.10/hr, 10 concurrent containers costs at most $11/hr during peak — predictable and within budget. Excess requests queue rather than failing, since Modal's scheduler holds them until a container becomes available.
Use modal serve for development iteration, modal deploy for production. modal serve starts the app with a file watcher that rebuilds and redeploys on code changes — ideal for rapid iteration on inference logic. Avoid serving production traffic through modal serve; it terminates when you close the terminal. modal deploy creates a persistent deployment that survives terminal closure, restarts on failure, and runs scheduled jobs.
Separate environment-specific configuration with with_options(). In the Modal client this runtime override is exposed on classes (modal.Cls) rather than on plain functions, so rather than maintaining separate app files for staging and production, define the class with production defaults and override it for staging: StagingInference = LlamaInference.with_options(gpu='T4', concurrency_limit=2). This avoids drift between staging and production definitions while still allowing resource tier differences.
Mount local files into containers with modal.Mount rather than hardcoding paths. Configuration files, prompt templates, and tokenizer vocabularies that live in your repository can be mounted into the container without rebuilding the image: @app.function(mounts=[modal.Mount.from_local_file('./prompts.yaml', remote_path='/app/prompts.yaml')]). This separates environment (image) from configuration (mount), enabling prompt updates without image rebuilds.
Use modal.Dict for distributed state when functions need to coordinate. modal.Dict is a distributed key-value store accessible from any function in the app. It is useful for deduplication state in parallel map() workloads, rate limiting counters, and progress tracking across concurrent containers. Unlike a database, it requires no connection management and is accessible with a single line: results_dict = modal.Dict.from_name('job-results', create_if_missing=True).
Monitor deployed endpoints with modal logs and structured logging. modal logs --follow <app-name> streams live logs from all containers to your terminal during development. In production, add structured JSON logging to functions so log aggregation systems can parse fields: json.dumps({'event': 'request', 'latency_ms': elapsed, 'tokens': count, 'model': model_name}). Set up Slack alerts by calling a webhook in a @modal.exit() handler that fires when a container terminates with an unhandled exception.
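The structured logging advice can be wrapped in a small helper so every record has the same shape. This is a minimal sketch using only the standard library; log_event and the field values are illustrative, and the timed section is a placeholder for real request handling.

```python
import json
import time


def log_event(event: str, **fields) -> str:
    """Emit one JSON object per line so log aggregators can parse the fields."""
    record = {"event": event, "ts": round(time.time(), 3), **fields}
    line = json.dumps(record, sort_keys=True)
    print(line)
    return line


start = time.perf_counter()
# ... handle the request here ...
elapsed_ms = round((time.perf_counter() - start) * 1000, 2)
line = log_event("request", latency_ms=elapsed_ms, tokens=128, model="example-model")
```

Printing one JSON object per line keeps the output readable in a streamed log view and still machine-parseable downstream.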
Do your GPU instances run at 5% utilization between training jobs because you keep them alive to avoid cold start latency? Does your ML team deploy model updates by SSHing into servers and modifying files in place, with no version control or reproducibility? Does your inference endpoint cost $3,000 per month even though peak traffic lasts only 4 hours per day?
We design and implement Modal-based ML infrastructure. That covers image definition with a pip_install() layer caching strategy for fast iterative deploys, and gpu= tier selection across T4, A10G, A100, and H100 for cost-to-performance optimization. It covers modal.Cls architecture with @modal.enter() model loading and allow_concurrent_inputs batching for warm-container inference serving, and modal.Volume setup for persistent model weight storage with commit() durability and cross-function access patterns. It covers @modal.web_endpoint() and @modal.asgi_app() FastAPI serving with Bearer token authentication and CORS configuration, modal.Cron and modal.Period scheduled job configuration for nightly fine-tuning and periodic embedding refresh runs, and modal.Secret wiring for HuggingFace, WandB, and database credentials with dashboard rotation without redeployment. It also covers Function.map() parallel processing pipelines for dataset embedding and batch offline inference with return_exceptions fault tolerance, container_idle_timeout and concurrency_limit tuning for throughput versus cost trade-offs, and modal logs structured logging with @modal.exit() Slack webhook integration for production monitoring and failure alerting. Let's talk.