Why Freeform LLM Output Breaks Production
Every team that integrates an LLM into a production system eventually hits the same wall: the model returns text, and the application needs data. At first, string parsing feels manageable: a few regex patterns, a JSON block extracted with a slice, a hand-rolled parser that handles the two output variants you have seen so far. Then it's 3 AM on a Tuesday, the model has started wrapping its output in a markdown code fence it didn't include yesterday, and your JSON parser throws an exception that a catch-all handler swallows, silently dropping an entire batch of invoice records.
The problem compounds downstream. REST consumers expecting a typed InvoiceRecord object get None because the deserialization silently failed. Analytics pipelines see nulls where they expected amounts. An ML feature pipeline skips rows with missing fields and quietly degrades model accuracy over the next three weeks. None of these failures produce an obvious error — they produce silent data corruption.
The root cause is always the same: the contract between the LLM and the consuming system was expressed in natural language rather than a machine-verifiable schema. The fix is to move validation to the boundary — the moment the model response enters your system — using tools designed for exactly this purpose.
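To make "validation at the boundary" concrete, here is a minimal standard-library sketch. The names `parse_invoice_payload`, `BoundaryError`, and the `REQUIRED_FIELDS` table are assumptions made up for this illustration, not part of any SDK; the rest of the article replaces this kind of hand-rolled check with Pydantic.

```python
import json
import re

# Hypothetical contract for this sketch: field name -> accepted Python types.
REQUIRED_FIELDS = {
    "invoice_number": (str,),
    "vendor_name": (str,),
    "total_amount_usd": (int, float),
}

# The fence marker is built indirectly so this sketch can itself be quoted in a fenced block.
FENCE = "`" * 3
FENCE_RE = re.compile(rf"^\s*{FENCE}(?:json)?\s*\n(.*?)\n\s*{FENCE}\s*$", re.DOTALL)


class BoundaryError(ValueError):
    """Raised when a model response violates the expected contract."""


def parse_invoice_payload(raw_text: str) -> dict:
    """Validate a raw model response at the boundary and fail loudly."""
    # Tolerate an optional markdown code fence around the JSON body.
    match = FENCE_RE.match(raw_text)
    body = match.group(1) if match else raw_text
    try:
        data = json.loads(body)
    except json.JSONDecodeError as exc:
        raise BoundaryError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise BoundaryError("response must be a JSON object")
    problems = []
    for name, types in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int, so reject it explicitly.
        elif isinstance(data[name], bool) or not isinstance(data[name], types):
            problems.append(f"wrong type for {name}: {type(data[name]).__name__}")
    if problems:
        raise BoundaryError("; ".join(problems))
    return data
```

The point of the design is that a malformed response raises one explicit, named error where it enters the system, instead of surfacing later as a None or a missing key.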
Note
KeyError raised six function calls deep in a payment processing flow.
The Three Methods: JSON Mode, Tool Calling, and Native Structured Outputs
There are three primary mechanisms for constraining LLM output to a specific shape. Each has different guarantees, different failure modes, and different implementation complexity. Choosing the right one for your use case is the first design decision in any structured extraction pipeline.
JSON Mode
Both the OpenAI and Anthropic APIs support a JSON mode that instructs the model to return valid JSON. The guarantee is syntactic only — the response will be parseable JSON, but the schema of that JSON is not enforced. You may get every field, some fields, or entirely different fields than you asked for. JSON mode is appropriate for exploratory use cases where schema flexibility is acceptable; it is not sufficient for production extraction pipelines where downstream consumers have rigid expectations.
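The gap between "valid JSON" and "the JSON you asked for" is easy to see offline. The sketch below assumes a hypothetical response string and a hypothetical `EXPECTED_KEYS` set; it shows `json.loads` succeeding on a payload whose fields were renamed.

```python
import json

# Keys the downstream consumer expects (an assumption for this sketch).
EXPECTED_KEYS = {"invoice_number", "vendor", "total_amount", "due_date"}


def schema_drift(raw_json: str) -> tuple[set[str], set[str]]:
    """Return (missing, unexpected) top-level keys for a JSON-mode response."""
    data = json.loads(raw_json)  # succeeds for any syntactically valid JSON
    if not isinstance(data, dict):
        raise TypeError("expected a JSON object at the top level")
    keys = set(data)
    return EXPECTED_KEYS - keys, keys - EXPECTED_KEYS


# A hypothetical response: perfectly valid JSON, but two fields were renamed.
response_text = (
    '{"invoice_no": "INV-2026-0042", "vendor": "Acme Supplies Ltd", '
    '"total": 5050.0, "due_date": "2026-06-15"}'
)
missing, unexpected = schema_drift(response_text)
print("missing:", sorted(missing))
print("unexpected:", sorted(unexpected))
```

A check like this catches drift, but it is still a hand-written contract; the schema-based methods that follow move that contract into the request itself.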
Tool Calling
Tool calling (also called function calling) allows you to define a JSON Schema for a “tool” and force the model to call it. The model's response is not free text — it is a structured tool call whose arguments conform to the schema you provided. This is the most widely supported and battle-tested mechanism for structured extraction across providers, and it is the approach we will focus on throughout this article.
Native Structured Outputs
OpenAI's Structured Outputs feature (introduced in 2024) goes one step further than tool calling by enforcing the schema during generation through constrained decoding. This provides stronger guarantees than tool calling alone, at the cost of some latency overhead, mainly on the first request that uses a new schema. Anthropic's equivalent is forced tool calling with a specific tool name, a pattern implemented in the Anthropic SDK section below.
The following example shows all three approaches for extracting an invoice entity from unstructured text, so you can compare the code surface area and guarantee level:
# compare_extraction_methods.py
import json

import anthropic
import openai
from pydantic import BaseModel

TEXT = """
Invoice #INV-2026-0042 from Acme Supplies Ltd
Date: May 15, 2026
Due: June 15, 2026
Items:
- Cloud infrastructure (monthly): $4,200.00
- Support SLA tier 2: $850.00
Total due: $5,050.00
"""

# ── Method 1: JSON mode (syntactic guarantee only) ──────────────────────────
openai_client = openai.OpenAI()
json_mode_response = openai_client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "Extract invoice data as JSON with fields: invoice_number, vendor, total_amount, due_date.",
        },
        {"role": "user", "content": TEXT},
    ],
)
# No schema guarantee: fields may be missing or renamed
raw = json.loads(json_mode_response.choices[0].message.content)
print("JSON mode:", raw)

# ── Method 2: Tool calling (schema-constrained) ─────────────────────────────
INVOICE_TOOL = {
    "name": "extract_invoice",
    "description": "Extract structured invoice data from the provided text.",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "vendor_name": {"type": "string"},
            "invoice_date": {"type": "string", "description": "ISO 8601 date"},
            "due_date": {"type": "string", "description": "ISO 8601 date"},
            "line_items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "description": {"type": "string"},
                        "amount_usd": {"type": "number"},
                    },
                    "required": ["description", "amount_usd"],
                },
            },
            "total_amount_usd": {"type": "number"},
        },
        "required": ["invoice_number", "vendor_name", "total_amount_usd", "due_date"],
    },
}

anthropic_client = anthropic.Anthropic()
tool_response = anthropic_client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[INVOICE_TOOL],
    tool_choice={"type": "tool", "name": "extract_invoice"},  # forced
    messages=[{"role": "user", "content": TEXT}],
)
tool_call = next(b for b in tool_response.content if b.type == "tool_use")
print("Tool calling:", tool_call.input)


# ── Method 3: OpenAI Structured Outputs (constrained decoding) ───────────────
class LineItem(BaseModel):
    description: str
    amount_usd: float


class InvoiceExtraction(BaseModel):
    invoice_number: str
    vendor_name: str
    due_date: str
    total_amount_usd: float
    line_items: list[LineItem]


structured_response = openai_client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the invoice data."},
        {"role": "user", "content": TEXT},
    ],
    response_format=InvoiceExtraction,  # the SDK derives the JSON Schema from the model
)
# `parsed` is None if the model refused; check message.refusal in that case
parsed = structured_response.choices[0].message.parsed
print("Structured outputs:", parsed)

Schema Design with Pydantic
Pydantic v2 is the de facto standard for defining structured schemas in Python LLM applications. Its strengths (automatic JSON Schema generation, runtime validation, type coercion, and descriptive error messages) map directly to the requirements of a structured extraction pipeline. The key insight is that calling model_json_schema() on a model class gives you the JSON Schema you need to pass as a tool definition to the LLM.
# invoice_schema.py
from __future__ import annotations

import json
from datetime import date
from enum import Enum
from typing import Optional

from pydantic import BaseModel, Field, field_validator


class Currency(str, Enum):
    USD = "USD"
    EUR = "EUR"
    GBP = "GBP"
    PLN = "PLN"


class LineItem(BaseModel):
    description: str = Field(..., description="Human-readable line item description")
    quantity: Optional[float] = Field(None, ge=0)
    unit_price: Optional[float] = Field(None, ge=0)
    amount: float = Field(..., description="Total amount for this line item")
    currency: Currency = Currency.USD


class VendorInfo(BaseModel):
    name: str
    tax_id: Optional[str] = Field(None, description="VAT number or EIN, if present")
    address: Optional[str] = None


class InvoiceExtraction(BaseModel):
    """Structured extraction schema for vendor invoices."""

    invoice_number: str = Field(..., description="Invoice identifier as printed on the document")
    vendor: VendorInfo
    invoice_date: Optional[date] = None
    due_date: Optional[date] = None
    line_items: list[LineItem] = Field(default_factory=list)
    subtotal: Optional[float] = None
    tax_amount: Optional[float] = None
    total_amount: float = Field(..., description="Final amount due, after tax")
    currency: Currency = Currency.USD
    payment_terms: Optional[str] = Field(None, description="e.g. Net 30, Due on receipt")
    notes: Optional[str] = None
    # Optional field: a string when the invoice references a PO, otherwise null
    po_reference: Optional[str] = Field(
        None, description="Purchase order number referenced on the invoice"
    )

    @field_validator("total_amount")
    @classmethod
    def total_must_be_positive(cls, v: float) -> float:
        if v <= 0:
            raise ValueError(f"total_amount must be positive, got {v}")
        return v


# Guarded so that importing this module from the later examples does not print the schema
if __name__ == "__main__":
    # Generate the JSON Schema for the LLM tool definition
    schema = InvoiceExtraction.model_json_schema()
    print(json.dumps(schema, indent=2))
    # Use this schema in the "input_schema" field of your Anthropic tool definition

Note
Add description fields to your Pydantic models. The LLM reads these descriptions when deciding how to populate each field. A field named amount with no description is ambiguous; the same field with description="Total amount for this line item, inclusive of discounts" significantly improves extraction accuracy on edge-case documents.
Anthropic SDK: Tool Use for Structured Extraction
The Anthropic tool use API supports a tool_choice parameter that forces the model to call a specific named tool. This is the key to schema-constrained extraction: instead of asking the model to “return JSON”, you define a tool whose sole purpose is to receive the extracted data, and the model must call it. The model cannot respond with free text; it must produce a structured tool call that conforms to the tool's input schema.
# anthropic_extraction.py
import anthropic
from pydantic import BaseModel

from invoice_schema import InvoiceExtraction

client = anthropic.Anthropic()


def build_tool_from_pydantic(model_cls: type[BaseModel], tool_name: str, description: str) -> dict:
    """Convert a Pydantic model into an Anthropic tool definition."""
    return {
        "name": tool_name,
        "description": description,
        "input_schema": model_cls.model_json_schema(),
    }


INVOICE_TOOL = build_tool_from_pydantic(
    InvoiceExtraction,
    tool_name="extract_invoice",
    description=(
        "Extract all structured invoice data from the provided document text. "
        "Populate every field you can identify. Use null for fields not present in the document."
    ),
)


def extract_invoice(document_text: str) -> InvoiceExtraction:
    """
    Extract structured invoice data from unstructured document text.
    Uses forced tool calling so the model must answer with an extract_invoice
    call, then validates the call's arguments with Pydantic.
    """
    response = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=[INVOICE_TOOL],
        tool_choice={"type": "tool", "name": "extract_invoice"},  # force this specific tool
        messages=[
            {
                "role": "user",
                "content": (
                    "Extract all invoice data from the following document.\n\n"
                    f"{document_text}"
                ),
            }
        ],
    )
    # The model must have returned a tool_use block
    tool_block = next(
        (b for b in response.content if b.type == "tool_use" and b.name == "extract_invoice"),
        None,
    )
    if tool_block is None:
        raise ValueError(
            f"Model did not call the extract_invoice tool. stop_reason={response.stop_reason}"
        )
    # Validate the tool call arguments against our Pydantic schema
    return InvoiceExtraction.model_validate(tool_block.input)


# Usage
SAMPLE_DOC = """
Invoice #INV-2026-0042 from Acme Supplies Ltd (VAT: GB123456789)
Invoice Date: 2026-05-15 | Due Date: 2026-06-15 | PO: PO-2026-0019
Cloud infrastructure (monthly) ........... USD 4,200.00
Support SLA tier 2 ........................ USD 850.00
Subtotal: USD 5,050.00
Tax (0%): USD 0.00
Total Due: USD 5,050.00
Payment Terms: Net 30
"""
invoice = extract_invoice(SAMPLE_DOC)
print(invoice.model_dump_json(indent=2))

Note
With tool_choice={"type": "tool", "name": "..."}, the model is not allowed to use any other tool or respond with free text. This eliminates the most common failure mode of tool-calling extraction: the model deciding to respond conversationally instead of calling the tool, which happens when the document is ambiguous or the model is uncertain.
Validation and Retry Patterns
Even with forced tool calling, the model can produce values that pass JSON Schema validation but fail Pydantic's richer validators — a negative total amount, a due date before the invoice date, a line item sum that doesn't match the stated subtotal. The correct response to these validation failures is not to crash or silently corrupt data: it is to send the validation error back to the model and ask it to correct its extraction.
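The cross-field rules mentioned above (due date before invoice date, line items that do not add up) are not expressible as per-field checks. One way to encode them is a Pydantic v2 model-level validator. The sketch below uses a trimmed-down, hypothetical `CheckedInvoice` model and an assumed one-cent tolerance; it is an illustration, not the schema used elsewhere in this article.

```python
from datetime import date
from typing import Optional

from pydantic import BaseModel, Field, ValidationError, model_validator


class LineItem(BaseModel):
    description: str
    amount: float


class CheckedInvoice(BaseModel):
    """Hypothetical trimmed-down invoice model used only for this sketch."""

    invoice_date: Optional[date] = None
    due_date: Optional[date] = None
    line_items: list[LineItem] = Field(default_factory=list)
    subtotal: Optional[float] = None

    @model_validator(mode="after")
    def check_cross_field_rules(self) -> "CheckedInvoice":
        # Rule 1: the due date cannot precede the invoice date.
        if self.invoice_date and self.due_date and self.due_date < self.invoice_date:
            raise ValueError("due_date is earlier than invoice_date")
        # Rule 2: line items must add up to the stated subtotal (assumed 1-cent tolerance).
        if self.subtotal is not None and self.line_items:
            items_total = sum(item.amount for item in self.line_items)
            if abs(items_total - self.subtotal) > 0.01:
                raise ValueError(
                    f"line items sum to {items_total:.2f}, subtotal is {self.subtotal:.2f}"
                )
        return self


try:
    CheckedInvoice.model_validate({"invoice_date": "2026-05-15", "due_date": "2026-04-15"})
except ValidationError as exc:
    # The message text is what gets sent back to the model in a retry loop.
    print(exc.errors()[0]["msg"])
```

Because these failures surface as an ordinary ValidationError, they can be fed into the same retry-with-feedback loop as any field-level error.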
The retry-with-feedback pattern is one of the highest-value patterns in production LLM systems. It converts hard failures into model corrections, and in practice eliminates the vast majority of extraction errors on the second attempt. Use tenacity for exponential backoff on transient network errors, and a manual retry loop for validation failures where you want to inject error context into the conversation.
# retry_extraction.py
import json

import anthropic
from pydantic import ValidationError
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type

from invoice_schema import InvoiceExtraction

client = anthropic.Anthropic()
MAX_VALIDATION_RETRIES = 3


@retry(
    retry=retry_if_exception_type(anthropic.APIConnectionError),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True,
)
def _call_model(messages: list, tools: list) -> anthropic.types.Message:
    """Call the Anthropic API with exponential backoff on transient errors."""
    return client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=tools,
        tool_choice={"type": "tool", "name": "extract_invoice"},
        messages=messages,
    )


def extract_with_retry(
    document_text: str,
    tool_def: dict,
    max_retries: int = MAX_VALIDATION_RETRIES,
) -> InvoiceExtraction:
    """
    Extract invoice data with automatic retry on validation failure.
    Sends the validation error back to the model as context for the next attempt.
    """
    messages: list[dict] = [
        {
            "role": "user",
            "content": (
                "Extract all invoice data from the following document.\n\n"
                f"{document_text}"
            ),
        }
    ]
    for attempt in range(1, max_retries + 1):
        response = _call_model(messages, [tool_def])
        tool_block = next(
            (b for b in response.content if b.type == "tool_use" and b.name == "extract_invoice"),
            None,
        )
        if tool_block is None:
            raise ValueError(f"Model did not call the tool on attempt {attempt}")
        # Append model response to history (required for multi-turn context)
        messages.append({"role": "assistant", "content": response.content})
        try:
            return InvoiceExtraction.model_validate(tool_block.input)
        except ValidationError as exc:
            if attempt == max_retries:
                raise RuntimeError(
                    f"Extraction failed after {max_retries} attempts. "
                    f"Last validation error:\n{exc}"
                ) from exc
            # Send validation error back to model as tool result + correction request
            error_detail = json.dumps(
                [{"field": e["loc"], "message": e["msg"]} for e in exc.errors()],
                indent=2,
            )
            messages.append({
                "role": "user",
                "content": [
                    {
                        "type": "tool_result",
                        "tool_use_id": tool_block.id,
                        "content": (
                            f"Validation failed with the following errors:\n{error_detail}\n\n"
                            "Please re-call the extract_invoice tool with corrected values."
                        ),
                        "is_error": True,
                    }
                ],
            })
    # Should not be reached
    raise RuntimeError("Extraction failed: exceeded retry loop bounds")

Streaming Structured Outputs
Streaming is useful when you want to show the user progressive feedback while a long document is being processed, or when you need to start downstream processing on partial results as they arrive. Structured extraction with streaming requires accumulating the streamed tool call input deltas and parsing the complete JSON only after the stream ends. The Anthropic streaming API emits input_json_delta events for tool calls, which you accumulate into a string buffer before final parsing.
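The accumulate-then-parse step can be tried offline before involving the API. In the sketch below the `fragments` list is a hypothetical stand-in for the partial JSON strings a stream might deliver; the helper names are also made up for the illustration.

```python
import json

# Hypothetical fragments, split at arbitrary points the way stream deltas can be.
fragments = ['{"invoice_number": "INV-20', '26-0042", "total_amo', 'unt": 5050.0}']


def is_complete_json(text: str) -> bool:
    """Return True only if the text parses as a full JSON document."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def accumulate_and_parse(chunks: list[str]) -> dict:
    """Concatenate partial JSON chunks and parse once, after the last chunk."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk  # intermediate buffers are usually not valid JSON
    return json.loads(buffer)


print(is_complete_json(fragments[0]))   # a prefix of the document
print(accumulate_and_parse(fragments))  # the full document
```

Fragment boundaries can fall in the middle of a key or a number, which is why the buffer is parsed only once the stream has finished.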
# streaming_extraction.py
import json

import anthropic

from invoice_schema import InvoiceExtraction

client = anthropic.Anthropic()


def extract_invoice_streaming(document_text: str, tool_def: dict) -> InvoiceExtraction:
    """
    Extract invoice data using the streaming API.
    Accumulates partial JSON deltas and validates on completion.
    """
    accumulated_input = ""
    tool_use_id: str | None = None
    tool_name: str | None = None
    with client.messages.stream(
        model="claude-sonnet-4-6",
        max_tokens=2048,
        tools=[tool_def],
        tool_choice={"type": "tool", "name": "extract_invoice"},
        messages=[
            {
                "role": "user",
                "content": f"Extract all invoice data from this document:\n\n{document_text}",
            }
        ],
    ) as stream:
        for event in stream:
            # Track tool use block start
            if event.type == "content_block_start":
                if hasattr(event.content_block, "type") and event.content_block.type == "tool_use":
                    tool_use_id = event.content_block.id
                    tool_name = event.content_block.name
                    accumulated_input = ""
            # Accumulate partial JSON
            elif event.type == "content_block_delta":
                if hasattr(event.delta, "type") and event.delta.type == "input_json_delta":
                    accumulated_input += event.delta.partial_json
            # Stream complete: final message available
            elif event.type == "message_stop":
                pass  # final_message = stream.get_final_message()
    if not accumulated_input or tool_name != "extract_invoice":
        raise ValueError("Stream did not produce an extract_invoice tool call")
    # Parse and validate the complete accumulated JSON
    raw_data = json.loads(accumulated_input)
    return InvoiceExtraction.model_validate(raw_data)

Type-Safe Output Pipelines
Once you have Pydantic models and forced tool calling in place, the next step is making the extraction function generic so it works with any schema without code duplication. A typed generic function lets mypy and pyright verify that the caller receives the correct type — if you call extract(text, InvoiceExtraction), the type checker knows the return type is InvoiceExtraction, not Any.
# typed_extraction.py
from __future__ import annotations

from typing import Annotated, Literal, TypeVar, Union

import anthropic
from pydantic import BaseModel, Field

T = TypeVar("T", bound=BaseModel)
client = anthropic.Anthropic()


def extract(
    document_text: str,
    schema: type[T],
    tool_name: str = "extract_data",
    tool_description: str = "Extract structured data from the provided text.",
    model: str = "claude-sonnet-4-6",
    max_tokens: int = 2048,
) -> T:
    """
    Generic type-safe extraction function.
    Returns an instance of `schema` validated against the model's tool call output.
    Type-checked: mypy/pyright infers the return type from the `schema` parameter.
    """
    tool_def = {
        "name": tool_name,
        "description": tool_description,
        "input_schema": schema.model_json_schema(),
    }
    response = client.messages.create(
        model=model,
        max_tokens=max_tokens,
        tools=[tool_def],
        tool_choice={"type": "tool", "name": tool_name},
        messages=[{"role": "user", "content": document_text}],
    )
    tool_block = next(
        (b for b in response.content if b.type == "tool_use" and b.name == tool_name),
        None,
    )
    if tool_block is None:
        raise ValueError("Model did not produce the expected tool call")
    return schema.model_validate(tool_block.input)


# ── Discriminated unions for multi-intent classification ─────────────────────
class SupportTicketBug(BaseModel):
    intent: Literal["bug_report"]
    severity: Literal["critical", "high", "medium", "low"]
    component: str
    reproduction_steps: list[str]
    expected_behavior: str
    actual_behavior: str


class SupportTicketFeature(BaseModel):
    intent: Literal["feature_request"]
    title: str
    description: str
    business_value: str
    priority: Literal["high", "medium", "low"]


class SupportTicketQuestion(BaseModel):
    intent: Literal["question"]
    topic: str
    question_text: str
    urgency: Literal["urgent", "normal", "low"]


# Discriminated union: Pydantic uses the `intent` field as the discriminator
SupportTicket = Annotated[
    Union[SupportTicketBug, SupportTicketFeature, SupportTicketQuestion],
    Field(discriminator="intent"),
]


class ClassifiedTicket(BaseModel):
    """Wrapper to enable top-level discriminated union extraction."""

    ticket: SupportTicket


# Usage
ticket = extract(
    "Users are reporting that the export button crashes the app on iOS 17. "
    "This worked in version 2.3.1 but broke in 2.4.0.",
    ClassifiedTicket,
    tool_name="classify_ticket",
    tool_description="Classify a support ticket into bug_report, feature_request, or question.",
)
print(type(ticket.ticket).__name__)  # SupportTicketBug
print(ticket.ticket.severity)  # e.g. "high"

Production Patterns
Structured extraction in production requires operational discipline beyond the extraction function itself. The following patterns address the four most common failure modes in production LLM output pipelines: silent schema drift, untraceable validation failures, inconsistent schema versions across environments, and hard crashes on documents that cannot be parsed.
Validation Logging with model_dump()
Every extraction attempt — success or failure — should be logged as a structured document. Pydantic's model_dump() gives you a serializable dict that can be shipped to your data warehouse, a log aggregator, or an observability platform.
# extraction_logger.py
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Optional

from pydantic import BaseModel, ValidationError


@dataclass
class ExtractionRecord:
    run_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    timestamp: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())
    schema_name: str = ""
    schema_version: str = ""
    model: str = ""
    document_hash: str = ""
    success: bool = False
    extracted_data: Optional[dict[str, Any]] = None
    validation_errors: Optional[list[dict]] = None
    retry_count: int = 0
    input_tokens: int = 0
    output_tokens: int = 0

    def record_success(self, instance: BaseModel, usage) -> None:
        self.success = True
        self.extracted_data = instance.model_dump(mode="json")
        self.input_tokens = usage.input_tokens
        self.output_tokens = usage.output_tokens

    def record_failure(self, exc: ValidationError, retry_count: int) -> None:
        self.success = False
        self.retry_count = retry_count
        self.validation_errors = [
            {"field": list(e["loc"]), "type": e["type"], "message": e["msg"]}
            for e in exc.errors()
        ]

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2, default=str)

Schema Pinning
Pin your Pydantic schema version alongside the model version in every extraction record. When you upgrade to a new model, run both the old and new model in parallel against a golden dataset before switching production traffic. Schema drift — a field the old model reliably extracted that the new model misses — is invisible without this paper trail.
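The parallel run against a golden dataset boils down to a field-by-field comparison. Here is a minimal sketch, assuming extraction outputs have already been collected as plain dicts keyed by a document id; the function name and the sample data are hypothetical.

```python
def field_regressions(
    golden: dict[str, dict],
    old_outputs: dict[str, dict],
    new_outputs: dict[str, dict],
) -> list[tuple[str, str]]:
    """List (doc_id, field) pairs the old model got right and the new one got wrong."""
    regressions = []
    for doc_id, expected in golden.items():
        old = old_outputs.get(doc_id, {})
        new = new_outputs.get(doc_id, {})
        for field_name, expected_value in expected.items():
            old_ok = old.get(field_name) == expected_value
            new_ok = new.get(field_name) == expected_value
            if old_ok and not new_ok:
                regressions.append((doc_id, field_name))
    return regressions


# Hypothetical golden record and model outputs for a single document.
golden = {"doc-1": {"invoice_number": "INV-2026-0042", "po_reference": "PO-2026-0019"}}
old_outputs = {"doc-1": {"invoice_number": "INV-2026-0042", "po_reference": "PO-2026-0019"}}
new_outputs = {"doc-1": {"invoice_number": "INV-2026-0042", "po_reference": None}}

print(field_regressions(golden, old_outputs, new_outputs))
```

Exact equality is a deliberate simplification; amounts and dates may need normalization or a tolerance before comparison.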
Validation Logging
Log every validation failure to a dedicated table, including the raw tool call input that failed validation. This lets you identify systematic model errors — fields the model consistently misnames, types it consistently gets wrong — and use them to improve your schema descriptions or add few-shot examples to your system prompt.
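Once failures are logged, spotting systematic errors is a counting exercise. The sketch below assumes failure records shaped like the validation_errors list written by the extraction logger above (field path, error type, message); the sample log itself is made up.

```python
from collections import Counter


def summarize_failures(failure_log: list[dict]) -> list[tuple[tuple[str, str], int]]:
    """Count validation errors per (field path, error type), most frequent first."""
    counts: Counter = Counter()
    for record in failure_log:
        for error in record.get("validation_errors") or []:
            # Join the location path, e.g. ["line_items", 0, "amount"] -> "line_items.0.amount"
            field_path = ".".join(str(part) for part in error["field"])
            counts[(field_path, error["type"])] += 1
    return counts.most_common()


# Hypothetical failure log with three failed extraction runs.
failure_log = [
    {"run_id": "a", "validation_errors": [
        {"field": ["line_items", 0, "amount"], "type": "float_parsing", "message": "..."},
    ]},
    {"run_id": "b", "validation_errors": [
        {"field": ["line_items", 0, "amount"], "type": "float_parsing", "message": "..."},
        {"field": ["vendor", "name"], "type": "missing", "message": "..."},
    ]},
    {"run_id": "c", "validation_errors": None},
]

for (field_path, error_type), count in summarize_failures(failure_log):
    print(f"{count:>3}  {field_path}  ({error_type})")
```

A field that dominates this summary is a good candidate for a clearer description or a few-shot example.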
Schema Registry
Store schema versions as tagged artifacts alongside your model version registry. When you change a Pydantic schema in a way that is not backward-compatible (removing a required field, changing a type), bump the schema version and treat existing extraction records as schema v1 data. Mixing schema versions in a single table is a data quality bug waiting to happen.
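Deciding when to bump the schema version can be partly automated by diffing two generated JSON Schemas. The sketch below is deliberately narrow: it only inspects top-level properties, does not follow nested definitions, and covers just the two breaking changes named above. The function name and the integer versioning are assumptions for illustration.

```python
def breaking_changes(old_schema: dict, new_schema: dict) -> list[str]:
    """Compare top-level properties of two JSON Schemas and list breaking changes."""
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})
    old_required = set(old_schema.get("required", []))
    changes = []
    for name, spec in old_props.items():
        if name not in new_props:
            # Only a removed *required* field is treated as breaking in this sketch.
            if name in old_required:
                changes.append(f"removed required field: {name}")
        elif spec.get("type") != new_props[name].get("type"):
            changes.append(f"type changed: {name}")
    return changes


def next_version(current: int, changes: list[str]) -> int:
    """Bump the (hypothetical integer) schema version only when something broke."""
    return current + 1 if changes else current


# Hypothetical v1 and candidate schemas, e.g. produced by model_json_schema().
v1 = {
    "properties": {"invoice_number": {"type": "string"}, "total_amount": {"type": "number"}},
    "required": ["invoice_number", "total_amount"],
}
candidate = {
    "properties": {"total_amount": {"type": "string"}},
    "required": ["total_amount"],
}

changes = breaking_changes(v1, candidate)
print(changes)
print("schema version:", next_version(1, changes))
```

A check like this fits naturally in CI, so a breaking schema change cannot ship without a version bump.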
Graceful Degradation
On final validation failure after all retries, do not crash. Return a typed failure result — a dataclass with a success=False flag, the raw model output, and the validation errors — and route the document to a manual review queue. Silent data loss is worse than an explicit failure that can be reviewed and reprocessed.
# graceful_degradation.py
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Generic, Optional, TypeVar

import anthropic
from pydantic import BaseModel, ValidationError

T = TypeVar("T", bound=BaseModel)
MAX_ATTEMPTS = 3


@dataclass
class ExtractionResult(Generic[T]):
    success: bool
    data: Optional[T]
    raw_tool_input: Optional[dict[str, Any]]
    validation_errors: Optional[list[dict]]
    retry_count: int

    @classmethod
    def ok(cls, data: T, raw: dict, retries: int) -> ExtractionResult[T]:
        return cls(success=True, data=data, raw_tool_input=raw, validation_errors=None, retry_count=retries)

    @classmethod
    def fail(cls, raw: Optional[dict], exc: Optional[ValidationError], retries: int) -> ExtractionResult[T]:
        # exc is None when the model never produced a tool call at all
        errors = [{"field": list(e["loc"]), "msg": e["msg"]} for e in exc.errors()] if exc else None
        return cls(success=False, data=None, raw_tool_input=raw, validation_errors=errors, retry_count=retries)


def safe_extract(document_text: str, schema: type[T], tool_def: dict) -> ExtractionResult[T]:
    """Extract with graceful degradation; never raises on validation failure."""
    client = anthropic.Anthropic()
    last_raw: Optional[dict] = None
    last_exc: Optional[ValidationError] = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=2048,
            tools=[tool_def],
            tool_choice={"type": "tool", "name": tool_def["name"]},
            messages=[{"role": "user", "content": document_text}],
        )
        tool_block = next(
            (b for b in response.content if b.type == "tool_use"), None
        )
        if tool_block is None:
            continue
        try:
            instance = schema.model_validate(tool_block.input)
            return ExtractionResult.ok(instance, tool_block.input, attempt)
        except ValidationError as exc:
            # Remember the most recent failure so it is reported after the last attempt
            last_raw, last_exc = tool_block.input, exc
    return ExtractionResult.fail(last_raw, last_exc, MAX_ATTEMPTS)

Further Reading
- Anthropic Tool Use Documentation — tool definition schema, tool_choice parameter, forced tool calling, and multi-turn tool use patterns
- Pydantic v2 Documentation — model_json_schema(), field validators, discriminated unions, and model_dump() serialization
- Anthropic Messages Streaming API — input_json_delta events, streaming tool calls, and partial JSON accumulation
- anthropic-sdk-python on GitHub — Python SDK source, streaming helpers, type stubs, and usage examples