Practical Claude API Patterns: What Works in Production
We've been running the Claude API in production for several months now, powering AI features in client and internal projects. These are patterns that emerged from real usage — not toy examples from a weekend project, but things we learned from handling thousands of generation requests from actual users.
If you're building production features on top of the Claude API, some of this might save you time.
Structured Output with System Prompts
The single most important pattern: tell Claude exactly what format you want. Vague instructions produce vague outputs. If you need JSON with a specific shape, describe that shape precisely in the system prompt.
In our production systems, every AI call includes a system prompt that spells out the exact JSON schema we expect. The model is remarkably good at following this when the instructions are unambiguous.
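import Anthropic from "@anthropic-ai/sdk";
import { z } from "zod";

// The client reads ANTHROPIC_API_KEY from the environment by default
const anthropic = new Anthropic();

// zod schema mirroring the JSON shape described in the system prompt
const outputSchema = z.object({
  title: z.string(),
  sections: z.array(z.object({
    heading: z.string(),
    items: z.array(z.object({
      name: z.string(),
      description: z.string(),
      priority: z.enum(["high", "medium", "low"]),
      metadata: z.record(z.string(), z.string()),
    })),
  })),
});

// userPrompt holds the end user's request text
const response = await anthropic.messages.create({
  model: "claude-sonnet-4-20250514",
  max_tokens: 4096,
  system: `You are a domain expert assistant.
Return ONLY valid JSON matching this schema:
{
  "title": string,
  "sections": [{
    "heading": string,
    "items": [{
      "name": string,
      "description": string,
      "priority": "high" | "medium" | "low",
      "metadata": Record<string, string>
    }]
  }]
}
Do not include markdown fencing or any text outside the JSON.`,
  messages: [{ role: "user", content: userPrompt }],
});

// Take the text of the first content block, if it is a text block
const text = response.content[0].type === "text"
  ? response.content[0].text
  : "";
// Always validate with zod before trusting the shape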
const result = outputSchema.parse(JSON.parse(text));
The key detail: we validate every response with zod before it touches the rest of the system. The model gets it right the vast majority of the time, but "vast majority" is not "always." A zod parse catches malformed responses before they corrupt your data or crash your UI.
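A minimal sketch of a defensive parse step that could sit in front of the schema check. It assumes the most likely deviation is a response wrapped in a markdown fence despite the instruction; `extractJson` and `safeParse` are hypothetical helper names, not part of the SDK or zod, and the parsed value would still go through `outputSchema.parse` afterwards.

```typescript
// Three backticks, built at runtime so the pattern below stays readable
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(
  "^" + FENCE + "(?:json)?\\s*([\\s\\S]*?)\\s*" + FENCE + "$"
);

// Hypothetical helper: strip surrounding whitespace and, if the whole
// response is wrapped in a markdown fence, return only the fenced body.
function extractJson(raw: string): string {
  const trimmed = raw.trim();
  const fenced = trimmed.match(FENCED_BLOCK);
  return fenced?.[1] ?? trimmed;
}

type ParseResult =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

// Parse without throwing, so the caller can decide whether to retry,
// log the raw text, or fall back to an error state.
function safeParse(raw: string): ParseResult {
  try {
    return { ok: true, value: JSON.parse(extractJson(raw)) };
  } catch (err) {
    return { ok: false, error: err instanceof Error ? err.message : String(err) };
  }
}
```

Returning a result object instead of throwing keeps the "valid text but not valid JSON" case separate from transport errors, which makes it easier to count and alert on.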
Error Handling That Actually Works
The Claude API can fail in several ways: rate limits (429), overloaded errors (529), network timeouts, and occasionally responses that are valid text but not valid JSON. You need to handle all of these.
We wrap every Claude call in a retry function with exponential backoff. This handles transient failures without hammering the API.
async function callClaudeWithRetry<T>(
  fn: () => Promise<T>,
  options: { maxRetries?: number; baseDelay?: number } = {}
): Promise<T> {
  const { maxRetries = 3, baseDelay = 1000 } = options;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error: any) {
      // Retry only rate limits (429) and overloaded errors (529).
      // Errors without a matching status (e.g. network timeouts)
      // fall through and are rethrown immediately.
      const isRetryable =
        error?.status === 429 ||
        error?.status === 529 ||
        error?.error?.type === "overloaded_error";
      if (!isRetryable || attempt === maxRetries) {
        throw error;
      }
      // Exponential backoff plus up to 50% random jitter
      const delay = baseDelay * Math.pow(2, attempt);
      const jitter = delay * 0.5 * Math.random();
      await new Promise((r) => setTimeout(r, delay + jitter));
    }
  }
  throw new Error("Retry loop exited unexpectedly");
}
// Usage
const result = await callClaudeWithRetry(() =>
  anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    system: PROMPTS.analyzeDocument.system,
    messages: [{ role: "user", content: prompt }],
  })
);
The jitter is important. If you hit a rate limit, you don't want all your retries firing at the exact same moment. Randomizing the delay spreads the load.
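To make the schedule concrete, the delay calculation can be pulled into a pure function. This is a sketch only: `backoffDelay` is a hypothetical name, and the injectable `rand` parameter exists purely so the bounds can be checked without real randomness.

```typescript
// Same formula as in callClaudeWithRetry: exponential delay plus up to
// 50% jitter. `rand` defaults to Math.random but can be replaced.
function backoffDelay(
  attempt: number,
  baseDelay = 1000,
  rand: () => number = Math.random
): number {
  const delay = baseDelay * Math.pow(2, attempt);
  const jitter = delay * 0.5 * rand();
  return delay + jitter;
}

// Bounds of the wait before each retry with the defaults.
// Math.random() is in [0, 1), so the upper bound is never quite reached.
for (let attempt = 0; attempt < 3; attempt++) {
  const min = backoffDelay(attempt, 1000, () => 0);
  const max = backoffDelay(attempt, 1000, () => 1);
  console.log(`retry ${attempt + 1}: ${min}ms to ${max}ms`);
}
// retry 1: 1000ms to 1500ms
// retry 2: 2000ms to 3000ms
// retry 3: 4000ms to 6000ms
```

With `maxRetries = 3`, a request that keeps failing spends between 7 and 10.5 seconds sleeping before the final error is thrown, on top of the time the requests themselves take. That is worth keeping in mind when setting timeouts on the calling side.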
Streaming for Better UX
A complex generation can take 8–15 seconds. That's a long time to stare at a spinner. Streaming changes the experience completely — users see results being built in real time, which makes the wait feel productive instead of anxious.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function streamGeneration(
  userPrompt: string,
  onChunk: (text: string) => void
): Promise<string> {
  let fullText = "";
  const stream = client.messages.stream({
    model: "claude-sonnet-4-20250514",
    max_tokens: 4096,
    system: PROMPTS.analyzeDocument.system,
    messages: [{ role: "user", content: userPrompt }],
  });
  // Forward each text delta to the caller as it arrives
  stream.on("text", (text) => {
    fullText += text;
    onChunk(text);
  });
  // Wait for the stream to finish before returning the accumulated text
  await stream.finalMessage();
  return fullText;
}
// In an API route, pipe chunks to the client via SSE
// or use a ReadableStream response
On the frontend, we accumulate the streamed text and attempt a JSON parse of the accumulated text after each chunk. Once the text parses as valid JSON, we render the result. Before that point, we show a progress animation with the raw text streaming in. It's a small detail that makes the product feel significantly more responsive.
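A minimal sketch of that frontend accumulation step. `createJsonAccumulator` is a hypothetical name, and the sketch assumes the response is a single JSON document: `JSON.parse` only accepts complete input, so the parsed value appears once the accumulated text forms valid JSON, and until then the caller shows the raw text.

```typescript
type AccumulatorState =
  | { status: "streaming"; rawText: string }
  | { status: "ready"; rawText: string; value: unknown };

// Hypothetical helper: collect streamed chunks and try to parse the
// accumulated text (not the individual chunk) after each one.
function createJsonAccumulator() {
  let rawText = "";
  return {
    push(chunk: string): AccumulatorState {
      rawText += chunk;
      try {
        return { status: "ready", rawText, value: JSON.parse(rawText) };
      } catch {
        // Not valid JSON yet: keep showing the raw text
        return { status: "streaming", rawText };
      }
    },
  };
}

const acc = createJsonAccumulator();
for (const chunk of ['{"title": "Re', 'port", "sections"', ": []}"]) {
  console.log(acc.push(chunk).status);
}
// streaming
// streaming
// ready
```

Re-parsing the whole buffer on every chunk is quadratic in the response length. For responses of a few kilobytes that is unlikely to matter, but for much larger outputs it may be worth parsing only every few chunks.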
Prompt Management
Don't hardcode prompts inline with your application logic. It seems fine when you have one prompt, but you'll end up with several, and you'll want to version them, A/B test them, and roll back when a change makes things worse.
We keep prompts in a dedicated module with version numbers and template functions. Nothing fancy — just enough structure to stay organized.
interface AnalysisVars {
  documentType: string;
  focusAreas: string[];
  outputDepth: "summary" | "detailed" | "comprehensive";
}
const PROMPTS = {
  analyzeDocument: {
    version: "2.1",
    system: `You are a document analysis expert.
Return ONLY valid JSON. Focus on actionable insights.
Flag ambiguities explicitly rather than guessing.`,
    template: (vars: AnalysisVars) =>
      `Analyze this ${vars.documentType} with focus on:
${vars.focusAreas.join(", ")}.
Output depth: ${vars.outputDepth}.`,
  },
};
// When prompt changes break things, you can trace back to
// a specific version and revert
The version number is just a string we log alongside every API call. When a user reports a bad output, we can check which prompt version generated it. This has saved us debugging time more than once.
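One way to make sure the version is never forgotten is to build the request and the log record in the same place. The following is a sketch under assumptions: `PromptDef`, `buildRequest`, the `summarize` entry, and the log field names are all illustrative, not part of the SDK or of the module described above.

```typescript
interface PromptDef<V> {
  version: string;
  system: string;
  template: (vars: V) => string;
}

interface SummaryVars {
  documentType: string;
}

// Illustrative prompt entry with the same fields as the registry above
const summarize: PromptDef<SummaryVars> = {
  version: "1.0",
  system: "Return ONLY valid JSON.",
  template: (vars) => `Summarize this ${vars.documentType}.`,
};

// Hypothetical helper: returns the request fields together with a log
// record carrying the prompt name and version, so they cannot drift apart.
function buildRequest<V>(promptName: string, prompt: PromptDef<V>, vars: V) {
  return {
    request: {
      system: prompt.system,
      messages: [{ role: "user" as const, content: prompt.template(vars) }],
    },
    logRecord: { promptName, promptVersion: prompt.version },
  };
}

const built = buildRequest("summarize", summarize, { documentType: "contract" });
// built.request can be spread into the messages.create() arguments
console.log("[claude-call]", JSON.stringify(built.logRecord));
```

Because the log record is derived from the same object as the request, a prompt edit that forgets to bump `version` is still a risk, but a call that forgets to log the version is not.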
Cost Control
Claude API calls cost real money at scale. A few things we do to keep costs predictable:
- Set max_tokens intentionally. Don't default to 8192 when 2048 is enough for your use case. Profile your actual outputs; most structured responses need far fewer tokens than you'd guess.
- Track token usage per request. Log input and output tokens so you can spot cost anomalies early.
- Cache identical inputs. If two users submit the exact same request within a short window, serve the cached response instead of making another API call.
function trackTokenUsage(
  response: Anthropic.Messages.Message,
  metadata: { userId: string; promptVersion: string }
) {
  const usage = {
    inputTokens: response.usage.input_tokens,
    outputTokens: response.usage.output_tokens,
    model: response.model,
    timestamp: new Date().toISOString(),
    ...metadata,
  };
  // We push this to a DynamoDB table for analysis
  // but console.log works fine to start
  console.log("[token-usage]", JSON.stringify(usage));
  // Alert if a single request exceeds expected bounds
  if (usage.outputTokens > 5000) {
    console.warn(
      `[cost-alert] High token usage: ${usage.outputTokens} output tokens`
    );
  }
}
We review token usage weekly. It's caught a few issues, including a prompt change that accidentally doubled output length because we removed a "be concise" instruction.
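The caching bullet above can be sketched as a small in-memory cache with a time-to-live, keyed on the exact request. Everything here is an assumption for illustration: the `ResponseCache` name, the 60-second window, and using `JSON.stringify` of the request fields as the key. A deployment with multiple instances would need a shared store instead.

```typescript
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

// Hypothetical in-memory cache for identical requests in a short window
class ResponseCache<T> {
  private entries = new Map<string, CacheEntry<T>>();
  private ttlMs: number;
  private now: () => number;

  // `now` is injectable so expiry can be tested without waiting
  constructor(ttlMs = 60000, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Identical model, system prompt, and user prompt give the same key
  static keyFor(model: string, system: string, userPrompt: string): string {
    return JSON.stringify([model, system, userPrompt]);
  }

  get(key: string): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: evict and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A caller would check `cache.get(key)` before calling the API and `cache.set(key, result)` after a validated response. Including the system prompt in the key means a prompt version change naturally invalidates older entries.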
Testing AI Outputs
You can't unit test AI outputs the way you test deterministic code. The same input can produce different outputs across calls. But you can still build a useful test suite with three layers:
- Schema validation. Does the response parse as valid JSON? Does it match the expected zod schema? This catches structural failures and runs in CI like any other test.
- Assertion testing. For a given input, does the output contain the expected number of sections? Are key fields non-empty? Do values fall within reasonable ranges? These are loose but meaningful checks.
- Human review for prompt changes. When we update a prompt version, we generate 10–15 sample outputs across different input variations, then manually review them before deploying. There's no shortcut for this: someone has to read the output and confirm it makes sense.
The combination of automated schema checks and periodic human review gives us enough confidence to ship. We don't aim for perfect — we aim for "the output is always structurally correct and almost always useful."
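The assertion layer can be as simple as a function that returns a list of failures for an already-parsed output. This is a sketch, assuming the `title`/`sections`/`items` shape from the structured-output section; `checkOutput` and the specific thresholds (at least one section, at most ten, no empty names) are placeholders to adapt per prompt.

```typescript
interface Item {
  name: string;
  description: string;
  priority: "high" | "medium" | "low";
}
interface Section {
  heading: string;
  items: Item[];
}
interface Output {
  title: string;
  sections: Section[];
}

// Loose, deterministic checks that should hold for any acceptable output.
// Returns readable failure messages; an empty array means it passed.
function checkOutput(output: Output, maxSections = 10): string[] {
  const failures: string[] = [];
  if (output.title.trim() === "") failures.push("title is empty");
  if (output.sections.length === 0) failures.push("no sections");
  if (output.sections.length > maxSections) {
    failures.push(`too many sections: ${output.sections.length}`);
  }
  output.sections.forEach((section, i) => {
    if (section.items.length === 0) {
      failures.push(`section ${i} has no items`);
    }
    section.items.forEach((item, j) => {
      if (item.name.trim() === "") {
        failures.push(`section ${i} item ${j} has an empty name`);
      }
    });
  });
  return failures;
}
```

Returning every failure rather than stopping at the first makes the results more useful when reviewing a batch of sample outputs after a prompt change.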
Wrapping Up
None of these patterns are groundbreaking on their own. Structured outputs, retries, streaming, prompt versioning, cost tracking, assertion tests — they're all standard engineering practices adapted for a non-deterministic API. The value is in applying them consistently from the start, rather than bolting them on after something breaks in production.
If you're building on the Claude API and want to compare notes, feel free to reach out. We're still iterating on all of this, and the patterns keep evolving as the API and models improve.