Systems·April 21, 2026·10 min

Chat Completions API: What I Learned Running It in Production

The Chat Completions API looks simple until you add tools, memory, and real users. Here's what I changed to make it hold up under production load.

The easy version is easy. A prompt goes in, text comes out, the demo passes.

Then real requirements land. The assistant needs memory. It needs to call tools. It has to stream in the UI, survive bad inputs, stay inside a budget, and keep working when users push it past the test cases. That's where the Chat Completions API stops being a simple text endpoint and starts acting like infrastructure.

This post is about what I changed to make it hold up. The focus is durability: multi-turn context, tool orchestration, token discipline, and failure handling that doesn't collapse when traffic scales or prompts get messy.

Why I still start with Chat Completions

You ship a first version, usage climbs, and the weak points show up fast. Context gets messy, prompts drift, token spend climbs, and every new tool call creates another place for the system to fail.

Chat Completions is still the right starting point because it gives you a controlled interface before you take on heavier orchestration. The API works on a list of messages instead of one growing prompt string, so conversation state has a clear structure from the start. That structure maps well to support assistants, internal copilots, and agent loops where the model needs to respond based on prior turns without hiding state inside brittle prompt templates.

The main benefit: message history is a better production primitive than prompt concatenation. Roles are explicit. Turns are inspectable. Failures are easier to trace because you can see exactly what the model received, what it returned, and which tool or policy decision went wrong.

In my experience, the teams that get stable behavior fastest are usually the ones that keep the first loop boring: append messages correctly, validate outputs, log token usage, and make retries deterministic. Teams that skip that work and jump straight into multi-agent planners usually end up debugging state corruption, runaway token growth, and impossible-to-reproduce tool errors.

What this API solves well

Chat Completions fits when you need structured conversation state with clear separation between instructions, user input, and assistant output. It gives you a stable request shape that works across support flows, content operations, copilots, and agent controllers. The full conversation payload can be logged, inspected, trimmed, and replayed, which makes debugging much more tractable than with a single prompt string.

This is also where cost discipline starts. If the base conversation loop is sloppy, every advanced feature gets more expensive. Long transcripts inflate tokens. Poor summarization strategy causes repeated context. Tool calls without guardrails create waste and latency at the same time.

If your single-agent loop is unstable, adding planning or more agents usually increases cost and failure rate instead of improving results.

Where it fits in my stack

In production, I treat Chat Completions as the control surface for a few common patterns:

Chat assistant. The API generates context-aware replies. My app owns session state, auth, and the UI.

Workflow agent. The API decides the next response or tool call. My app owns tool execution, retries, and validation.

Knowledge assistant. The API answers using prior turns and supplied context. My app owns retrieval, summarization, and context trimming.

The main trade-off is clear. Chat Completions gives you control, and your application owns state management. For serious products, that's usually the right trade. State ownership is where reliability, privacy, cost control, and product quality are decided.

The first API call

A production bug often starts with a request that looked fine in a quick test. The model returned text, the demo passed, and nobody checked token usage, finish reasons, or what the response shape would look like once tools and retries entered the picture.

Your first call should prove more than connectivity. It should confirm four things: credentials are loaded correctly, the model returns the shape your application expects, usage is visible in logs, and the request body is simple enough to debug under pressure.

Python example

This is the shape I recommend for a first non-streaming request:

from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Explain what the chat completions api does."}
    ]
)

print(response.choices[0].message.content)
print(response.usage)

This request is intentionally plain. No streaming, no tools, no structured output, no custom retry wrapper. Early on, simplicity beats feature coverage because you need a clean baseline before you add the parts that fail in less obvious ways.

Two fields deserve attention right away. choices[0].message.content is usually the text you render or pass downstream. usage is the first line item in your cost model, so log it on day one, not after the first surprise bill.

Node.js example

import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const response = await client.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    { role: "system", content: "You are a concise assistant." },
    { role: "user", content: "Explain what the chat completions api does." }
  ]
});

console.log(response.choices[0].message.content);
console.log(response.usage);

The useful part is not syntax parity. The request shape stays stable across services. That makes it easier to keep behavior consistent when one team is shipping a backend worker in Python and another is wiring a Node service that handles UI traffic.

What actually matters in the request and response

A first call only needs a few fields, but each one has an operational consequence:

model chooses the capability, latency profile, and price point you are buying.
messages defines the conversation input. Keep it short at first so failures are easy to inspect.
system (a role inside messages) sets broad behavior. Make it specific enough to constrain tone and task, but avoid a wall of policy text in the first test.
user (a role inside messages) carries the immediate instruction or question.
choices contains the generated result.
finish_reason tells you how generation stopped. Check it before assuming the output is complete.
usage shows token consumption and belongs in logs, metrics, and cost alerts.

I also log the raw response at least once during setup. That single habit catches a lot of bad assumptions early.

Mistakes that break later

The common first-call mistakes are boring, and they still cause expensive cleanup work:

Hardcoded API keys instead of environment variables or a secret manager.
One giant prompt string with no role separation.
No usage logging, which hides token growth until traffic scales.
No check on finish_reason, which lets truncated outputs slip into downstream systems.
Treating free-form text as a stable schema, then discovering later that parsing fails on edge cases.

I treat the first request as the contract test for the rest of the system. If that contract is loose, every agentic feature added later gets harder to trust. Start with one clean round-trip, inspect the full response, and make sure the request is easy to replay before you add concurrency, streaming, or tool execution.
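As a rough illustration of that contract test, here is a minimal sketch of a response check. The helper name check_completion is hypothetical, and the stand-in object only mimics the attribute layout of a non-streaming response so the example runs without network access.

```python
from types import SimpleNamespace


def check_completion(response):
    """Return (text, usage), or raise if the reply should not go downstream."""
    choice = response.choices[0]
    # "stop" means the model ended on its own; "length" means the output
    # hit the token limit and is probably truncated.
    if choice.finish_reason == "length":
        raise ValueError("output truncated: finish_reason=length")
    if choice.finish_reason != "stop":
        raise ValueError(f"unexpected finish_reason={choice.finish_reason}")
    return choice.message.content, response.usage


def fake_response(finish_reason, content="hello"):
    """Stand-in for an API response (hypothetical, for local testing only)."""
    return SimpleNamespace(
        choices=[SimpleNamespace(
            finish_reason=finish_reason,
            message=SimpleNamespace(content=content),
        )],
        usage=SimpleNamespace(prompt_tokens=12, completion_tokens=3, total_tokens=15),
    )


text, usage = check_completion(fake_response("stop"))
print(text, usage.total_tokens)
```

Treating anything other than "stop" as an error is deliberately strict for a plain text request. Once tools are enabled, a finish reason of "tool_calls" becomes a normal outcome and needs its own branch.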

Context and roles

A chat feature usually looks fine in staging. Then a real user returns to the same thread five times, asks for revisions, references an earlier answer, and the assistant starts contradicting itself. That's the point where prompt quality stops being the main concern. State management becomes the job.

Chat Completions gives you a clean structure for that job. Roles separate long-lived instructions, the current user request, and prior assistant output. Keep those concerns separated and debugging stays manageable. Collapse them into one prompt string and every later fix gets harder.

Use roles for control, not decoration

A conversation loop should start with explicit role boundaries:

conversation = [
    {"role": "system", "content": "You are a helpful assistant that answers in plain English."},
    {"role": "user", "content": "Summarize this product spec."}
]

After the model replies, append the assistant message back into the same list before the next turn. That preserves the visible record of what the assistant already committed to.

response = client.chat.completions.create(
    model="gpt-4o",
    messages=conversation,
    temperature=0.7,
    max_tokens=1024
)

conversation.append({
    "role": "assistant",
    "content": response.choices[0].message.content
})

Each role has a different operational purpose. System holds durable rules, tone, and guardrails. User carries the current task, correction, or follow-up. Assistant records prior answers that future turns may depend on.

The failure mode is predictable. Teams stuff formatting rules, business logic, retrieval results, and user intent into a single user message because it's easy to ship quickly. It works for a prototype. In production, it makes drift harder to diagnose because you can't tell whether the model ignored policy, misunderstood intent, or got buried under too much context.

Context growth is the real problem

Conversation history grows on every turn. If nothing gets trimmed, every old answer competes with the current request for attention and tokens.

The first symptom is usually subtle. Answers get wordier, constraints get dropped, and the assistant starts favoring stale details from earlier in the thread. After that, truncation and incomplete replies show up. In agentic systems, this gets worse because tool outputs, retrieved documents, and intermediate notes all consume context too.

I treat context growth as a production issue, not a prompt issue. Durable systems need a memory policy before they need a fancier prompt.

Long chat histories rarely break all at once. They drift first, then they truncate.

What works in production

A practical policy has three parts.

First, keep the current turn intact. Preserve the latest user message and the most recent assistant reply. Cutting the active exchange is the fastest way to get incoherent follow-ups.

Second, keep durable instructions separate. System rules should stay stable and easy to inspect. Don't bury them inside a running transcript.

Third, compress older history on purpose. Replace stale turns with a short summary that keeps decisions, constraints, open questions, and any facts the assistant must not lose.

This is the pattern I trust for long-running threads:

Keep the system message
Keep the last few turns in full
Summarize older turns into one compact memory block
Drop low-value chatter, repeated acknowledgments, and verbose tool output
Refuse oversized requests or compress them before they hit the model limit

That last point matters for cost as much as reliability. Raw transcript replay is expensive, and it gets worse as usage scales. If a thread can live for days or weeks, summarization is not an optimization. It's basic hygiene.
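The trimming policy above can be sketched in a few lines. This is an illustration under assumptions, not the exact code I run: trim_history and default_summarize are hypothetical names, and the placeholder summarizer just truncates text where a real system would call a model or a rules-based compressor.

```python
def default_summarize(older):
    """Placeholder summarizer: keeps the first 80 characters of each message."""
    return " | ".join((m.get("content") or "")[:80] for m in older)


def trim_history(messages, keep_recent=4, summarize=default_summarize):
    """Keep system messages, one summary block, and the last few messages."""
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if len(rest) <= keep_recent:
        return system + rest  # nothing old enough to compress
    older, recent = rest[:-keep_recent], rest[-keep_recent:]
    memory = {
        "role": "system",
        "content": "Summary of earlier turns: " + summarize(older),
    }
    return system + [memory] + recent


history = [{"role": "system", "content": "Be brief."}] + [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"turn {i}"}
    for i in range(6)
]
for message in trim_history(history, keep_recent=2):
    print(message["role"], "->", message["content"])
```

One caveat: this sketch cuts purely by message count. If the thread contains tool calls, the cut point should not separate an assistant tool request from its tool result, so a production version needs to trim on turn boundaries.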

A better memory pattern

For assistants connected to inboxes, docs, tickets, or internal tools, replaying the full transcript is a weak memory strategy. Use layered memory instead.

System layer: stable rules, output style, safety constraints.
Working memory: recent turns and active task state.
Compressed memory: summaries of prior decisions and facts.
External memory: files, notes, or retrieved records outside the prompt.

This structure gives you sharper trade-offs. Working memory stays small and relevant. Compressed memory preserves continuity without dragging the full transcript forward. External memory keeps large documents out of the prompt until they're needed.
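To make the layering concrete, here is one way the layers might be assembled into a request. The function build_messages and its parameter names are assumptions for illustration, and placing summaries and retrieved text in extra system messages is one option among several.

```python
def build_messages(system_rules, compressed_memory, working_memory, retrieved=None):
    """Assemble the per-turn message list from the memory layers."""
    messages = [{"role": "system", "content": system_rules}]
    if compressed_memory:
        # Compressed memory: a short summary instead of the old transcript.
        messages.append({
            "role": "system",
            "content": "Earlier context: " + compressed_memory,
        })
    if retrieved:
        # External memory: only the records fetched for this turn.
        messages.append({
            "role": "system",
            "content": "Reference material:\n" + "\n\n".join(retrieved),
        })
    # Working memory: recent turns, passed through unchanged.
    messages.extend(working_memory)
    return messages


turn = build_messages(
    "You answer in plain English.",
    "User is drafting a launch newsletter and prefers bullet lists.",
    [{"role": "user", "content": "Add a section on pricing."}],
    retrieved=["Pricing note: plans start in May."],
)
print([m["role"] for m in turn])
```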

The reliable part of a conversational product is usually not the model output. It's the discipline around what the model sees on each turn. That's what keeps an assistant usable after the tenth message, after tool calls start, and after real users push it far past the happy path.

Streaming, errors, and rate limits

Friday at 4:57 PM, your chat UI is streaming answers fine in staging. At 5:06 PM, support starts reporting frozen replies, duplicated actions, and bursts of 429s after a traffic spike. That's the real test for a Chat Completions integration. Not whether it can answer. Whether it can stay usable under pressure.

Streaming changes the failure mode users see first. With a normal response, they wait and then get either a full answer or an error. With streaming, they see partial output, then maybe a stall, a dropped connection, or a truncated reply. That feels faster when it works. It also means your frontend and backend both need stronger state handling.

Streaming versus non-streaming

Non-streaming is still the right choice for a lot of production work. Batch classification, offline enrichment, scheduled jobs, and webhook pipelines are usually easier to run and easier to debug without a live stream.

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages
)

text = response.choices[0].message.content

Streaming earns its keep in chat, drafting, and copilot-style interfaces where users benefit from early tokens.

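stream = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    stream=True,
    # Ask for a final chunk that carries token usage.
    stream_options={"include_usage": True}
)

for chunk in stream:
    # The usage-only final chunk has an empty choices list.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)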

The production trade-off is simple. Streaming improves perceived latency, but it increases coordination work. You need to handle partial text, client disconnects, timeouts mid-generation, and the case where the model finishes but your UI never receives the tail end of the stream.

One implementation detail matters for observability. Streaming can return usage information at the end of the response, but only when you opt in: pass stream_options={"include_usage": True} and the final chunk carries the usage totals with an empty choices list. That makes token accounting much easier than older stream handling patterns. Capture that final chunk and store it with the request log. If you skip it, your cost reporting will drift.
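A small sketch of that capture step, assuming chunks shaped like the ones the Python SDK yields (choices, delta.content, finish_reason, usage). The names consume_stream and fake_chunk are hypothetical, and the fake chunks exist only so the example runs offline.

```python
from types import SimpleNamespace


def consume_stream(stream):
    """Collect streamed text, the finish reason, and the usage record if any."""
    parts, finish_reason, usage = [], None, None
    for chunk in stream:
        # When usage reporting is enabled, it arrives on a chunk of its own.
        if getattr(chunk, "usage", None) is not None:
            usage = chunk.usage
        if not chunk.choices:
            continue  # usage-only chunk: nothing to render
        choice = chunk.choices[0]
        if choice.delta.content:
            parts.append(choice.delta.content)
        if choice.finish_reason:
            finish_reason = choice.finish_reason
    return "".join(parts), finish_reason, usage


def fake_chunk(content=None, finish=None, usage=None, empty=False):
    """Stand-in for a streamed chunk (hypothetical, for local testing only)."""
    choices = [] if empty else [
        SimpleNamespace(delta=SimpleNamespace(content=content), finish_reason=finish)
    ]
    return SimpleNamespace(choices=choices, usage=usage)


totals = SimpleNamespace(prompt_tokens=9, completion_tokens=2, total_tokens=11)
demo = [
    fake_chunk("Hel"),
    fake_chunk("lo"),
    fake_chunk(finish="stop"),
    fake_chunk(empty=True, usage=totals),
]
text, finish, usage = consume_stream(demo)
print(text, finish, usage.total_tokens)
```

If the connection drops mid-stream, finish_reason and usage both come back as None here, which is a useful signal to log as an incomplete response rather than a finished one.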

Error handling needs categories

A single retry rule is one of the fastest ways to waste money and hide real bugs.

Split failures by class:

Invalid request errors mean your payload, schema, or parameters are wrong. Fix the caller.
Authentication and permission errors mean your credentials, project setup, or environment wiring are broken.
Rate limit errors mean the request was valid but rejected because you exceeded allowed throughput.
Server-side errors and transport failures are the cases where retries make sense.

In practice, plenty of teams still retry 400-class request mistakes in a loop, especially after adding tool calls or structured outputs. I've seen malformed JSON schema definitions burn through a rate budget faster than any real user traffic.
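One way to encode those classes is a small mapping from HTTP status to a class name and a retry decision. This is a sketch under assumptions: classify_error is a hypothetical helper, the class labels are mine, and a status of None stands in for a transport failure where no response arrived.

```python
def classify_error(status_code):
    """Map an HTTP status (None = transport failure) to (class, retryable)."""
    if status_code is None:
        return ("transport", True)
    if status_code in (401, 403):
        return ("auth", False)
    if status_code == 429:
        # Valid request, rejected for throughput: retry, but only with backoff.
        return ("rate_limit", True)
    if 400 <= status_code < 500:
        # Bad payload, schema, or parameters: retrying cannot succeed.
        return ("invalid_request", False)
    if status_code >= 500:
        return ("server", True)
    return ("unknown", False)


for code in (400, 401, 429, 503, None):
    print(code, classify_error(code))
```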

If you're deploying across providers, don't assume rate limits work the same way everywhere. Some platforms publish fixed request and token budgets. Others tie throughput to deployment configuration or purchased capacity. Build your client so limits are configurable per environment, model, and tenant. Hardcoding one global retry policy is how multi-provider rollouts fail.

A resilient retry loop

Good retry logic has five parts: retry only errors that can succeed on a second attempt, use exponential backoff, add jitter, set a hard attempt cap, and log the error class and request metadata on every failure.
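Those five parts fit in a short wrapper. The sketch below is illustrative: RetryableError is a hypothetical marker exception that your own client code would raise for rate limits, server errors, and transport failures, and the sleep and rng parameters are injectable only to make the function easy to test.

```python
import logging
import random
import time

log = logging.getLogger("llm.retry")


class RetryableError(Exception):
    """Hypothetical marker for failures that can succeed on a later attempt."""


def call_with_retry(fn, max_attempts=4, base_delay=0.5, max_delay=8.0,
                    meta=None, sleep=time.sleep, rng=random.random):
    """Run fn(), retrying only RetryableError with capped, jittered backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError as exc:
            # Log the error class and request metadata on every failure.
            log.warning("attempt=%d error=%s meta=%s",
                        attempt, type(exc).__name__, meta)
            if attempt == max_attempts:
                raise  # hard attempt cap reached
            # Exponential backoff, capped, scaled by random jitter in [0, 1).
            delay = min(max_delay, base_delay * 2 ** (attempt - 1)) * rng()
            sleep(delay)


attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RetryableError("temporary failure")
    return "ok"


print(call_with_retry(flaky, base_delay=0.01, meta={"feature": "demo"}))
```

Any exception that is not a RetryableError propagates on the first attempt, which is the point: a bad payload should fail loudly once, not four times.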

The missing sixth part is admission control. If your queue is already backing up, retries can turn a brief provider slowdown into a full outage on your side. Put a concurrency cap in front of the API and shed low-priority traffic before everything stalls.
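A concurrency cap with priority shedding can be as simple as a counter behind a lock. This is a minimal in-process sketch: AdmissionGate is a hypothetical class, and a multi-instance deployment would need a shared limiter rather than per-process state.

```python
import threading


class AdmissionGate:
    """Cap in-flight calls and reject low-priority work first."""

    def __init__(self, max_in_flight, reserved_for_high=1):
        self._lock = threading.Lock()
        self._in_flight = 0
        self._max = max_in_flight
        self._reserved = reserved_for_high

    def try_acquire(self, high_priority=False):
        """Return True if the call may proceed, False if it should be shed."""
        with self._lock:
            # Low-priority traffic cannot use the reserved slots.
            limit = self._max if high_priority else self._max - self._reserved
            if self._in_flight >= limit:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Free a slot after the call finishes or fails."""
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)


gate = AdmissionGate(max_in_flight=2, reserved_for_high=1)
print(gate.try_acquire())                    # low priority takes the open slot
print(gate.try_acquire())                    # low priority is shed
print(gate.try_acquire(high_priority=True))  # high priority uses the reserve
```

Callers should pair every successful try_acquire with a release in a finally block, so a failed request does not leak a slot.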

Retrying bad requests looks safe in dashboards because traffic stays high, but it burns tokens, floods logs, and delays the fix.

Idempotency matters just as much as retries. If a model response can trigger an email, create a ticket, or update a record, your application needs an idempotency key or a dedupe layer before executing the side effect. A timeout after the model decides to act doesn't mean the action never happened.
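Here is a minimal sketch of a dedupe layer for side effects. The class SideEffectLedger, the key fields, and the in-memory dict are all assumptions for illustration; a real deployment would keep the ledger in a durable store with an atomic insert.

```python
import hashlib
import json


class SideEffectLedger:
    """Run each side effect at most once per idempotency key."""

    def __init__(self):
        self._done = {}  # in-memory only; swap for a durable store in practice

    @staticmethod
    def key_for(conversation_id, turn_id, tool_name, arguments):
        """Derive a stable key from what the action is, not when it was tried."""
        payload = json.dumps(
            [conversation_id, turn_id, tool_name, arguments], sort_keys=True
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def run_once(self, key, action):
        """Execute action() the first time a key is seen; replay the result after."""
        if key in self._done:
            return self._done[key]
        result = action()
        self._done[key] = result
        return result


ledger = SideEffectLedger()
sent = []
key = ledger.key_for("conv-1", 7, "send_email", {"to": "a@example.com"})
for _ in range(3):  # simulate a retry path calling the same action again
    ledger.run_once(key, lambda: sent.append("email") or "sent")
print(len(sent))
```

Recording the key after the action runs still leaves a small window where a crash can cause a duplicate. Closing that window usually means reserving the key first and marking it complete afterward.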

What to watch in logs

Logs need to answer two questions fast. Was the failure caused by your application, or by the provider? Did the user see partial output before it failed?

Capture these fields together: model name, request ID if available, whether stream was enabled, time to first token, time to last token or disconnect, finish reason, usage data, retry count, and high-level error class.
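One possible shape for that record is a small dataclass serialized to JSON. The class name CallLog and the exact field names are assumptions; the point is that all of these values land in one log line rather than being scattered across several.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class CallLog:
    """One log record per model call, streaming or not."""
    model: str
    stream: bool
    request_id: Optional[str] = None
    time_to_first_token_ms: Optional[float] = None
    time_to_last_token_ms: Optional[float] = None  # or time of disconnect
    finish_reason: Optional[str] = None
    prompt_tokens: Optional[int] = None
    completion_tokens: Optional[int] = None
    retry_count: int = 0
    error_class: Optional[str] = None

    def to_json(self):
        """Serialize with stable key order so log lines diff cleanly."""
        return json.dumps(asdict(self), sort_keys=True)


record = CallLog(
    model="gpt-4o-mini",
    stream=True,
    time_to_first_token_ms=412.0,
    finish_reason=None,          # None plus an error class means it was cut off
    retry_count=1,
    error_class="transport",
)
print(record.to_json())
```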

For agentic systems, add one more layer. Log the tool decision, the tool execution result, and whether the final model turn completed after the tool returned. In my setup, that split has been the difference between blaming the model and finding the issue in a slow internal service or a duplicate action handler.

The goal is not perfect uptime. The goal is controlled failure, predictable cost, and enough telemetry to fix the problem before users stop trusting the system.

Tool use and agent workflows

A user asks for a newsletter draft. The model has enough context to sound convincing, but not enough to pull the right research notes, check project status, or save the result in the right place. That gap is where tool use starts to matter. In production, Chat Completions works best as a planner that can request actions, while your application stays in charge of execution, validation, and side effects.

In my setup, this is usually the point where simple prompt engineering stops carrying the system. Once an agent can search, read, and write through tools, the quality of the workflow depends less on the model's wording and more on the contracts around each tool. Poor contracts create expensive bugs. Good contracts make the agent predictable enough to ship.

A real agent loop

Take a content drafting agent connected to a PARA-style notes vault. A durable loop looks like this:

The user asks for a draft.
The model decides it needs supporting material.
It returns a tool call for vault search.
Your application runs that search outside the model.
You append the tool result back into the conversation.
The model writes the draft using the returned material, or asks for another tool if the first result wasn't enough.

That last step matters. Tool use is often recursive. A search result may lead to a follow-up call for a specific note, a task record, or a calendar item. If your app assumes one tool call means the workflow is done, the agent stalls in the middle of its job.
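The loop above can be sketched without any network calls. In this illustration, call_model is a hypothetical callable that returns an assistant message as a plain dict in the Chat Completions message shape, the two tools are stubs, and validation is deliberately left out to keep the control flow visible.

```python
import json


def run_agent(call_model, tools, messages, max_rounds=5):
    """Loop until the model answers without requesting a tool."""
    for _ in range(max_rounds):
        reply = call_model(messages)
        messages.append(reply)  # keep the assistant turn, tool calls included
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return reply["content"]
        for call in tool_calls:
            name = call["function"]["name"]
            args = json.loads(call["function"]["arguments"])
            result = tools[name](**args)  # executed by the app, not the model
            messages.append({
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError("tool loop exceeded max_rounds")


def tool_request(call_id, name, arguments):
    """Build an assistant message that asks for one tool call."""
    return {"role": "assistant", "content": None, "tool_calls": [{
        "id": call_id, "type": "function",
        "function": {"name": name, "arguments": json.dumps(arguments)},
    }]}


def scripted_model():
    """Fake model: search, then fetch a note, then write the draft."""
    replies = iter([
        tool_request("call_1", "search_vault", {"query": "launch"}),
        tool_request("call_2", "get_note", {"note_id": "n1"}),
        {"role": "assistant", "content": "Draft based on note n1."},
    ])
    return lambda messages: next(replies)


tools = {
    "search_vault": lambda query: [{"id": "n1", "title": "Launch plan"}],
    "get_note": lambda note_id: {"id": note_id, "body": "Ship in May."},
}
conversation = [{"role": "user", "content": "Draft a newsletter about the launch."}]
print(run_agent(scripted_model(), tools, conversation))
```

The scripted model makes two tool requests in a row before answering, which is exactly the case a single-call assumption would break on.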

Structured outputs beat regex

If another service needs to consume the output, return structured data and validate it. Don't scrape free-form prose with regex unless you're willing to debug edge cases forever.

Use a defined schema for tool arguments and machine-readable responses when the model supports it. In practice, that means narrower fields, fewer optional properties, and explicit enums where possible. The trade-off is reduced flexibility. The gain is that downstream code can reject bad payloads early instead of letting malformed arguments leak into search, storage, or external APIs.

A good default: let the model write natural language for user-facing text. Require structured JSON for anything your application will execute, store, or pass to another service.

The tool pattern I use

The flow is straightforward, but the failure modes are not.

Plan: the model decides whether a tool is needed; your app provides available tools.
Call: the model emits tool name and arguments; your app validates and executes.
Return: the model reads tool output; your app appends the result into messages.
Respond: the model produces the final answer or the next tool call; your app loops until completion.

The application side needs stricter rules than the model side. Validate the tool name against an allowlist. Validate arguments against a schema. Apply authorization checks before execution, not after. Normalize tool results into a format the model can consume without extra guessing.

That pattern sounds obvious. It breaks in small ways. A model calls search_notes with an empty query. A retrieval tool returns 200 results and blows up token usage. A save tool gets called twice after a retry path. Durable agents come from handling those cases on purpose.
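A minimal validation gate for those cases might look like this. The tool names and field specs are hypothetical, and the hand-rolled type check stands in for whatever schema validator you already use.

```python
import json

# Allowlist: tool name -> required fields and their types (hypothetical tools).
TOOL_SPECS = {
    "search_notes": {"query": str},
    "save_draft": {"title": str, "body": str},
}


def validate_tool_call(name, raw_arguments):
    """Return parsed arguments, or raise ValueError before anything executes."""
    if name not in TOOL_SPECS:
        raise ValueError(f"tool not allowed: {name}")
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError as exc:
        raise ValueError(f"arguments are not valid JSON: {exc}") from exc
    if not isinstance(args, dict):
        raise ValueError("arguments must be a JSON object")
    spec = TOOL_SPECS[name]
    if set(args) != set(spec):
        raise ValueError(f"expected fields {sorted(spec)}, got {sorted(args)}")
    for field, expected_type in spec.items():
        value = args[field]
        if not isinstance(value, expected_type):
            raise ValueError(f"{field} must be {expected_type.__name__}")
        if expected_type is str and not value.strip():
            raise ValueError(f"{field} must not be empty")  # e.g. empty query
    return args


print(validate_tool_call("search_notes", '{"query": "launch plan"}'))
```

Authorization is a separate check and is not shown here. It belongs after this gate and before execution, keyed on the user rather than on the model's request.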

Where teams go wrong

The common failures are usually operational:

Too many tools exposed at once. Tool selection gets worse as overlap increases.
Weak schemas. Ambiguous arguments force your app to guess.
No validation layer. Bad calls reach real systems.
Tool results that are too large. Retrieval succeeds, then context size and cost spike.
No permission boundary. The model can ask for actions the user shouldn't trigger.
Single-turn assumptions. The workflow stops after the first tool response instead of continuing until completion.

The fix is usually to reduce surface area. One tool should do one job. "Search vault," "get note," and "save draft" are good boundaries. "Manage content operations" is how teams end up with giant schemas, vague arguments, and hard-to-reproduce failures.

Production patterns that hold up

A few patterns consistently make agentic systems easier to run:

Separate read tools from write tools. Read paths can be more permissive. Write paths need tighter checks, stronger audit logs, and explicit user intent.

Return compact tool payloads. Send summaries, IDs, and the top few relevant fields. Fetch full records only when needed.

Annotate every tool result with status. Success, empty result, partial result, and permission denied should be distinct states.

Cap tool loops. Set a maximum number of tool rounds so the agent can't wander through expensive recursion.

Keep human-readable explanations out of tool arguments. Arguments should be plain data, not mini-prompts.
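Two of those patterns, compact payloads and explicit status, combine naturally in the function that shapes a tool result before it goes back to the model. The names Status and compact_search_result, and the choice of id and title as the kept fields, are assumptions for illustration.

```python
from enum import Enum


class Status(str, Enum):
    SUCCESS = "success"
    EMPTY = "empty"
    PARTIAL = "partial"
    DENIED = "permission_denied"


def compact_search_result(records, allowed=True, limit=3, fields=("id", "title")):
    """Shape raw search hits into a small, status-tagged payload."""
    if not allowed:
        return {"status": Status.DENIED.value, "items": []}
    if not records:
        return {"status": Status.EMPTY.value, "items": []}
    # Keep only the top few hits and only the fields the model needs.
    items = [{f: r.get(f) for f in fields} for r in records[:limit]]
    status = Status.PARTIAL if len(records) > limit else Status.SUCCESS
    return {"status": status.value, "items": items, "total": len(records)}


hits = [{"id": i, "title": f"Note {i}", "body": "x" * 1000} for i in range(200)]
print(compact_search_result(hits))
```

The model sees three short items and a total count instead of 200 full notes, and it can tell the difference between "nothing matched" and "you are not allowed to look".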

On Azure-hosted deployments, configuration can trip teams up. The model argument usually carries a deployment name rather than the public model name, and each environment has its own endpoint or base_url. Both affect how you build wrappers across environments, so keep that mapping explicit in config rather than scattering it through application code.

The practical goal is not to make the model autonomous. It's to build an agent loop that can search, decide, act, and recover without turning your application into a pile of special cases.

Cost and performance

Once an agent works, the next problem is usually economics.

Chat Completions gives you enough telemetry to run it like a system instead of a black box. Every non-streaming response includes usage data, and streamed responses can include it when you opt in. If you aren't capturing it, you're making cost decisions blind.

Start with usage, not intuition

The easiest waste pattern is prompt bloat. Teams obsess over output length and ignore the huge amount of text they keep sending in every request. System prompts grow. History grows. Retrieved context grows. Then someone asks why the monthly bill climbed.

The fix is operational, not philosophical. Log prompt_tokens, completion_tokens, and total_tokens. Store usage by feature, not just by model. Compare high-cost routes against actual business value. Review long prompts like code, because they are code.
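Storing usage by feature does not need much machinery to start. The sketch below keeps running totals in memory; UsageLedger is a hypothetical class, and the usage objects are stand-ins with the same three token fields the API reports.

```python
from collections import defaultdict
from types import SimpleNamespace


class UsageLedger:
    """Accumulate token counts per (feature, model) pair."""

    FIELDS = ("prompt_tokens", "completion_tokens", "total_tokens")

    def __init__(self):
        self._rows = defaultdict(self._empty_row)

    @classmethod
    def _empty_row(cls):
        row = {field: 0 for field in cls.FIELDS}
        row["calls"] = 0
        return row

    def record(self, feature, model, usage):
        """Add one response's usage to the running totals."""
        row = self._rows[(feature, model)]
        for field in self.FIELDS:
            row[field] += getattr(usage, field)
        row["calls"] += 1

    def top_routes(self, n=3):
        """Return the n most expensive (feature, model) routes by total tokens."""
        ranked = sorted(self._rows.items(),
                        key=lambda kv: kv[1]["total_tokens"], reverse=True)
        return [(key, row["total_tokens"]) for key, row in ranked[:n]]


ledger = UsageLedger()
ledger.record("drafting", "gpt-4o", SimpleNamespace(
    prompt_tokens=3000, completion_tokens=500, total_tokens=3500))
ledger.record("routing", "gpt-4o-mini", SimpleNamespace(
    prompt_tokens=200, completion_tokens=10, total_tokens=210))
print(ledger.top_routes())
```

In practice the same record call would emit a metric or write a row, but even this in-memory version makes it obvious which feature is carrying the oversized prompt.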

A drafting assistant, a support agent, and an internal note searcher shouldn't all carry the same instruction payload. Specialized prompts are usually cheaper and more reliable than one giant universal prompt.

Choose models by task shape

There's no single best model choice for all workloads. A lightweight model is often enough for classification, extraction, routing, or simple drafting. More capable models make sense when ambiguity is high, tool selection is complex, or output quality matters more than speed.

Short classification: smaller, faster model.
Basic summarization: smaller model, tight prompt.
Tool-heavy planning: more capable model.
High-stakes drafting: more capable model with validation.

That sounds obvious, but many teams still route everything through one expensive path because it's easier than building task-aware dispatch.
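Task-aware dispatch can start as a lookup table. The task labels below are hypothetical, and the model names are simply the two used earlier in this post, standing in for "smaller" and "more capable".

```python
# Hypothetical routing table: task shape -> model name.
MODEL_BY_TASK = {
    "classification": "gpt-4o-mini",
    "summarization": "gpt-4o-mini",
    "tool_planning": "gpt-4o",
    "high_stakes_drafting": "gpt-4o",
}


def pick_model(task):
    """Return the model for a task, failing loudly on unrouted tasks."""
    try:
        return MODEL_BY_TASK[task]
    except KeyError:
        raise ValueError(f"no model route configured for task: {task}") from None


print(pick_model("classification"))
print(pick_model("tool_planning"))
```

Raising on an unknown task is a design choice: it forces every new feature to pick a route on purpose instead of silently inheriting the expensive default.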

Cut waste without hurting quality

Three tactics usually pay off fast.

First, trim prompts aggressively. Remove repeated instructions, duplicate retrieved passages, and decorative wording the model doesn't need.

Second, cache stable context in your application. If the same user opens the same workspace and asks variations on a known task, you don't need to rebuild all scaffolding every time.

Third, split workflows. A cheap model can route or summarize before a stronger model handles only the final generation step.

The cheapest token is the one you never send. Most savings come from prompt discipline, not clever accounting afterward.

What not to optimize too early

Don't start with micro-optimizing every token. Start with identifying the expensive paths that happen often. A compact prompt that causes bad outputs isn't efficient if it forces retries, manual review, or follow-up calls.

Good cost control balances three things: token volume, latency, and downstream correctness. If one optimization hurts the other two, it's probably not an optimization.

Debugging checklist

When an agent starts behaving badly, check these in order.

Request integrity

Is the API key correct for the environment?
Are the endpoint and base URL right? Hosted environments often need different configuration.
Does the model name match the deployment or provider expectation?
Does each message have the expected role and content?

Conversation quality

Has the system prompt drifted? Make sure instructions aren't duplicated or contradicted.
Did you forget to append prior assistant replies? If so, the model will act forgetful.
Is context bloated? Trim or summarize older turns before quality degrades.
What's the finish_reason? Don't ignore it when output is cut off or a tool call is pending.

Tool workflow issues

Is there a schema mismatch? Tighten tool argument definitions.
Is there validation? Reject malformed tool arguments before execution.
Did you drop a recursive step? If the model asks for a tool, execute it and call the model again.
Is parsing fragile? Prefer structured JSON over regex extraction.

Resilience and operations

How are 429s handled? Back off and retry with limits.
Are non-retryable client errors being retried? Fix the payload instead of hammering the endpoint.
Are usage logs missing? Without them, cost and context debugging both get harder.
Are there duplicate side effects? Add idempotency around any action that changes state outside the model.

The fastest way to debug LLM systems is to log the exact conversation state you sent, the exact response you got, and the exact action your app took next.

If you build around that discipline, the API stays manageable even as the product gets more complex.
