LangGraph: Production-Ready Workflow Orchestration
Sat Feb 07 2026
This guide starts with the minimum LangGraph fundamentals, then moves into production architecture, implementation, and operational practices.
Core LangGraph Concepts
- StateGraph: a workflow graph that coordinates nodes and edges around a shared state. Constraint: state must be serializable and stable across steps.
- Node: a single step function that transforms state. Pitfall: nodes that do too much are hard to test and debug.
- Conditional edges: routing based on state. Pitfall: missing or ambiguous routes create dead ends.
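The three concepts can be illustrated without LangGraph itself. A dependency-free sketch of the execution model (node names and the `route` key are illustrative, not LangGraph API):

```python
# Minimal execution model behind a StateGraph: nodes transform a shared
# state dict, and conditional edges pick the next node from that state.
def classify(state):
    state["route"] = "review" if "refund" in state["text"] else "reply"
    return state

def review(state):
    state["out"] = "queued"
    return state

def reply(state):
    state["out"] = "answered"
    return state

NODES = {"classify": classify, "review": review, "reply": reply}
# An edge function returns the next node name, or None for a terminal node.
EDGES = {"classify": lambda s: s["route"], "review": lambda s: None, "reply": lambda s: None}

def run(state, entry="classify"):
    node = entry
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

print(run({"text": "refund please"})["out"])  # → queued
```

Note how a route value with no entry in `EDGES` would raise a `KeyError` immediately: the dead-end pitfall above, surfacing at the first bad input rather than silently.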
Architecture
A production LangGraph system has:
- State contract: explicit schema and validation.
- Workflow graph: nodes for classification, action, and review.
- Runner: retries, logging, and budget controls.
- Observability: node timings, route distribution, error rate.
This design fits LangGraph because its explicit state and routing can be monitored and evolved safely.
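The observability component can start as little more than counters kept alongside the runner. A minimal sketch (class and metric names are illustrative):

```python
from collections import Counter

class WorkflowMetrics:
    """Tracks route distribution, error rate, and per-node timings."""

    def __init__(self):
        self.routes = Counter()    # route name -> count
        self.node_ms = Counter()   # node name -> accumulated milliseconds
        self.runs = 0
        self.errors = 0

    def record_run(self, route, failed=False):
        self.runs += 1
        self.routes[route] += 1
        if failed:
            self.errors += 1

    def record_node(self, name, ms):
        self.node_ms[name] += ms

    def error_rate(self):
        return self.errors / self.runs if self.runs else 0.0

m = WorkflowMetrics()
m.record_run("auto")
m.record_run("human_review", failed=True)
print(m.error_rate())  # → 0.5
```

In production these counters would feed a metrics backend, but even this in-process form answers the first incident questions: which routes fire, and how often runs fail.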
Step-by-Step Implementation
Step 1: Define State Contract
Purpose: ensure every node can rely on consistent input.
```python
from pydantic import BaseModel

class WorkflowState(BaseModel):
    request_id: str
    input_text: str
    route: str | None = None
    response: str | None = None
```
Validation: invalid or missing fields fail fast before execution.
Step 2: Build the Workflow Graph
Purpose: encode routing and fallback behavior.
```python
import os

from langgraph.graph import StateGraph
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

async def classify_with_llm(text: str) -> str:
    # Note: the sync client blocks the event loop; for real concurrency,
    # switch to AsyncAzureOpenAI and `await` the call.
    prompt = (
        "Classify risk as 'high' or 'low' only. "
        "High risk includes refunds, security, or legal issues. "
        f"Ticket: {text}"
    )
    # Token budget is enforced here, at the model call.
    resp = client.responses.create(
        model="gpt-4.1-mini", input=prompt, max_output_tokens=600
    )
    out = (resp.output_text or "").lower()
    return "high" if "high" in out else "low"

def heuristic_risk(text: str) -> str:
    # Deterministic fallback used when the LLM call fails.
    return "high" if "refund" in text.lower() else "low"

async def classify(state: WorkflowState) -> dict:
    try:
        risk = await classify_with_llm(state.input_text)
    except Exception:
        risk = heuristic_risk(state.input_text)
    # Nodes return partial state updates; LangGraph merges them into the state.
    return {"route": "human_review" if risk == "high" else "auto"}

async def review(state: WorkflowState) -> dict:
    return {"response": "Queued for human review."}

async def auto_reply(state: WorkflowState) -> dict:
    return {"response": "We can help. Please share your order ID."}

builder = StateGraph(WorkflowState)
builder.add_node("classify", classify)
builder.add_node("review", review)
builder.add_node("auto_reply", auto_reply)
builder.set_entry_point("classify")
builder.add_conditional_edges(
    "classify",
    lambda s: s.route,
    {"human_review": "review", "auto": "auto_reply"},
)
builder.set_finish_point("review")
builder.set_finish_point("auto_reply")
graph = builder.compile()
```
Validation: every route leads to a terminal node; missing routes fail fast.
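The "missing routes fail fast" property can be checked mechanically before deploy: assert that every value the classifier can emit has a mapped target, and that every target is a terminal node. A sketch against the mapping above (the sets are transcribed from Step 2 by hand; in a larger graph they would be derived from the builder):

```python
# Route values the classify node can emit, and the edge mapping from Step 2.
CLASSIFIER_OUTPUTS = {"human_review", "auto"}
ROUTE_MAP = {"human_review": "review", "auto": "auto_reply"}
TERMINAL_NODES = {"review", "auto_reply"}

def check_routes(outputs, route_map, terminals):
    """Fail fast on unrouted classifier outputs or routes to non-terminal nodes."""
    missing = outputs - route_map.keys()
    if missing:
        raise ValueError(f"unrouted classifier outputs: {missing}")
    dead_ends = set(route_map.values()) - terminals
    if dead_ends:
        raise ValueError(f"routes to non-terminal nodes: {dead_ends}")
    return True

print(check_routes(CLASSIFIER_OUTPUTS, ROUTE_MAP, TERMINAL_NODES))  # → True
```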
Step 3: Production Runner (Retries + Logging)
Purpose: control failures and provide traceability.
```python
import asyncio
import logging
import time

logger = logging.getLogger("workflow")
logger.setLevel(logging.INFO)

MAX_RETRIES = 2

async def run_workflow(request_id: str, state: WorkflowState):
    start = time.time()
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            # Token budget is applied at the model call (Step 2), not via config.
            result = await graph.ainvoke(state)
            latency_ms = int((time.time() - start) * 1000)
            logger.info("workflow_ok", extra={"request_id": request_id, "latency_ms": latency_ms})
            return result
        except Exception as exc:
            logger.warning(
                "workflow_retry",
                extra={"request_id": request_id, "attempt": attempt, "error": str(exc)},
            )
            await asyncio.sleep(0.2 * attempt)  # linear backoff; never block the event loop
    raise RuntimeError(f"workflow_failed: {request_id}")
```
Validation: retry behavior triggers on transient errors; logs include request IDs.
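The retry loop generalizes into a helper that can be unit-tested with a deliberately flaky coroutine, without invoking the real graph (helper and test names are illustrative):

```python
import asyncio

async def with_retries(fn, max_retries=2, base_delay=0.01):
    """Run an async callable, retrying on any exception with linear backoff."""
    last_exc = None
    for attempt in range(1, max_retries + 1):
        try:
            return await fn()
        except Exception as exc:
            last_exc = exc
            await asyncio.sleep(base_delay * attempt)
    raise RuntimeError("workflow_failed") from last_exc

# Simulated transient failure: fails once, then succeeds on the second attempt.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("transient")
    return "ok"

print(asyncio.run(with_retries(flaky)))  # → ok
```

A test like this pins down the contract the runner relies on: transient errors are absorbed up to the retry budget, and persistent ones surface as a single terminal failure.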
Common Mistakes & Anti-Patterns
- Overloaded nodes: hard to test and debug in isolation. Fix: keep nodes atomic.
- Missing fallback routes: causes dead ends. Fix: always define a fallback path.
- No observability: you can’t diagnose production incidents. Fix: log node timings and routes.
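The observability fix can start as a small timing wrapper applied to each node function before registration (the `timings` store and names here are illustrative; in production the durations would go to your metrics backend):

```python
import functools
import time

timings = {}  # node name -> accumulated milliseconds

def timed_node(fn):
    """Wrap a node so its wall-clock duration is recorded under its name."""
    @functools.wraps(fn)
    def wrapper(state):
        start = time.perf_counter()
        try:
            return fn(state)
        finally:
            ms = (time.perf_counter() - start) * 1000
            timings[fn.__name__] = timings.get(fn.__name__, 0.0) + ms
    return wrapper

@timed_node
def classify(state):
    state["route"] = "auto"
    return state

print(classify({"route": None})["route"])  # → auto
```

The `finally` clause matters: failing nodes still get timed, so slow failures show up in the same dashboard as slow successes.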
Testing & Debugging
- Unit test each node in isolation.
- Use a golden set of inputs to validate routing accuracy.
- Replay failed requests from logs to reproduce issues.
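A golden set is just labeled inputs with expected routes. A sketch exercising the heuristic fallback router from Step 2 (the cases are illustrative; a real set would be sampled from production traffic):

```python
def heuristic_risk(text: str) -> str:
    # Same fallback rule as in Step 2.
    return "high" if "refund" in text.lower() else "low"

GOLDEN_SET = [
    ("I want a refund for order 123", "high"),
    ("Where is my package?", "low"),
    ("REFUND NOW", "high"),
]

failures = [
    (text, want, heuristic_risk(text))
    for text, want in GOLDEN_SET
    if heuristic_risk(text) != want
]

print(f"{len(GOLDEN_SET) - len(failures)}/{len(GOLDEN_SET)} passed")  # → 3/3 passed
```

The same harness works for the LLM classifier: swap the function under test and track the pass rate over time to catch routing regressions after prompt or model changes.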
Trade-offs & Alternatives
- Limitations: more complexity than linear chains.
- When not to use: one-step or stateless workflows.
- Alternatives: simple async flows, queue-based workers, or state machines.
Rollout Checklist
- State schema validated
- Retry policy configured
- Monitoring dashboards live
- Rollback tested