LangGraph: Production-Ready Workflow Orchestration
Sat Feb 07 2026
This guide starts with the minimum LangGraph fundamentals, then moves into production architecture, implementation, and operational practices.
Core LangGraph Concepts
- StateGraph: a workflow graph that coordinates nodes and edges around a shared state. Constraint: state must be serializable and stable across steps.
- Node: a single step function that transforms state. Pitfall: nodes that do too much are hard to test and debug.
- Conditional edges: routing based on state. Pitfall: missing or ambiguous routes create dead ends.
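The three concepts can be illustrated without LangGraph itself. A dependency-free sketch of the execution model (node names and the `route` key are illustrative, not LangGraph API):

```python
# Minimal execution model behind a StateGraph: nodes transform a shared
# state dict, and conditional edges pick the next node from that state.
def classify(state):
    state["route"] = "review" if "refund" in state["text"] else "reply"
    return state

def review(state):
    state["out"] = "queued"
    return state

def reply(state):
    state["out"] = "answered"
    return state

NODES = {"classify": classify, "review": review, "reply": reply}
# An edge function returns the next node name, or None for a terminal node.
EDGES = {"classify": lambda s: s["route"], "review": lambda s: None, "reply": lambda s: None}

def run(state, entry="classify"):
    node = entry
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

print(run({"text": "refund please"})["out"])  # → queued
```

Note how a route value with no entry in `EDGES` would raise a `KeyError` immediately: the dead-end pitfall above, surfacing at the first bad input rather than silently.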
Architecture
A production LangGraph system has:
- State contract: explicit schema and validation.
- Workflow graph: nodes for classification, action, and review.
- Runner: retries, logging, and budget controls.
- Observability: node timings, route distribution, error rate.
This design fits LangGraph because its explicit state and routing can be monitored and evolved safely.
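The observability component can start as little more than counters kept alongside the runner. A minimal sketch (class and metric names are illustrative):

```python
from collections import Counter

class WorkflowMetrics:
    """Tracks route distribution, error rate, and per-node timings."""

    def __init__(self):
        self.routes = Counter()    # route name -> count
        self.node_ms = Counter()   # node name -> accumulated milliseconds
        self.runs = 0
        self.errors = 0

    def record_run(self, route, failed=False):
        self.runs += 1
        self.routes[route] += 1
        if failed:
            self.errors += 1

    def record_node(self, name, ms):
        self.node_ms[name] += ms

    def error_rate(self):
        return self.errors / self.runs if self.runs else 0.0

m = WorkflowMetrics()
m.record_run("auto")
m.record_run("human_review", failed=True)
print(m.error_rate())  # → 0.5
```

In production these counters would feed a metrics backend, but even this in-process form answers the first incident questions: which routes fire, and how often runs fail.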
Step-by-Step Implementation
Step 1: Define State Contract
Purpose: ensure every node can rely on consistent input.
```python
from pydantic import BaseModel

class WorkflowState(BaseModel):
    request_id: str
    input_text: str
    route: str | None = None
    response: str | None = None
```
Validation: invalid or missing fields fail fast before execution.
Step 2: Build the Workflow Graph
Purpose: encode routing and fallback behavior.
```python
import os

from langgraph.graph import StateGraph
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

async def classify_with_llm(text: str) -> str:
    # Note: the sync client blocks the event loop; for real concurrency,
    # switch to AsyncAzureOpenAI and `await` the call.
    prompt = (
        "Classify risk as 'high' or 'low' only. "
        "High risk includes refunds, security, or legal issues. "
        f"Ticket: {text}"
    )
    # Token budget is enforced here, at the model call.
    resp = client.responses.create(
        model="gpt-4.1-mini", input=prompt, max_output_tokens=600
    )
    out = (resp.output_text or "").lower()
    return "high" if "high" in out else "low"

def heuristic_risk(text: str) -> str:
    # Deterministic fallback used when the LLM call fails.
    return "high" if "refund" in text.lower() else "low"

async def classify(state: WorkflowState) -> dict:
    try:
        risk = await classify_with_llm(state.input_text)
    except Exception:
        risk = heuristic_risk(state.input_text)
    # Nodes return partial state updates; LangGraph merges them into the state.
    return {"route": "human_review" if risk == "high" else "auto"}

async def review(state: WorkflowState) -> dict:
    return {"response": "Queued for human review."}

async def auto_reply(state: WorkflowState) -> dict:
    return {"response": "We can help. Please share your order ID."}

builder = StateGraph(WorkflowState)
builder.add_node("classify", classify)
builder.add_node("review", review)
builder.add_node("auto_reply", auto_reply)
builder.set_entry_point("classify")
builder.add_conditional_edges(
    "classify",
    lambda s: s.route,
    {"human_review": "review", "auto": "auto_reply"},
)
builder.set_finish_point("review")
builder.set_finish_point("auto_reply")
graph = builder.compile()
```
Validation: every route leads to a terminal node; missing routes fail fast.
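The "missing routes fail fast" property can be checked mechanically before deploy: assert that every value the classifier can emit has a mapped target, and that every target is a terminal node. A sketch against the mapping above (the sets are transcribed from Step 2 by hand; in a larger graph they would be derived from the builder):

```python
# Route values the classify node can emit, and the edge mapping from Step 2.
CLASSIFIER_OUTPUTS = {"human_review", "auto"}
ROUTE_MAP = {"human_review": "review", "auto": "auto_reply"}
TERMINAL_NODES = {"review", "auto_reply"}

def check_routes(outputs, route_map, terminals):
    """Fail fast on unrouted classifier outputs or routes to non-terminal nodes."""
    missing = outputs - route_map.keys()
    if missing:
        raise ValueError(f"unrouted classifier outputs: {missing}")
    dead_ends = set(route_map.values()) - terminals
    if dead_ends:
        raise ValueError(f"routes to non-terminal nodes: {dead_ends}")
    return True

print(check_routes(CLASSIFIER_OUTPUTS, ROUTE_MAP, TERMINAL_NODES))  # → True
```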
Step 3: Production Runner (Retries + Logging)
Purpose: control failures and provide traceability.
```python
import asyncio
import logging
import time

logger = logging.getLogger("workflow")
logger.setLevel(logging.INFO)

MAX_RETRIES = 2

async def run_workflow(request_id: str, state: WorkflowState):
    start = time.time()
    for attempt in range(1, MAX_RETRIES + 1):
        try:
            # Token budget is applied at the model call (Step 2), not via config.
            result = await graph.ainvoke(state)
            latency_ms = int((time.time() - start) * 1000)
            logger.info("workflow_ok", extra={"request_id": request_id, "latency_ms": latency_ms})
            return result
        except Exception as exc:
            logger.warning(
                "workflow_retry",
                extra={"request_id": request_id, "attempt": attempt, "error": str(exc)},
            )
            await asyncio.sleep(0.2 * attempt)  # linear backoff; never block the event loop
    raise RuntimeError(f"workflow_failed: {request_id}")
```
Validation: retry behavior triggers on transient errors; logs include request IDs.
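The retry loop generalizes into a helper that can be unit-tested with a deliberately flaky coroutine, without invoking the real graph (helper and test names are illustrative):

```python
import asyncio

async def with_retries(fn, max_retries=2, base_delay=0.01):
    """Run an async callable, retrying on any exception with linear backoff."""
    last_exc = None
    for attempt in range(1, max_retries + 1):
        try:
            return await fn()
        except Exception as exc:
            last_exc = exc
            await asyncio.sleep(base_delay * attempt)
    raise RuntimeError("workflow_failed") from last_exc

# Simulated transient failure: fails once, then succeeds on the second attempt.
calls = {"n": 0}

async def flaky():
    calls["n"] += 1
    if calls["n"] == 1:
        raise TimeoutError("transient")
    return "ok"

print(asyncio.run(with_retries(flaky)))  # → ok
```

A test like this pins down the contract the runner relies on: transient errors are absorbed up to the retry budget, and persistent ones surface as a single terminal failure.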
Common Mistakes & Anti-Patterns
- Overloaded nodes: hard to test and debug in isolation. Fix: keep nodes atomic.
- Missing fallback routes: causes dead ends. Fix: always define a fallback path.
- No observability: you can’t diagnose production incidents. Fix: log node timings and routes.
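The observability fix can start as a small timing wrapper applied to each node function before registration (the `timings` store and names here are illustrative; in production the durations would go to your metrics backend):

```python
import functools
import time

timings = {}  # node name -> accumulated milliseconds

def timed_node(fn):
    """Wrap a node so its wall-clock duration is recorded under its name."""
    @functools.wraps(fn)
    def wrapper(state):
        start = time.perf_counter()
        try:
            return fn(state)
        finally:
            ms = (time.perf_counter() - start) * 1000
            timings[fn.__name__] = timings.get(fn.__name__, 0.0) + ms
    return wrapper

@timed_node
def classify(state):
    state["route"] = "auto"
    return state

print(classify({"route": None})["route"])  # → auto
```

The `finally` clause matters: failing nodes still get timed, so slow failures show up in the same dashboard as slow successes.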
Testing & Debugging
- Unit test each node in isolation.
- Use a golden set of inputs to validate routing accuracy.
- Replay failed requests from logs to reproduce issues.
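A golden set is just labeled inputs with expected routes. A sketch exercising the heuristic fallback router from Step 2 (the cases are illustrative; a real set would be sampled from production traffic):

```python
def heuristic_risk(text: str) -> str:
    # Same fallback rule as in Step 2.
    return "high" if "refund" in text.lower() else "low"

GOLDEN_SET = [
    ("I want a refund for order 123", "high"),
    ("Where is my package?", "low"),
    ("REFUND NOW", "high"),
]

failures = [
    (text, want, heuristic_risk(text))
    for text, want in GOLDEN_SET
    if heuristic_risk(text) != want
]

print(f"{len(GOLDEN_SET) - len(failures)}/{len(GOLDEN_SET)} passed")  # → 3/3 passed
```

The same harness works for the LLM classifier: swap the function under test and track the pass rate over time to catch routing regressions after prompt or model changes.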
Trade-offs & Alternatives
- Limitations: more complexity than linear chains.
- When not to use: one-step or stateless workflows.
- Alternatives: simple async flows, queue-based workers, or state machines.
Rollout Checklist
- State schema validated
- Retry policy configured
- Monitoring dashboards live
- Rollback tested