About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
Every enterprise wants an AI chatbot now. Most of the tutorials out there will get you a working prototype in an afternoon. Deploying that prototype to production for a Fortune 500 client with 10,000 concurrent users, strict data isolation requirements, and a CFO watching the LLM API bill? That is a different engineering problem entirely. I have built and operated chatbot systems at this scale across regulated industries (healthcare, financial services, government), and the gap between "it works on my laptop" and "it handles production load without bleeding money" is enormous. This article covers the architecture I have landed on after iterating through several generations of enterprise chatbot deployments: React on the frontend, FastAPI on the backend, WebSockets for real-time communication, and a layered storage and caching strategy that keeps costs sane.
Skip this if you want a "build your first chatbot" walkthrough. What follows is a production architecture reference for engineers who need to ship a chatbot that survives contact with enterprise users, enterprise compliance teams, and enterprise budgets.

System Architecture Overview
High-Level Component Map
The architecture separates concerns into five layers: client, gateway, application, model, and storage. Each layer scales independently. Each layer can be swapped without rewriting the others.
The React frontend maintains a persistent WebSocket connection to the FastAPI backend through an Application Load Balancer. Every connection gets authenticated via JWT before the WebSocket handshake completes. Messages flow through a router that handles rate limiting, prompt caching, and LLM dispatch. Conversation state lives in Redis for active sessions and DynamoDB for durable history.
Why WebSockets Over SSE
I have shipped chatbots using both Server-Sent Events (SSE) and WebSockets. The conventional wisdom says SSE is simpler, and for basic chat-and-respond patterns, that is true. For enterprise chatbots, WebSockets win on three fronts.
| Criterion | SSE | WebSocket | Winner |
|---|---|---|---|
| Direction | Server-to-client only | Full duplex | WebSocket |
| Client-side streaming | Requires separate HTTP POST | Native bidirectional | WebSocket |
| Connection overhead | New HTTP request per message sent | Single persistent connection | WebSocket |
| Binary data | Text only (Base64 encoding required) | Native binary frames | WebSocket |
| Mid-stream cancellation | Client closes connection; server may not notice | Client sends cancel frame; server stops immediately | WebSocket |
| Typing indicators | Separate polling endpoint | Native push in both directions | WebSocket |
| Browser connection limit | 6 per domain (HTTP/1.1) | No practical limit | WebSocket |
| Load balancer support | Standard HTTP | Requires sticky sessions or connection draining | SSE |
| Infrastructure complexity | Low | Moderate | SSE |
| Debugging | curl + browser DevTools | Specialized tooling | SSE |
The mid-stream cancellation point deserves emphasis. Enterprise users cancel generations constantly: wrong prompt, too slow, changed their mind. With SSE, the client drops the connection, but the server keeps generating tokens (and burning money) until it notices the disconnect. With WebSockets, the client sends an explicit cancel frame, and the server aborts the LLM call within milliseconds. At 10,000 concurrent users, that difference saves thousands of dollars per month in wasted tokens.
For a deeper comparison of real-time protocols including gRPC, see Real-Time Messaging Protocols: WebSockets, SSE, gRPC, Long Polling, and MQTT Compared.
The WebSocket Layer
Connection Lifecycle
Every WebSocket connection follows a strict lifecycle. Understanding each phase matters because failure modes differ at each stage.
The JWT travels in the initial HTTP upgrade request, either as a query parameter (?token=...) or in the Sec-WebSocket-Protocol header. I prefer the header approach because query parameters end up in access logs and can leak credentials. The server validates the token, extracts the tenant ID and user ID, and registers the connection in Redis before completing the handshake.
Connection Management and Scaling
A single FastAPI instance running on Uvicorn handles between 1,000 and 5,000 concurrent WebSocket connections, depending on the message throughput and how much CPU the LLM streaming consumes. Here are the practical limits I have measured across deployments.
| Resource | Single Instance | 4-Instance Cluster | 16-Instance Cluster |
|---|---|---|---|
| Concurrent connections | 2,000 | 8,000 | 32,000 |
| Messages per second | 500 | 2,000 | 8,000 |
| Memory per connection | ~15 KB | ~15 KB | ~15 KB |
| Memory overhead (total) | 30 MB | 120 MB | 480 MB |
| Redis pub/sub channels | 0 (unnecessary) | 4 | 16 |
| Sticky sessions required | No | Yes | Yes |
When scaling beyond a single instance, Redis pub/sub becomes mandatory. Each server instance only knows about its own connections. When User A on Instance 1 sends a message that needs to trigger a notification to User B on Instance 3, the message routes through Redis pub/sub. Without this, multi-instance deployments silently lose messages.
```python
# Connection manager with Redis pub/sub for multi-instance scaling
import json

from fastapi import WebSocket
from redis.asyncio import Redis


class ConnectionManager:
    def __init__(self, redis_client: Redis):
        self.active: dict[str, WebSocket] = {}
        self.redis = redis_client
        self.pubsub = redis_client.pubsub()

    async def connect(self, user_id: str, websocket: WebSocket):
        await websocket.accept()
        self.active[user_id] = websocket
        await self.redis.sadd("ws:connections", user_id)
        # Subscribe so other instances can reach this user through Redis
        await self.pubsub.subscribe(f"ws:user:{user_id}")

    async def disconnect(self, user_id: str):
        self.active.pop(user_id, None)
        await self.redis.srem("ws:connections", user_id)
        await self.pubsub.unsubscribe(f"ws:user:{user_id}")

    async def send_to_user(self, user_id: str, message: dict):
        if user_id in self.active:
            await self.active[user_id].send_json(message)
        else:
            # User is on a different instance; publish to Redis
            await self.redis.publish(
                f"ws:user:{user_id}",
                json.dumps(message)
            )

    async def relay_loop(self):
        # Run once per instance: deliver messages other instances publish
        # for users connected locally (assumes decode_responses=True)
        async for item in self.pubsub.listen():
            if item["type"] != "message":
                continue
            user_id = item["channel"].rsplit(":", 1)[-1]
            ws = self.active.get(user_id)
            if ws:
                await ws.send_json(json.loads(item["data"]))
```
Heartbeats, Reconnection, and Failure Modes
WebSocket connections die silently. The TCP connection stays open, but no data flows. Without heartbeats, you accumulate zombie connections that consume memory and file descriptors until the server runs out of both.
I run server-side pings every 30 seconds with a 10-second timeout. If the client fails to respond with a pong within 10 seconds, the server closes the connection and cleans up. On the client side, the React app detects disconnections and implements exponential backoff reconnection: 1 second, 2 seconds, 4 seconds, 8 seconds, capped at 30 seconds.
The three failure modes I see most in production:
- Load balancer idle timeout. ALBs close connections after 60 seconds of inactivity by default. Set the idle timeout to 3600 seconds (1 hour) and rely on application-level heartbeats to keep connections alive.
- Deployment-induced disconnections. Rolling deployments drain connections from old instances. Configure a 120-second connection draining period so active conversations complete before the instance shuts down.
- Client backgrounding on mobile. Mobile browsers kill WebSocket connections when the tab goes to the background. The React client stores pending messages in localStorage and replays them on reconnect.
Backend: FastAPI and the Message Pipeline
Request Flow and Message Routing
Every incoming WebSocket message follows a pipeline: authenticate, validate, rate-check, cache-check, route, stream, persist. Here is the core message handler.
```python
from fastapi import FastAPI, Query, WebSocket, WebSocketDisconnect


@app.websocket("/ws/chat")
async def chat_endpoint(websocket: WebSocket, token: str = Query(...)):
    # 1. Authenticate (reject before the connection does any work)
    try:
        claims = verify_jwt(token)
    except Exception:  # invalid, expired, or malformed token
        await websocket.close(code=1008)  # policy violation
        return
    tenant_id = claims["tenant_id"]
    user_id = claims["sub"]
    # 2. Connect
    await manager.connect(user_id, websocket)
    try:
        while True:
            data = await websocket.receive_json()
            msg_type = data.get("type")
            if msg_type == "message":
                # 3. Rate check
                if not await rate_limiter.allow(tenant_id, user_id):
                    await websocket.send_json({
                        "type": "error",
                        "code": "RATE_LIMITED",
                        "message": "Too many requests. Try again shortly."
                    })
                    continue
                # 4. Cache check
                cached = await prompt_cache.get(data["content"], tenant_id)
                if cached:
                    await websocket.send_json({
                        "type": "response",
                        "content": cached,
                        "cached": True
                    })
                    continue
                # 5. Stream from LLM
                await stream_llm_response(
                    websocket, tenant_id, user_id, data
                )
            elif msg_type == "cancel":
                await cancel_active_generation(user_id)
            elif msg_type == "ping":
                await websocket.send_json({"type": "pong"})
    except WebSocketDisconnect:
        pass
    finally:
        # Clean up on every exit path, not just graceful disconnects
        await manager.disconnect(user_id)
```
The cancel message type is critical. When the user clicks "Stop generating," the client sends {"type": "cancel"}, and the server aborts the active LLM API call. With Anthropic's Claude API, this means closing the streaming connection. With OpenAI, you cancel the async task. Either way, you stop burning tokens the instant the user loses interest.
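The server side of cancellation reduces to a registry of in-flight asyncio tasks. A sketch of the `cancel_active_generation` helper referenced above — `run_generation` and the module-level dict are illustrative names, not the article's exact implementation:

```python
import asyncio

# One in-flight generation per user; a dict stands in for real state here
active_generations: dict[str, asyncio.Task] = {}


async def run_generation(user_id: str, coro) -> None:
    """Run an LLM streaming coroutine as a cancellable task."""
    task = asyncio.ensure_future(coro)
    active_generations[user_id] = task
    try:
        await task
    except asyncio.CancelledError:
        # User hit "Stop generating"; the provider stream is closed here,
        # so token billing stops immediately
        pass
    finally:
        active_generations.pop(user_id, None)


async def cancel_active_generation(user_id: str) -> None:
    task = active_generations.get(user_id)
    if task and not task.done():
        task.cancel()
```

Cancelling the task raises `CancelledError` inside the streaming coroutine, which unwinds the `async for` over the provider stream and closes the underlying HTTP connection.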
LLM Integration and Streaming
The LLM gateway abstracts provider differences behind a uniform streaming interface. Every provider streams differently (Anthropic uses Server-Sent Events, OpenAI uses chunked JSON, AWS Bedrock uses event streams), but the chatbot backend normalizes everything into WebSocket JSON frames.
```python
async def stream_llm_response(
    websocket: WebSocket,
    tenant_id: str,
    user_id: str,
    message: dict
):
    conversation = await history.get_recent(tenant_id, user_id, limit=20)
    # Send typing indicator
    await websocket.send_json({"type": "typing", "active": True})
    full_response = []
    async for chunk in llm_gateway.stream(
        model=get_tenant_model(tenant_id),
        messages=conversation + [{"role": "user", "content": message["content"]}],
        system=get_tenant_system_prompt(tenant_id),
    ):
        full_response.append(chunk)
        await websocket.send_json({
            "type": "token",
            "content": chunk,
        })
    # Stop typing indicator
    await websocket.send_json({"type": "typing", "active": False})
    # Persist
    complete_text = "".join(full_response)
    await history.save(tenant_id, user_id, message["content"], complete_text)
    # Cache for future hits
    await prompt_cache.set(message["content"], complete_text, tenant_id)
```
Each tenant gets configured with a different model. A cost-sensitive tenant runs Claude Haiku. A tenant needing maximum quality gets Claude Opus. The gateway handles this routing transparently.
Conversation History: Redis + DynamoDB
Active conversations live in Redis for sub-millisecond reads. Completed conversations persist to DynamoDB for durable storage and compliance. This dual-layer approach keeps the hot path fast without sacrificing durability.
| Layer | Store | Data | TTL | Access Pattern |
|---|---|---|---|---|
| L1: Working memory | Redis Hash | Last 20 messages | 2 hours | Read on every message |
| L2: Session history | Redis List | Full session transcript | 24 hours | Read on reconnect |
| L3: Durable archive | DynamoDB | Complete conversation | Indefinite | Read for compliance, analytics |
| L4: Summarized context | Redis String | LLM-generated summary | 7 days | Injected into system prompt |
The L4 summarization layer is what makes long conversations work. After 20 messages, the raw conversation exceeds most context windows. The system generates a rolling summary using a cheap, fast model (Claude Haiku) and injects it into the system prompt. The user sees a seamless conversation; the backend keeps token counts manageable.
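The folding step can be sketched as a pure function: keep the most recent messages verbatim and compress everything older into the rolling summary. `update_rolling_summary` is an illustrative name, and `summarize` stands in for the cheap-model call (e.g. Claude Haiku) — neither is the article's exact helper:

```python
SUMMARY_TRIGGER = 20  # messages kept verbatim before folding begins


async def update_rolling_summary(tenant_id, user_id, messages,
                                 prior_summary, summarize):
    """Fold older messages into a compact summary via a cheap model.

    `summarize` is an injected async callable taking a prompt and returning
    a summary string. Returns (summary, recent_messages); the summary is
    what gets injected into the system prompt on the next turn.
    """
    if len(messages) <= SUMMARY_TRIGGER:
        return prior_summary, messages
    older, recent = messages[:-SUMMARY_TRIGGER], messages[-SUMMARY_TRIGGER:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in older)
    prompt = (
        "Update this running conversation summary with the new messages.\n"
        f"Current summary: {prior_summary or '(none)'}\n"
        f"New messages:\n{transcript}"
    )
    new_summary = await summarize(prompt)
    return new_summary, recent
```

Because only the last 20 messages plus a short summary reach the main model, the input token count stays roughly constant no matter how long the conversation runs.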
For production DynamoDB table design, see AWS DynamoDB: An Architecture Deep-Dive. For Redis caching patterns at scale, see Amazon ElastiCache: An Architecture Deep-Dive.
Frontend: React Chat Interface
Component Architecture
The React frontend follows a strict component hierarchy. State flows down through props. Events bubble up through callbacks. The WebSocket connection lives in a context provider so every component can send and receive messages without prop drilling.
```
ChatApp
├── WebSocketProvider          # Manages connection lifecycle
│   ├── ChatHeader             # Title, status indicator, settings
│   ├── MessageList            # Scrollable message container
│   │   ├── Message            # Individual message bubble
│   │   │   ├── MarkdownRenderer  # Renders LLM markdown output
│   │   │   └── CodeBlock         # Syntax-highlighted code
│   │   ├── TypingIndicator    # Animated dots during generation
│   │   └── ScrollAnchor       # Auto-scroll to bottom
│   └── InputArea              # Text input, send button, cancel
│       ├── TextInput          # Auto-resizing textarea
│       ├── FileUpload         # Drag-and-drop attachments
│       └── ActionButtons      # Send, cancel, clear
```
Streaming Text Rendering
The streaming renderer accumulates tokens and re-renders the Markdown output as each chunk arrives. Naive implementations re-parse the entire accumulated text on every token, which causes visible jank at high token rates. The solution: buffer tokens for 50 milliseconds before triggering a re-render.
```tsx
function useStreamingMessage(websocket: WebSocket) {
  const [content, setContent] = useState("");
  const bufferRef = useRef("");
  const timerRef = useRef<number | null>(null);

  useEffect(() => {
    const handler = (event: MessageEvent) => {
      const data = JSON.parse(event.data);
      if (data.type === "token") {
        bufferRef.current += data.content;
        if (!timerRef.current) {
          timerRef.current = window.setTimeout(() => {
            setContent(prev => prev + bufferRef.current);
            bufferRef.current = "";
            timerRef.current = null;
          }, 50);
        }
      }
    };
    websocket.addEventListener("message", handler);
    return () => {
      websocket.removeEventListener("message", handler);
      // Clear any pending flush so the timer cannot fire after unmount
      if (timerRef.current) {
        window.clearTimeout(timerRef.current);
        timerRef.current = null;
      }
    };
  }, [websocket]);

  return content;
}
```
That 50-millisecond buffer is the difference between 60 FPS scrolling and a stuttering mess. Claude Opus streams at roughly 80 tokens per second. Without buffering, that is 80 React re-renders per second, each triggering a Markdown parse, DOM diff, and scroll adjustment. With the buffer, you get 20 re-renders per second: smooth, responsive, and kind to mobile devices.
State Management for Conversations
I use useReducer over useState for conversation state. Chat state has too many interdependent fields (messages, loading, error, active generation ID, typing status) for individual useState calls to manage cleanly. A reducer keeps state transitions explicit and debuggable.
```tsx
type ChatAction =
  | { type: "ADD_USER_MESSAGE"; content: string }
  | { type: "START_STREAMING"; messageId: string }
  | { type: "APPEND_TOKEN"; messageId: string; token: string }
  | { type: "FINISH_STREAMING"; messageId: string }
  | { type: "CANCEL_GENERATION" }
  | { type: "SET_ERROR"; error: string }
  | { type: "CLEAR_CONVERSATION" };

function chatReducer(state: ChatState, action: ChatAction): ChatState {
  switch (action.type) {
    case "ADD_USER_MESSAGE":
      return {
        ...state,
        messages: [...state.messages, {
          id: crypto.randomUUID(),
          role: "user",
          content: action.content,
          timestamp: Date.now(),
        }],
      };
    case "START_STREAMING":
      return {
        ...state,
        isStreaming: true,
        activeMessageId: action.messageId,
        messages: [...state.messages, {
          id: action.messageId,
          role: "assistant",
          content: "",
          timestamp: Date.now(),
        }],
      };
    case "APPEND_TOKEN":
      return {
        ...state,
        messages: state.messages.map(m =>
          m.id === action.messageId
            ? { ...m, content: m.content + action.token }
            : m
        ),
      };
    // ... remaining cases
    default:
      return state;
  }
}
```
Multi-Tenancy and Authentication
JWT-Based Tenant Isolation
Every WebSocket connection carries a JWT that identifies both the user and the tenant. The JWT gets validated during the WebSocket handshake, before the 101 Switching Protocols response. If the token is invalid, expired, or missing the tenant claim, the server rejects the upgrade with a 401.
The tenant ID drives everything downstream: which LLM model to use, which system prompt to inject, which rate limits to apply, which DynamoDB partition to write conversation history to. There is no code path where a message from Tenant A can reach Tenant B's data. The isolation happens at the application layer (tenant-scoped queries), the cache layer (tenant-prefixed Redis keys), and the storage layer (tenant-partitioned DynamoDB tables).
```python
# Every Redis key includes the tenant prefix
REDIS_KEY_PATTERN = "chat:{tenant_id}:{user_id}:{key}"

# Every DynamoDB query scopes to the tenant partition
table.query(
    KeyConditionExpression=Key("pk").eq(f"TENANT#{tenant_id}#USER#{user_id}")
)
```
For a deep dive on Cognito-based authentication that integrates with this pattern, see AWS Cognito User Authentication: An Architecture Deep-Dive.
Per-Tenant Rate Limiting
Rate limits protect both the system and the budget. Without them, a single enthusiastic user can exhaust the LLM API quota for the entire platform. I use a token bucket algorithm implemented in Redis, with bucket sizes configured per tenant tier.
| Tenant Tier | Messages/Minute | Messages/Hour | Max Concurrent Streams | Token Budget/Month |
|---|---|---|---|---|
| Free | 10 | 100 | 1 | 100,000 |
| Professional | 30 | 500 | 3 | 1,000,000 |
| Enterprise | 60 | 2,000 | 10 | 10,000,000 |
| Unlimited | 120 | No limit | 25 | No limit |
The token budget is the outer boundary. Even if a tenant stays within their per-minute rate limit, once they exhaust their monthly token allocation, the system returns a clear error message directing them to upgrade or wait for the next billing cycle. This prevents bill shock on both sides.
```python
import time


class TenantRateLimiter:
    async def allow(self, tenant_id: str, user_id: str) -> bool:
        key = f"ratelimit:{tenant_id}:{user_id}"
        # Token bucket: add tokens at the configured rate, cap at bucket size
        now = time.time()
        pipe = self.redis.pipeline()
        pipe.hget(key, "tokens")
        pipe.hget(key, "last_refill")
        tokens, last_refill = await pipe.execute()
        bucket = self.get_tenant_bucket(tenant_id)
        elapsed = now - float(last_refill or now)
        current = min(
            bucket.max_tokens,
            float(tokens or bucket.max_tokens) + elapsed * bucket.refill_rate
        )
        if current < 1:
            return False
        # Note: this read-modify-write is not atomic. Under heavy concurrency,
        # move the whole check into a Lua script so Redis executes it in one step.
        await self.redis.hset(key, mapping={
            "tokens": current - 1,
            "last_refill": now,
        })
        return True
```
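The monthly token budget from the table above is a separate, coarser counter layered on top of the per-minute bucket. A minimal sketch — `TokenBudget` is a hypothetical name, and a dict stands in for what would be a Redis `INCRBY` counter keyed by tenant and billing month:

```python
class TokenBudget:
    """Monthly token budget enforcement (outer boundary around rate limits).

    A dict stands in for Redis here; in production the counter would be
    INCRBY on a key like budget:{tenant_id}:{YYYY-MM} with an expiry set
    past the end of the billing month.
    """

    def __init__(self, limits: dict[str, int]):
        self.limits = limits          # tenant_id -> monthly token allowance
        self.used: dict[str, int] = {}

    def record(self, tenant_id: str, tokens: int) -> None:
        # Called after each LLM response with input + output token counts
        self.used[tenant_id] = self.used.get(tenant_id, 0) + tokens

    def allow(self, tenant_id: str) -> bool:
        limit = self.limits.get(tenant_id)
        if limit is None:
            return True  # "Unlimited" tier: no monthly cap
        return self.used.get(tenant_id, 0) < limit
```

When `allow` returns False, the handler sends the clear upgrade-or-wait error described above instead of dispatching to the LLM.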
Cost Control
Token Pricing Reality
LLM APIs charge per token. At enterprise scale, this adds up fast. Here is what a 10,000-user chatbot actually costs, based on real deployment numbers.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Avg. Cost per Conversation | 10K Users, 5 Convos/Day |
|---|---|---|---|---|
| Claude Opus 4.6 | $15.00 | $75.00 | $0.18 | $9,000/day |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.036 | $1,800/day |
| Claude Haiku 4.5 | $0.80 | $4.00 | $0.010 | $500/day |
| GPT-4o | $2.50 | $10.00 | $0.025 | $1,250/day |
| GPT-4o-mini | $0.15 | $0.60 | $0.002 | $75/day |
Those numbers assume 500 input tokens and 200 output tokens per exchange, 4 exchanges per conversation, with the growing conversation history re-sent as input on each exchange (which is why input costs compound across a conversation). The Opus column should make every engineering manager's eyes water. The difference between Opus and Haiku for a customer support chatbot is $8,500 per day, and for most customer support use cases, Haiku handles the work just fine.
Caching Strategies
Three cache tiers reduce LLM API costs by 40-80% in my deployments.
Exact match cache. Identical prompts (same user message, same system prompt, same conversation context) return a cached response instantly. Hit rate: 5-15% in support chatbots, higher for FAQ-heavy deployments.
Semantic cache. Messages with similar meaning (e.g., "How do I reset my password?" and "I forgot my password, how do I reset it?") return the same cached response. Implementation: embed the user message with a fast embedding model, search Redis for vectors within a cosine similarity threshold of 0.95. Hit rate: 15-30% in support chatbots.
Prompt caching (provider-level). Anthropic and OpenAI both offer prompt caching that discounts repeated prefixes (system prompts, few-shot examples). Anthropic charges 10% of the standard rate for cached input tokens. For chatbots with long system prompts, this alone cuts input costs by 80-90%.
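With Anthropic, opting into prompt caching means marking the reusable prefix with a `cache_control` block. A sketch of the request shape, based on Anthropic's documented Messages API — `build_cached_request` is an illustrative helper, and you should verify the parameter shape against the current SDK before relying on it:

```python
def build_cached_request(model: str, system_prompt: str,
                         messages: list[dict], max_tokens: int = 1024) -> dict:
    """Build Messages API kwargs with the system prompt marked cacheable.

    Marking the system block with cache_control lets repeated requests bill
    the prefix at the discounted cached-input rate instead of full price.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }
```

The returned dict would be splatted into `client.messages.create(**kwargs)`; only the prefix up to and including the marked block is cached, so put long, stable content (system prompt, few-shot examples) before anything per-request.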
Model Routing
The smartest cost optimization is sending each message to the cheapest model that can handle it. A simple classifier (itself running on a cheap model) categorizes incoming messages as "simple" (FAQ, greeting, basic lookup), "moderate" (multi-step reasoning, summarization), or "complex" (analysis, code generation, creative writing). Simple messages route to Haiku. Moderate messages route to Sonnet. Complex messages route to Opus. This tiered routing typically saves 60-70% compared to sending everything to a single premium model.
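The routing layer itself is tiny once the classifier exists. A sketch — the tier map and model names here are placeholders, not exact API model IDs, and `classify` stands in for the cheap-model classifier call:

```python
# Illustrative tier map; real deployments use exact provider model IDs
TIER_TO_MODEL = {
    "simple": "claude-haiku",     # FAQ, greetings, basic lookups
    "moderate": "claude-sonnet",  # multi-step reasoning, summarization
    "complex": "claude-opus",     # analysis, code generation, creative work
}


def route_model(classify, message: str) -> str:
    """Pick the cheapest capable model for a message.

    `classify` is an injected callable returning one of the three tiers.
    Anything unrecognized falls back to the mid tier: a mildly wrong
    routing decision is cheaper than a failed request.
    """
    tier = classify(message)
    return TIER_TO_MODEL.get(tier, TIER_TO_MODEL["moderate"])
```

Logging the classifier's tier distribution (see the observability table below) is what validates the routing over time: if 90% of traffic lands in "complex," the classifier prompt needs tuning, not the tier map.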
Observability and Monitoring
What to Log
Every LLM interaction generates structured telemetry. Skip any of these fields and you will regret it during the first production incident.
| Metric | Source | Why It Matters |
|---|---|---|
| Latency (time to first token) | FastAPI middleware | Users notice delays over 500ms |
| Latency (total generation) | LLM gateway | Slow generations indicate model issues |
| Token count (input/output) | LLM API response | Cost tracking, budget enforcement |
| Cache hit/miss | Prompt cache | Validates caching strategy effectiveness |
| Error rate by type | Exception handler | Distinguishes API errors from application bugs |
| Active WebSocket connections | Connection manager | Capacity planning, leak detection |
| Messages per second | Rate limiter | Traffic pattern analysis, abuse detection |
| Model selection distribution | Model router | Validates routing classifier accuracy |
| Tenant-level usage | All layers | Billing, quota enforcement, capacity planning |
| Conversation length | History store | Context window management, summarization triggers |
Tracing Multi-Step Interactions
A single user message can trigger a chain of operations: embedding lookup, knowledge base retrieval, prompt assembly, model call, response caching, history persistence. Without distributed tracing, debugging a slow response means guessing which step took too long.
I instrument every step with OpenTelemetry spans. Each span records the operation name, duration, and relevant metadata (model name, token count, cache status). The entire chain rolls up under a single trace ID tied to the user's message ID. Langfuse and Datadog both ingest these traces and provide dashboards for latency percentiles, cost breakdown by tenant, and error attribution.
```python
from opentelemetry import trace

tracer = trace.get_tracer("chatbot.llm")


async def stream_llm_response(websocket, tenant_id, user_id, message):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("tenant.id", tenant_id)
        span.set_attribute("model", get_tenant_model(tenant_id))
        with tracer.start_as_current_span("history.fetch"):
            conversation = await history.get_recent(tenant_id, user_id)
        with tracer.start_as_current_span("cache.check"):
            cached = await prompt_cache.get(message["content"], tenant_id)
            span.set_attribute("cache.hit", cached is not None)
        if not cached:
            with tracer.start_as_current_span("llm.stream"):
                async for chunk in llm_gateway.stream(...):
                    await websocket.send_json({"type": "token", "content": chunk})
```
The Demo Application
A companion repository demonstrates the patterns described in this article. The application implements a fully functional chatbot with React, FastAPI, and WebSocket streaming.
Repository Structure
```
enterprise-chatbot-demo/
├── backend/
│   ├── app/
│   │   ├── main.py                # FastAPI app + WebSocket endpoint
│   │   ├── auth.py                # JWT validation
│   │   ├── connection_manager.py  # WebSocket connection tracking
│   │   ├── llm_gateway.py         # LLM provider abstraction
│   │   ├── rate_limiter.py        # Token bucket rate limiting
│   │   ├── conversation.py        # History storage (Redis + SQLite)
│   │   └── config.py              # Environment configuration
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   │   ├── ChatWindow.tsx     # Main chat container
│   │   │   ├── MessageList.tsx    # Scrollable message area
│   │   │   ├── Message.tsx        # Individual message bubble
│   │   │   ├── InputArea.tsx      # Text input + controls
│   │   │   └── TypingIndicator.tsx
│   │   ├── hooks/
│   │   │   ├── useWebSocket.ts    # Connection management
│   │   │   └── useChat.ts         # Chat state reducer
│   │   └── context/
│   │       └── WebSocketContext.tsx
│   ├── package.json
│   └── Dockerfile
├── docker-compose.yml             # Full stack: frontend, backend, Redis
└── README.md
```
Running Locally
```bash
git clone https://github.com/CharlesSieg/enterprise-chatbot-demo.git
cd enterprise-chatbot-demo
cp .env.example .env
# Add your ANTHROPIC_API_KEY to .env
docker-compose up
```
The application starts three containers: the React frontend on port 3000, the FastAPI backend on port 8000, and Redis on port 6379. Open http://localhost:3000 and start chatting. The demo uses Claude Haiku by default to keep costs minimal.
Key Implementation Details
The demo uses SQLite instead of DynamoDB for conversation history (no AWS account required) and includes a mock authentication layer that issues JWTs locally. The rate limiter, connection manager, and streaming pipeline match the production architecture described in this article. Swap SQLite for DynamoDB and the mock auth for Cognito, and the demo becomes production infrastructure.
Key Patterns
- Use WebSockets for enterprise chatbots. The bidirectional communication pays for itself in mid-stream cancellation savings alone. SSE works for simpler use cases; enterprise deployments need the control WebSockets provide.
- Implement cancellation from day one. Users cancel generations constantly. Without explicit cancel handling, you burn tokens on responses nobody reads. The WebSocket cancel frame propagates instantly; SSE requires the server to notice a dropped connection.
- Layer your storage. Redis for hot data (active sessions, caches). DynamoDB for durable history. S3 for attachments. Each layer optimized for its access pattern, each layer independently scalable.
- Route by model tier. Sending everything to the most expensive model is lazy engineering. A simple classifier routing messages to the cheapest capable model saves 60-70% on LLM costs without meaningful quality degradation for most queries.
- Cache aggressively. Exact match, semantic, and provider-level prompt caching stack together. Combined, they reduce LLM API calls by 40-80% in support chatbot deployments.
- Isolate tenants at every layer. Tenant-prefixed Redis keys, tenant-partitioned DynamoDB tables, tenant-scoped rate limits. Cross-tenant data leakage in an enterprise chatbot is a career-ending incident.
- Buffer streaming renders. Re-rendering Markdown on every token at 80 tokens per second destroys UI performance. A 50-millisecond buffer drops render frequency to 20 FPS: smooth and efficient.
- Trace everything. Every LLM call, every cache check, every rate limit decision. OpenTelemetry spans with tenant and model attributes. You cannot optimize what you do not measure.
Additional Resources
- FastAPI WebSockets Documentation
- Anthropic Claude API Streaming
- OpenAI Streaming Guide
- Redis Pub/Sub for WebSocket Scaling
- AWS Architecture Blog: Managing Chat History at Scale
- OpenTelemetry LLM Observability
- Langfuse: Open-Source LLM Observability
- Render: Building Real-Time AI Chat Infrastructure
- Token Bucket Algorithm Explained
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

