About the author: I'm Charles Sieg, a cloud architect and platform engineer who builds apps, services, and infrastructure for Fortune 1000 clients through Vantalect. If your organization is rethinking its software strategy in the age of AI-assisted engineering, let's talk.
Every enterprise wants an AI chatbot now. Most of the tutorials out there will get you a working prototype in an afternoon. Deploying that prototype to production for a Fortune 500 client with 10,000 concurrent users, strict data isolation requirements, and a CFO watching the LLM API bill? That is a different engineering problem entirely. I have built and operated chatbot systems at this scale across regulated industries (healthcare, financial services, government), and the gap between "it works on my laptop" and "it handles production load without bleeding money" is enormous. This article covers the architecture I have landed on after iterating through several generations of enterprise chatbot deployments: React on the frontend, FastAPI on the backend, WebSockets for real-time communication, and a layered storage and caching strategy that keeps costs sane.
Skip this if you want a "build your first chatbot" walkthrough. What follows is a production architecture reference for engineers who need to ship a chatbot that survives contact with enterprise users, enterprise compliance teams, and enterprise budgets.

System Architecture Overview
High-Level Component Map
The architecture separates concerns into five layers: client, gateway, application, model, and storage. Each layer scales independently. Each layer can be swapped without rewriting the others.
The React frontend maintains a persistent WebSocket connection to the FastAPI backend through an Application Load Balancer. Every connection gets authenticated via JWT before the WebSocket handshake completes. Messages flow through a router that handles rate limiting, prompt caching, and LLM dispatch. Conversation state lives in Redis for active sessions and DynamoDB for durable history.
Why WebSockets Over SSE
I have shipped chatbots using both Server-Sent Events (SSE) and WebSockets. The conventional wisdom says SSE is simpler, and for basic chat-and-respond patterns, that is true. For enterprise chatbots, WebSockets win on three fronts.
| Criterion | SSE | WebSocket | Winner |
|---|---|---|---|
| Direction | Server-to-client only | Full duplex | WebSocket |
| Client-side streaming | Requires separate HTTP POST | Native bidirectional | WebSocket |
| Connection overhead | New HTTP request per message sent | Single persistent connection | WebSocket |
| Binary data | Text only (Base64 encoding required) | Native binary frames | WebSocket |
| Mid-stream cancellation | Client closes connection; server may not notice | Client sends cancel frame; server stops immediately | WebSocket |
| Typing indicators | Separate polling endpoint | Native push in both directions | WebSocket |
| Browser connection limit | 6 per domain (HTTP/1.1) | No practical limit | WebSocket |
| Load balancer support | Standard HTTP | Requires sticky sessions or connection draining | SSE |
| Infrastructure complexity | Low | Moderate | SSE |
| Debugging | curl + browser DevTools | Specialized tooling | SSE |
The mid-stream cancellation point deserves emphasis. Enterprise users cancel generations constantly: wrong prompt, too slow, changed their mind. With SSE, the client drops the connection, but the server keeps generating tokens (and burning money) until it notices the disconnect. With WebSockets, the client sends an explicit cancel frame, and the server aborts the LLM call within milliseconds. At 10,000 concurrent users, that difference saves thousands of dollars per month in wasted tokens.
For a deeper comparison of real-time protocols including gRPC, see Real-Time Messaging Protocols: WebSockets, SSE, gRPC, Long Polling, and MQTT Compared.
The WebSocket Layer
Connection Lifecycle
Every WebSocket connection follows a strict lifecycle. Understanding each phase matters because failure modes differ at each stage.
The JWT travels in the initial HTTP upgrade request, either as a query parameter (?token=...) or in the Sec-WebSocket-Protocol header. I prefer the header approach because query parameters end up in access logs and can leak credentials. The server validates the token, extracts the tenant ID and user ID, and registers the connection in Redis before completing the handshake.
Connection Management and Scaling
A single FastAPI instance running on Uvicorn handles between 1,000 and 5,000 concurrent WebSocket connections, depending on the message throughput and how much CPU the LLM streaming consumes. Here are the practical limits I have measured across deployments.
| Resource | Single Instance | 4-Instance Cluster | 16-Instance Cluster |
|---|---|---|---|
| Concurrent connections | 2,000 | 8,000 | 32,000 |
| Messages per second | 500 | 2,000 | 8,000 |
| Memory per connection | ~15 KB | ~15 KB | ~15 KB |
| Memory overhead (total) | 30 MB | 120 MB | 480 MB |
| Redis pub/sub channels | 0 (unnecessary) | 4 | 16 |
| Sticky sessions required | No | Yes | Yes |
When scaling beyond a single instance, Redis pub/sub becomes mandatory. Each server instance only knows about its own connections. When User A on Instance 1 sends a message that needs to trigger a notification to User B on Instance 3, the message routes through Redis pub/sub. Without this, multi-instance deployments silently lose messages.
```python
# Connection manager with Redis pub/sub for multi-instance scaling
import json

from fastapi import WebSocket
from redis.asyncio import Redis


class ConnectionManager:
    def __init__(self, redis_client: Redis):
        self.active: dict[str, WebSocket] = {}
        self.redis = redis_client
        self.pubsub = redis_client.pubsub()

    async def connect(self, user_id: str, websocket: WebSocket):
        await websocket.accept()
        self.active[user_id] = websocket
        await self.redis.sadd("ws:connections", user_id)
        # Subscribe so other instances can reach this user through Redis
        await self.pubsub.subscribe(f"ws:user:{user_id}")

    async def disconnect(self, user_id: str):
        self.active.pop(user_id, None)
        await self.redis.srem("ws:connections", user_id)
        await self.pubsub.unsubscribe(f"ws:user:{user_id}")

    async def send_to_user(self, user_id: str, message: dict):
        if user_id in self.active:
            await self.active[user_id].send_json(message)
        else:
            # User is on a different instance; publish to Redis
            await self.redis.publish(
                f"ws:user:{user_id}",
                json.dumps(message)
            )

    async def relay_loop(self):
        # Run once per instance: deliver messages other instances publish
        # for users connected locally (assumes decode_responses=True)
        async for item in self.pubsub.listen():
            if item["type"] != "message":
                continue
            user_id = item["channel"].rsplit(":", 1)[-1]
            ws = self.active.get(user_id)
            if ws:
                await ws.send_json(json.loads(item["data"]))
```
Heartbeats, Reconnection, and Failure Modes
WebSocket connections die silently. The TCP connection stays open, but no data flows. Without heartbeats, you accumulate zombie connections that consume memory and file descriptors until the server runs out of both.
I run server-side pings every 30 seconds with a 10-second timeout. If the client fails to respond with a pong within 10 seconds, the server closes the connection and cleans up. On the client side, the React app detects disconnections and implements exponential backoff reconnection: 1 second, 2 seconds, 4 seconds, 8 seconds, capped at 30 seconds.
The three failure modes I see most in production:
- Load balancer idle timeout. ALBs close connections after 60 seconds of inactivity by default. Set the idle timeout to 3600 seconds (1 hour) and rely on application-level heartbeats to keep connections alive.
- Deployment-induced disconnections. Rolling deployments drain connections from old instances. Configure a 120-second connection draining period so active conversations complete before the instance shuts down.
- Client backgrounding on mobile. Mobile browsers kill WebSocket connections when the tab goes to the background. The React client stores pending messages in localStorage and replays them on reconnect.
Backend: FastAPI and the Message Pipeline
Request Flow and Message Routing
Every incoming WebSocket message follows a pipeline: authenticate, validate, rate-check, cache-check, route, stream, persist. Here is the core message handler.
```python
from fastapi import FastAPI, Query, WebSocket, WebSocketDisconnect


@app.websocket("/ws/chat")
async def chat_endpoint(websocket: WebSocket, token: str = Query(...)):
    # 1. Authenticate (reject before the connection does any work)
    try:
        claims = verify_jwt(token)
    except Exception:  # invalid, expired, or malformed token
        await websocket.close(code=1008)  # policy violation
        return
    tenant_id = claims["tenant_id"]
    user_id = claims["sub"]
    # 2. Connect
    await manager.connect(user_id, websocket)
    try:
        while True:
            data = await websocket.receive_json()
            msg_type = data.get("type")
            if msg_type == "message":
                # 3. Rate check
                if not await rate_limiter.allow(tenant_id, user_id):
                    await websocket.send_json({
                        "type": "error",
                        "code": "RATE_LIMITED",
                        "message": "Too many requests. Try again shortly."
                    })
                    continue
                # 4. Cache check
                cached = await prompt_cache.get(data["content"], tenant_id)
                if cached:
                    await websocket.send_json({
                        "type": "response",
                        "content": cached,
                        "cached": True
                    })
                    continue
                # 5. Stream from LLM
                await stream_llm_response(
                    websocket, tenant_id, user_id, data
                )
            elif msg_type == "cancel":
                await cancel_active_generation(user_id)
            elif msg_type == "ping":
                await websocket.send_json({"type": "pong"})
    except WebSocketDisconnect:
        pass
    finally:
        # Clean up on every exit path, not just graceful disconnects
        await manager.disconnect(user_id)
```
The cancel message type is critical. When the user clicks "Stop generating," the client sends {"type": "cancel"}, and the server aborts the active LLM API call. With Anthropic's Claude API, this means closing the streaming connection. With OpenAI, you cancel the async task. Either way, you stop burning tokens the instant the user loses interest.
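The server side of cancellation reduces to a registry of in-flight asyncio tasks. A sketch of the `cancel_active_generation` helper referenced above — `run_generation` and the module-level dict are illustrative names, not the article's exact implementation:

```python
import asyncio

# One in-flight generation per user; a dict stands in for real state here
active_generations: dict[str, asyncio.Task] = {}


async def run_generation(user_id: str, coro) -> None:
    """Run an LLM streaming coroutine as a cancellable task."""
    task = asyncio.ensure_future(coro)
    active_generations[user_id] = task
    try:
        await task
    except asyncio.CancelledError:
        # User hit "Stop generating"; the provider stream is closed here,
        # so token billing stops immediately
        pass
    finally:
        active_generations.pop(user_id, None)


async def cancel_active_generation(user_id: str) -> None:
    task = active_generations.get(user_id)
    if task and not task.done():
        task.cancel()
```

Cancelling the task raises `CancelledError` inside the streaming coroutine, which unwinds the `async for` over the provider stream and closes the underlying HTTP connection.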
LLM Integration and Streaming
The LLM gateway abstracts provider differences behind a uniform streaming interface. Every provider streams differently (Anthropic uses Server-Sent Events, OpenAI uses chunked JSON, AWS Bedrock uses event streams), but the chatbot backend normalizes everything into WebSocket JSON frames.
```python
async def stream_llm_response(
    websocket: WebSocket,
    tenant_id: str,
    user_id: str,
    message: dict
):
    conversation = await history.get_recent(tenant_id, user_id, limit=20)
    # Send typing indicator
    await websocket.send_json({"type": "typing", "active": True})
    full_response = []
    async for chunk in llm_gateway.stream(
        model=get_tenant_model(tenant_id),
        messages=conversation + [{"role": "user", "content": message["content"]}],
        system=get_tenant_system_prompt(tenant_id),
    ):
        full_response.append(chunk)
        await websocket.send_json({
            "type": "token",
            "content": chunk,
        })
    # Stop typing indicator
    await websocket.send_json({"type": "typing", "active": False})
    # Persist
    complete_text = "".join(full_response)
    await history.save(tenant_id, user_id, message["content"], complete_text)
    # Cache for future hits
    await prompt_cache.set(message["content"], complete_text, tenant_id)
```
Each tenant gets configured with a different model. A cost-sensitive tenant runs Claude Haiku. A tenant needing maximum quality gets Claude Opus. The gateway handles this routing transparently.
Conversation History: Redis + DynamoDB
Active conversations live in Redis for sub-millisecond reads. Completed conversations persist to DynamoDB for durable storage and compliance. This dual-layer approach keeps the hot path fast without sacrificing durability.
| Layer | Store | Data | TTL | Access Pattern |
|---|---|---|---|---|
| L1: Working memory | Redis Hash | Last 20 messages | 2 hours | Read on every message |
| L2: Session history | Redis List | Full session transcript | 24 hours | Read on reconnect |
| L3: Durable archive | DynamoDB | Complete conversation | Indefinite | Read for compliance, analytics |
| L4: Summarized context | Redis String | LLM-generated summary | 7 days | Injected into system prompt |
The L4 summarization layer is what makes long conversations work. After 20 messages, the raw conversation exceeds most context windows. The system generates a rolling summary using a cheap, fast model (Claude Haiku) and injects it into the system prompt. The user sees a seamless conversation; the backend keeps token counts manageable.
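The folding step can be sketched as a pure function: keep the most recent messages verbatim and compress everything older into the rolling summary. `update_rolling_summary` is an illustrative name, and `summarize` stands in for the cheap-model call (e.g. Claude Haiku) — neither is the article's exact helper:

```python
SUMMARY_TRIGGER = 20  # messages kept verbatim before folding begins


async def update_rolling_summary(tenant_id, user_id, messages,
                                 prior_summary, summarize):
    """Fold older messages into a compact summary via a cheap model.

    `summarize` is an injected async callable taking a prompt and returning
    a summary string. Returns (summary, recent_messages); the summary is
    what gets injected into the system prompt on the next turn.
    """
    if len(messages) <= SUMMARY_TRIGGER:
        return prior_summary, messages
    older, recent = messages[:-SUMMARY_TRIGGER], messages[-SUMMARY_TRIGGER:]
    transcript = "\n".join(f'{m["role"]}: {m["content"]}' for m in older)
    prompt = (
        "Update this running conversation summary with the new messages.\n"
        f"Current summary: {prior_summary or '(none)'}\n"
        f"New messages:\n{transcript}"
    )
    new_summary = await summarize(prompt)
    return new_summary, recent
```

Because only the last 20 messages plus a short summary reach the main model, the input token count stays roughly constant no matter how long the conversation runs.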
For production DynamoDB table design, see AWS DynamoDB: An Architecture Deep-Dive. For Redis caching patterns at scale, see Amazon ElastiCache: An Architecture Deep-Dive.
Frontend: React Chat Interface
Component Architecture
The React frontend follows a strict component hierarchy. State flows down through props. Events bubble up through callbacks. The WebSocket connection lives in a context provider so every component can send and receive messages without prop drilling.
```
ChatApp
├── WebSocketProvider          # Manages connection lifecycle
│   ├── ChatHeader             # Title, status indicator, settings
│   ├── MessageList            # Scrollable message container
│   │   ├── Message            # Individual message bubble
│   │   │   ├── MarkdownRenderer  # Renders LLM markdown output
│   │   │   └── CodeBlock         # Syntax-highlighted code
│   │   ├── TypingIndicator    # Animated dots during generation
│   │   └── ScrollAnchor       # Auto-scroll to bottom
│   └── InputArea              # Text input, send button, cancel
│       ├── TextInput          # Auto-resizing textarea
│       ├── FileUpload         # Drag-and-drop attachments
│       └── ActionButtons      # Send, cancel, clear
```
Streaming Text Rendering
The streaming renderer accumulates tokens and re-renders the Markdown output as each chunk arrives. Naive implementations re-parse the entire accumulated text on every token, which causes visible jank at high token rates. The solution: buffer tokens for 50 milliseconds before triggering a re-render.
```tsx
function useStreamingMessage(websocket: WebSocket) {
  const [content, setContent] = useState("");
  const bufferRef = useRef("");
  const timerRef = useRef<number | null>(null);

  useEffect(() => {
    const handler = (event: MessageEvent) => {
      const data = JSON.parse(event.data);
      if (data.type === "token") {
        bufferRef.current += data.content;
        if (!timerRef.current) {
          timerRef.current = window.setTimeout(() => {
            setContent(prev => prev + bufferRef.current);
            bufferRef.current = "";
            timerRef.current = null;
          }, 50);
        }
      }
    };
    websocket.addEventListener("message", handler);
    return () => {
      websocket.removeEventListener("message", handler);
      // Clear any pending flush so the timer cannot fire after unmount
      if (timerRef.current) {
        window.clearTimeout(timerRef.current);
        timerRef.current = null;
      }
    };
  }, [websocket]);

  return content;
}
```
That 50-millisecond buffer is the difference between 60 FPS scrolling and a stuttering mess. Claude Opus streams at roughly 80 tokens per second. Without buffering, that is 80 React re-renders per second, each triggering a Markdown parse, DOM diff, and scroll adjustment. With the buffer, you get 20 re-renders per second: smooth, responsive, and kind to mobile devices.
State Management for Conversations
I use useReducer over useState for conversation state. Chat state has too many interdependent fields (messages, loading, error, active generation ID, typing status) for individual useState calls to manage cleanly. A reducer keeps state transitions explicit and debuggable.
```tsx
type ChatAction =
  | { type: "ADD_USER_MESSAGE"; content: string }
  | { type: "START_STREAMING"; messageId: string }
  | { type: "APPEND_TOKEN"; messageId: string; token: string }
  | { type: "FINISH_STREAMING"; messageId: string }
  | { type: "CANCEL_GENERATION" }
  | { type: "SET_ERROR"; error: string }
  | { type: "CLEAR_CONVERSATION" };

function chatReducer(state: ChatState, action: ChatAction): ChatState {
  switch (action.type) {
    case "ADD_USER_MESSAGE":
      return {
        ...state,
        messages: [...state.messages, {
          id: crypto.randomUUID(),
          role: "user",
          content: action.content,
          timestamp: Date.now(),
        }],
      };
    case "START_STREAMING":
      return {
        ...state,
        isStreaming: true,
        activeMessageId: action.messageId,
        messages: [...state.messages, {
          id: action.messageId,
          role: "assistant",
          content: "",
          timestamp: Date.now(),
        }],
      };
    case "APPEND_TOKEN":
      return {
        ...state,
        messages: state.messages.map(m =>
          m.id === action.messageId
            ? { ...m, content: m.content + action.token }
            : m
        ),
      };
    // ... remaining cases
    default:
      return state;
  }
}
```
Multi-Tenancy and Authentication
JWT-Based Tenant Isolation
Every WebSocket connection carries a JWT that identifies both the user and the tenant. The JWT gets validated during the WebSocket handshake, before the 101 Switching Protocols response. If the token is invalid, expired, or missing the tenant claim, the server rejects the upgrade with a 401.
The tenant ID drives everything downstream: which LLM model to use, which system prompt to inject, which rate limits to apply, which DynamoDB partition to write conversation history to. There is no code path where a message from Tenant A can reach Tenant B's data. The isolation happens at the application layer (tenant-scoped queries), the cache layer (tenant-prefixed Redis keys), and the storage layer (tenant-partitioned DynamoDB tables).
```python
# Every Redis key includes the tenant prefix
REDIS_KEY_PATTERN = "chat:{tenant_id}:{user_id}:{key}"

# Every DynamoDB query scopes to the tenant partition
table.query(
    KeyConditionExpression=Key("pk").eq(f"TENANT#{tenant_id}#USER#{user_id}")
)
```
For a deep dive on Cognito-based authentication that integrates with this pattern, see AWS Cognito User Authentication: An Architecture Deep-Dive.
Per-Tenant Rate Limiting
Rate limits protect both the system and the budget. Without them, a single enthusiastic user can exhaust the LLM API quota for the entire platform. I use a token bucket algorithm implemented in Redis, with bucket sizes configured per tenant tier.
| Tenant Tier | Messages/Minute | Messages/Hour | Max Concurrent Streams | Token Budget/Month |
|---|---|---|---|---|
| Free | 10 | 100 | 1 | 100,000 |
| Professional | 30 | 500 | 3 | 1,000,000 |
| Enterprise | 60 | 2,000 | 10 | 10,000,000 |
| Unlimited | 120 | No limit | 25 | No limit |
The token budget is the outer boundary. Even if a tenant stays within their per-minute rate limit, once they exhaust their monthly token allocation, the system returns a clear error message directing them to upgrade or wait for the next billing cycle. This prevents bill shock on both sides.
```python
import time


class TenantRateLimiter:
    async def allow(self, tenant_id: str, user_id: str) -> bool:
        key = f"ratelimit:{tenant_id}:{user_id}"
        # Token bucket: add tokens at the configured rate, cap at bucket size
        now = time.time()
        pipe = self.redis.pipeline()
        pipe.hget(key, "tokens")
        pipe.hget(key, "last_refill")
        tokens, last_refill = await pipe.execute()
        bucket = self.get_tenant_bucket(tenant_id)
        elapsed = now - float(last_refill or now)
        current = min(
            bucket.max_tokens,
            float(tokens or bucket.max_tokens) + elapsed * bucket.refill_rate
        )
        if current < 1:
            return False
        # Note: this read-modify-write is not atomic. Under heavy concurrency,
        # move the whole check into a Lua script so Redis executes it in one step.
        await self.redis.hset(key, mapping={
            "tokens": current - 1,
            "last_refill": now,
        })
        return True
```
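The monthly token budget from the table above is a separate, coarser counter layered on top of the per-minute bucket. A minimal sketch — `TokenBudget` is a hypothetical name, and a dict stands in for what would be a Redis `INCRBY` counter keyed by tenant and billing month:

```python
class TokenBudget:
    """Monthly token budget enforcement (outer boundary around rate limits).

    A dict stands in for Redis here; in production the counter would be
    INCRBY on a key like budget:{tenant_id}:{YYYY-MM} with an expiry set
    past the end of the billing month.
    """

    def __init__(self, limits: dict[str, int]):
        self.limits = limits          # tenant_id -> monthly token allowance
        self.used: dict[str, int] = {}

    def record(self, tenant_id: str, tokens: int) -> None:
        # Called after each LLM response with input + output token counts
        self.used[tenant_id] = self.used.get(tenant_id, 0) + tokens

    def allow(self, tenant_id: str) -> bool:
        limit = self.limits.get(tenant_id)
        if limit is None:
            return True  # "Unlimited" tier: no monthly cap
        return self.used.get(tenant_id, 0) < limit
```

When `allow` returns False, the handler sends the clear upgrade-or-wait error described above instead of dispatching to the LLM.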
Cost Control
Token Pricing Reality
LLM APIs charge per token. At enterprise scale, this adds up fast. Here is what a 10,000-user chatbot actually costs, based on real deployment numbers.
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Avg. Cost per Conversation | 10K Users, 5 Convos/Day |
|---|---|---|---|---|
| Claude Opus 4.6 | $15.00 | $75.00 | $0.18 | $9,000/day |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.036 | $1,800/day |
| Claude Haiku 4.5 | $0.80 | $4.00 | $0.010 | $500/day |
| GPT-4o | $2.50 | $10.00 | $0.025 | $1,250/day |
| GPT-4o-mini | $0.15 | $0.60 | $0.002 | $75/day |
Those numbers assume 500 input tokens and 200 output tokens per exchange, 4 exchanges per conversation, with the growing conversation history re-sent as input on each exchange (which is why input costs compound across a conversation). The Opus column should make every engineering manager's eyes water. The difference between Opus and Haiku for a customer support chatbot is $8,500 per day, and for most customer support use cases, Haiku handles the work just fine.
Caching Strategies
Three cache tiers reduce LLM API costs by 40-80% in my deployments.
Exact match cache. Identical prompts (same user message, same system prompt, same conversation context) return a cached response instantly. Hit rate: 5-15% in support chatbots, higher for FAQ-heavy deployments.
Semantic cache. Messages with similar meaning (e.g., "How do I reset my password?" and "I forgot my password, how do I reset it?") return the same cached response. Implementation: embed the user message with a fast embedding model, search Redis for vectors within a cosine similarity threshold of 0.95. Hit rate: 15-30% in support chatbots.
Prompt caching (provider-level). Anthropic and OpenAI both offer prompt caching that discounts repeated prefixes (system prompts, few-shot examples). Anthropic charges 10% of the standard rate for cached input tokens. For chatbots with long system prompts, this alone cuts input costs by 80-90%.
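With Anthropic, opting into prompt caching means marking the reusable prefix with a `cache_control` block. A sketch of the request shape, based on Anthropic's documented Messages API — `build_cached_request` is an illustrative helper, and you should verify the parameter shape against the current SDK before relying on it:

```python
def build_cached_request(model: str, system_prompt: str,
                         messages: list[dict], max_tokens: int = 1024) -> dict:
    """Build Messages API kwargs with the system prompt marked cacheable.

    Marking the system block with cache_control lets repeated requests bill
    the prefix at the discounted cached-input rate instead of full price.
    """
    return {
        "model": model,
        "max_tokens": max_tokens,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": messages,
    }
```

The returned dict would be splatted into `client.messages.create(**kwargs)`; only the prefix up to and including the marked block is cached, so put long, stable content (system prompt, few-shot examples) before anything per-request.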
Model Routing
The smartest cost optimization is sending each message to the cheapest model that can handle it. A simple classifier (itself running on a cheap model) categorizes incoming messages as "simple" (FAQ, greeting, basic lookup), "moderate" (multi-step reasoning, summarization), or "complex" (analysis, code generation, creative writing). Simple messages route to Haiku. Moderate messages route to Sonnet. Complex messages route to Opus. This tiered routing typically saves 60-70% compared to sending everything to a single premium model.
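The routing layer itself is tiny once the classifier exists. A sketch — the tier map and model names here are placeholders, not exact API model IDs, and `classify` stands in for the cheap-model classifier call:

```python
# Illustrative tier map; real deployments use exact provider model IDs
TIER_TO_MODEL = {
    "simple": "claude-haiku",     # FAQ, greetings, basic lookups
    "moderate": "claude-sonnet",  # multi-step reasoning, summarization
    "complex": "claude-opus",     # analysis, code generation, creative work
}


def route_model(classify, message: str) -> str:
    """Pick the cheapest capable model for a message.

    `classify` is an injected callable returning one of the three tiers.
    Anything unrecognized falls back to the mid tier: a mildly wrong
    routing decision is cheaper than a failed request.
    """
    tier = classify(message)
    return TIER_TO_MODEL.get(tier, TIER_TO_MODEL["moderate"])
```

Logging the classifier's tier distribution (see the observability table below) is what validates the routing over time: if 90% of traffic lands in "complex," the classifier prompt needs tuning, not the tier map.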
Observability and Monitoring
What to Log
Every LLM interaction generates structured telemetry. Skip any of these fields and you will regret it during the first production incident.
| Metric | Source | Why It Matters |
|---|---|---|
| Latency (time to first token) | FastAPI middleware | Users notice delays over 500ms |
| Latency (total generation) | LLM gateway | Slow generations indicate model issues |
| Token count (input/output) | LLM API response | Cost tracking, budget enforcement |
| Cache hit/miss | Prompt cache | Validates caching strategy effectiveness |
| Error rate by type | Exception handler | Distinguishes API errors from application bugs |
| Active WebSocket connections | Connection manager | Capacity planning, leak detection |
| Messages per second | Rate limiter | Traffic pattern analysis, abuse detection |
| Model selection distribution | Model router | Validates routing classifier accuracy |
| Tenant-level usage | All layers | Billing, quota enforcement, capacity planning |
| Conversation length | History store | Context window management, summarization triggers |
Tracing Multi-Step Interactions
A single user message can trigger a chain of operations: embedding lookup, knowledge base retrieval, prompt assembly, model call, response caching, history persistence. Without distributed tracing, debugging a slow response means guessing which step took too long.
I instrument every step with OpenTelemetry spans. Each span records the operation name, duration, and relevant metadata (model name, token count, cache status). The entire chain rolls up under a single trace ID tied to the user's message ID. Langfuse and Datadog both ingest these traces and provide dashboards for latency percentiles, cost breakdown by tenant, and error attribution.
```python
from opentelemetry import trace

tracer = trace.get_tracer("chatbot.llm")


async def stream_llm_response(websocket, tenant_id, user_id, message):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("tenant.id", tenant_id)
        span.set_attribute("model", get_tenant_model(tenant_id))
        with tracer.start_as_current_span("history.fetch"):
            conversation = await history.get_recent(tenant_id, user_id)
        with tracer.start_as_current_span("cache.check"):
            cached = await prompt_cache.get(message["content"], tenant_id)
            span.set_attribute("cache.hit", cached is not None)
        if not cached:
            with tracer.start_as_current_span("llm.stream"):
                async for chunk in llm_gateway.stream(...):
                    await websocket.send_json({"type": "token", "content": chunk})
```
The Demo Application
A companion repository demonstrates the patterns described in this article. The application implements a fully functional chatbot with React, FastAPI, and WebSocket streaming.
Repository Structure
```
enterprise-chatbot-demo/
├── backend/
│   ├── app/
│   │   ├── main.py                # FastAPI app + WebSocket endpoint
│   │   ├── auth.py                # JWT validation
│   │   ├── connection_manager.py  # WebSocket connection tracking
│   │   ├── llm_gateway.py         # LLM provider abstraction
│   │   ├── rate_limiter.py        # Token bucket rate limiting
│   │   ├── conversation.py        # History storage (Redis + SQLite)
│   │   └── config.py              # Environment configuration
│   ├── requirements.txt
│   └── Dockerfile
├── frontend/
│   ├── src/
│   │   ├── App.tsx
│   │   ├── components/
│   │   │   ├── ChatWindow.tsx     # Main chat container
│   │   │   ├── MessageList.tsx    # Scrollable message area
│   │   │   ├── Message.tsx        # Individual message bubble
│   │   │   ├── InputArea.tsx      # Text input + controls
│   │   │   └── TypingIndicator.tsx
│   │   ├── hooks/
│   │   │   ├── useWebSocket.ts    # Connection management
│   │   │   └── useChat.ts         # Chat state reducer
│   │   └── context/
│   │       └── WebSocketContext.tsx
│   ├── package.json
│   └── Dockerfile
├── docker-compose.yml             # Full stack: frontend, backend, Redis
└── README.md
```
Running Locally
```bash
git clone https://github.com/CharlesSieg/enterprise-chatbot-demo.git
cd enterprise-chatbot-demo
cp .env.example .env
# Add your ANTHROPIC_API_KEY to .env
docker-compose up
```
The application starts three containers: the React frontend on port 3000, the FastAPI backend on port 8000, and Redis on port 6379. Open http://localhost:3000 and start chatting. The demo uses Claude Haiku by default to keep costs minimal.
Key Implementation Details
The demo uses SQLite instead of DynamoDB for conversation history (no AWS account required) and includes a mock authentication layer that issues JWTs locally. The rate limiter, connection manager, and streaming pipeline match the production architecture described in this article. Swap SQLite for DynamoDB and the mock auth for Cognito, and the demo becomes production infrastructure.
Key Patterns
- Use WebSockets for enterprise chatbots. The bidirectional communication pays for itself in mid-stream cancellation savings alone. SSE works for simpler use cases; enterprise deployments need the control WebSockets provide.
- Implement cancellation from day one. Users cancel generations constantly. Without explicit cancel handling, you burn tokens on responses nobody reads. The WebSocket cancel frame propagates instantly; SSE requires the server to notice a dropped connection.
- Layer your storage. Redis for hot data (active sessions, caches). DynamoDB for durable history. S3 for attachments. Each layer optimized for its access pattern, each layer independently scalable.
- Route by model tier. Sending everything to the most expensive model is lazy engineering. A simple classifier routing messages to the cheapest capable model saves 60-70% on LLM costs without meaningful quality degradation for most queries.
- Cache aggressively. Exact match, semantic, and provider-level prompt caching stack together. Combined, they reduce LLM API calls by 40-80% in support chatbot deployments.
- Isolate tenants at every layer. Tenant-prefixed Redis keys, tenant-partitioned DynamoDB tables, tenant-scoped rate limits. Cross-tenant data leakage in an enterprise chatbot is a career-ending incident.
- Buffer streaming renders. Re-rendering Markdown on every token at 80 tokens per second destroys UI performance. A 50-millisecond buffer drops render frequency to 20 FPS: smooth and efficient.
- Trace everything. Every LLM call, every cache check, every rate limit decision. OpenTelemetry spans with tenant and model attributes. You cannot optimize what you do not measure.
Additional Resources
- FastAPI WebSockets Documentation
- Anthropic Claude API Streaming
- OpenAI Streaming Guide
- Redis Pub/Sub for WebSocket Scaling
- AWS Architecture Blog: Managing Chat History at Scale
- OpenTelemetry LLM Observability
- Langfuse: Open-Source LLM Observability
- Render: Building Real-Time AI Chat Infrastructure
- Token Bucket Algorithm Explained
Let's Build Something!
I help teams ship cloud infrastructure that actually works at scale. Whether you're modernizing a legacy platform, designing a multi-region architecture from scratch, or figuring out how AI fits into your engineering workflow, I've seen your problem before. Let me help.
Currently taking on select consulting engagements through Vantalect.

