From Static to Continuous: A Design Pattern for Real-Time ML Systems

There's a structural choice in ML system design that gets made early, usually implicitly, and then dictates the entire shape of the resulting product. The choice is: does the model's state update between user interactions, or in a scheduled batch job that runs after those interactions are over?

The batch-update approach is the default. It's how most production ML systems work — including the ones at scale. User actions get logged. Overnight (or every few hours), a job runs that pulls the day's logs, recomputes embeddings or scores or features, and writes them back. The model the user sees at noon Tuesday is the model that was built from data up through midnight Monday.

The continuous-update approach is rarer but increasingly viable. The model's state is updated in process memory, in real time, as user actions occur. The state at noon Tuesday incorporates everything up to noon Tuesday, including the click the user made thirty seconds ago. There is no batch job. The model is always current.
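To make the contrast concrete, here is a toy sketch of the two update styles. It is not a real serving stack: the class names (BatchScorer, ContinuousScorer), the click-count "model", and the rebuild() method standing in for the nightly job are all assumptions made for illustration.

```python
from collections import defaultdict


class BatchScorer:
    """Serves a snapshot that only changes when rebuild() runs (the 'nightly job')."""

    def __init__(self):
        self.log = []                      # raw events, appended as they happen
        self.scores = defaultdict(float)   # the snapshot users actually see

    def record(self, user, item):
        self.log.append((user, item))      # logged, but not yet visible in scores

    def rebuild(self):
        fresh = defaultdict(float)
        for user, item in self.log:        # recompute everything from the log
            fresh[(user, item)] += 1.0
        self.scores = fresh                # swap in the new snapshot


class ContinuousScorer:
    """Updates the in-memory state inline with each event. No batch job."""

    def __init__(self):
        self.scores = defaultdict(float)

    def record(self, user, item):
        self.scores[(user, item)] += 1.0   # visible to the very next request


batch, live = BatchScorer(), ContinuousScorer()
for scorer in (batch, live):
    scorer.record("u1", "jazz")

stale = batch.scores[("u1", "jazz")]          # still 0.0: the job has not run
current = live.scores[("u1", "jazz")]         # already 1.0
batch.rebuild()
after_rebuild = batch.scores[("u1", "jazz")]  # 1.0, but only after the job
```

Both scorers see the same event. Only the continuous one reflects it before the next scheduled rebuild.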

The difference between these two architectures is not subtle. They're built with different tools, staffed by different engineers, optimized for different metrics, and unlock different product capabilities. The choice between them is one of the most consequential design decisions in a production ML system, and most teams don't realize they're making it.

The Batch Default

Why is batch the default? Because for most of the history of production ML, it was the only option that scaled. Updating a model in real time requires the model to be small enough to fit in active memory, the update operation to be efficient enough to run inline with user requests, and the engineering team to have built (or bought) a serving stack that supports stateful in-process inference.

Until roughly five years ago, that combination was rare. Most production ML lived on Spark clusters, or in periodic Airflow DAGs, or in batch Beam jobs running on Dataflow. The architecture assumed that retraining and re-scoring happened on a schedule, not on demand. The user-facing serving layer pulled pre-computed scores out of Redis or DynamoDB and rendered them.

This architecture works fine for use cases where the freshness of the model matters less than the throughput. Product recommendations on a retail site, for example: a model that's eight hours stale is not noticeably worse than a model that's eight seconds stale, because the user's preferences don't change that fast and the volume of inference is what matters. The same is true for spam classification, fraud detection at the daily-summary level, marketing-segment scoring, and a long tail of other production ML use cases.

But there's a category of use cases where the batch-update default is catastrophic, and that's where the design choice starts to matter.

When Batch Breaks

The batch architecture breaks in any system where the current user action should causally affect the next decision the system makes for the same user. The clearest examples:

Conversational AI. If the model's state doesn't incorporate the last message the user sent, every turn feels disconnected. The whole point of a conversation is that turn N+1 reflects turn N. Batch updates are unusable here. (LLM systems sidestep this by holding the conversation history in the prompt, but that's a workaround, not a solution.)
Adaptive learning. If the learner's proficiency model doesn't incorporate their last answer, the next question can't actually use it. The system reverts to a static-content delivery loop. (I've written about this specific failure mode at length elsewhere.)
Real-time game opponents. If the AI opponent's belief state doesn't update on the player's last move, the game becomes mechanical. The opponent is reactive in name only.
Interactive trading systems. If the model's view of market state doesn't update on the last trade, the system is making decisions on yesterday's information. This is why HFT systems use FPGAs and bare metal — the latency requirement is non-negotiable.
Active recommendation systems. Distinct from passive product recommendations, an active recommender (Spotify's autoplay, YouTube's "what's next") needs to incorporate what the user is doing right now to choose what to surface next. Otherwise it's just a playlist.

For these cases, the batch architecture is not merely slower; it is wrong. It produces a categorically different product. The user experience of a continuously updated system feels alive; the user experience of a batch-updated system feels recorded.

The Continuous-Update Architecture

The architectural pattern for continuous updates has settled out into a recognizable shape over the last few years. The core elements:

Stateful in-process inference. The model lives in process memory for the duration of a user session (or longer). It is not pickled to a database between requests. The per-user state vector is a tensor that gets updated in place.
A long-lived inference server with stickiness. User requests are routed back to the same server instance whenever possible, so the in-memory state doesn't need to be re-hydrated on every request. This is the opposite of the stateless-HTTP design that dominates the rest of the web stack, and it requires routing-layer support (a sticky-session load balancer or a dedicated routing service).
A differentiable update operation. The "update the state given a new observation" operation is a forward-and-backward pass through a small model, not a SQL UPDATE statement. This means PyTorch, JAX, ONNX Runtime, or equivalent — and it means the engineering team needs ML systems expertise, not just web-app expertise.
A persistence story that doesn't dominate the latency budget. The state has to survive process restarts, but the persistence write can't be in the request critical path. Typically this means write-behind to S3 or to a journal, with the in-memory state as the source of truth during the session.
A latency budget measured in milliseconds, not seconds. Once you commit to continuous updates, the whole stack has to be designed around the assumption that the update happens inline with the user interaction. If any layer of the stack adds 100 milliseconds of overhead, the architecture fails.
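The first and third elements above (state held in process memory, updated by a forward-and-backward pass) can be sketched in a few lines. This is a minimal illustration, not a production implementation: the "small model" is assumed to be a logistic model over a per-user state vector, the backward pass is written out by hand instead of using PyTorch or JAX autodiff, and UserState, sessions, and handle_event are hypothetical names.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class UserState:
    """Per-user state vector that lives in process memory and is updated in place."""

    def __init__(self, dim, lr=0.5):
        self.theta = [0.0] * dim   # the per-user state
        self.lr = lr               # step size (illustrative value)

    def predict(self, x):
        # Forward pass: probability of a positive outcome for features x.
        return sigmoid(sum(t * xi for t, xi in zip(self.theta, x)))

    def update(self, x, y):
        p = self.predict(x)        # forward pass
        err = y - p                # gradient of the log-likelihood w.r.t. the logit
        for i, xi in enumerate(x):
            self.theta[i] += self.lr * err * xi   # backward pass: one in-place step
        return p                   # the prediction made before this observation


sessions = {}   # in-process store: user_id -> UserState (no database round trip)


def handle_event(user_id, x, y):
    """Called inline with the user request: look up state, update it, return."""
    state = sessions.setdefault(user_id, UserState(dim=len(x)))
    return state.update(x, y)


before = handle_event("u1", [1.0, 0.0], 1)    # prediction before the update
after = sessions["u1"].predict([1.0, 0.0])    # prediction for the next request
```

The next request for the same user sees a state that already incorporates the last observation, which is the property the batch architecture cannot provide.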

None of these elements is novel. All of them have been deployed at scale in production. What's novel is combining them in a single coherent system, and treating "the model is always current" as a first-order design constraint.
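Two of the elements above, sticky routing and write-behind persistence, are easy to sketch with the standard library. The assumptions here are significant: the server names are made up, routing is a plain hash-modulo where a real deployment would more likely use consistent hashing or a sticky-session load balancer, and the journal is an in-memory list standing in for S3 or an append-only log.

```python
import hashlib
import json
import queue
import threading

SERVERS = ["inference-0", "inference-1", "inference-2"]   # hypothetical instances


def route(user_id, servers=SERVERS):
    """Map a user to the same server every time, using a stable hash."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return servers[int.from_bytes(digest[:8], "big") % len(servers)]


class WriteBehindStore:
    """In-memory state is the source of truth; persistence happens off the request path."""

    def __init__(self):
        self.state = {}                  # user_id -> current state
        self.journal = []                # stand-in for durable storage
        self._pending = queue.Queue()
        threading.Thread(target=self._drain, daemon=True).start()

    def update(self, user_id, value):
        self.state[user_id] = value              # visible immediately
        self._pending.put((user_id, value))      # enqueue only: no I/O here

    def _drain(self):
        # Background worker: persists queued updates one at a time.
        while True:
            user_id, value = self._pending.get()
            self.journal.append(json.dumps({"user": user_id, "state": value}))
            self._pending.task_done()

    def flush(self):
        """Block until every queued write has been persisted (e.g. at shutdown)."""
        self._pending.join()


store = WriteBehindStore()
store.update("u1", [0.25, 0.0])
store.flush()
```

The trade-off is the usual one for write-behind: a crash between update() and the background write loses the most recent state, so the journal lag has to be something the product can tolerate.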

The Cultural Shift

The bigger thing that happens when a team moves from batch to continuous is cultural. The questions the team asks change.

In a batch shop, the operational question of the day is "did the nightly job finish?" In a continuous shop, the operational question is "what's our P99 update latency?" The first is a binary completion check; the second is a continuous performance metric. The mindset of the engineering team has to shift from "we ran the model once and now we serve it" to "we are running the model right now, and we will keep running it forever."
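For readers who have not tracked a tail-latency metric before, a P99 is simply the value that 99 percent of observed update latencies fall at or below. A minimal sketch, assuming the nearest-rank definition of a percentile and synthetic latency samples; the helper names percentile and timed are illustrative only.

```python
import math
import time


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0


# Synthetic latencies of 1..100 ms, standing in for measured update times.
latencies_ms = list(range(1, 101))
p50 = percentile(latencies_ms, 50)   # 50
p99 = percentile(latencies_ms, 99)   # 99
```

In a live system, the samples would come from wrapping the state-update call with something like timed() and feeding the results into a rolling window or a metrics library.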

In a batch shop, model improvements get tested by retraining on a snapshot, comparing offline metrics, and shipping the better model in the next overnight job. In a continuous shop, model improvements get tested by shadow inference: running the new model alongside the old model in production, comparing the predictions in real time, and switching over when the new one is measurably better.
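Shadow inference can be sketched as a thin wrapper around two models. This is one possible shape, under stated assumptions: models are plain callables returning a number, quality is compared by accumulated squared error, and the ShadowRunner class, its min_samples threshold, and the toy models below are all hypothetical.

```python
class ShadowRunner:
    """Serve the primary model; score the candidate on the same traffic, silently."""

    def __init__(self, primary, candidate):
        self.primary = primary
        self.candidate = candidate
        self.errors = {"primary": 0.0, "candidate": 0.0}   # summed squared error
        self.count = 0

    def serve(self, x):
        # Only the first value goes back to the user; the second is kept for scoring.
        return self.primary(x), self.candidate(x)

    def record(self, served, shadowed, outcome):
        # Called once the real outcome for this request is known.
        self.errors["primary"] += (served - outcome) ** 2
        self.errors["candidate"] += (shadowed - outcome) ** 2
        self.count += 1

    def candidate_wins(self, min_samples=100):
        # Refuse to decide on too little traffic.
        if self.count < min_samples:
            return False
        return self.errors["candidate"] < self.errors["primary"]


# Toy models: the primary always guesses 0.5, the candidate echoes its input.
runner = ShadowRunner(primary=lambda x: 0.5, candidate=lambda x: float(x))
for i in range(200):
    x = i % 2                       # toy input; the outcome equals the input here
    served, shadowed = runner.serve(x)
    runner.record(served, shadowed, outcome=x)

switch = runner.candidate_wins()
```

A real switchover rule would usually add a margin or a statistical test on top of the raw comparison. The point of the sketch is only that the candidate sees production traffic without ever affecting what the user receives.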

In a batch shop, hardware is sized for the retraining job. In a continuous shop, hardware is sized for the steady-state inference load. These are completely different cost curves.

In a batch shop, a bug in the model gets caught after the fact, when the next morning's metrics look weird. In a continuous shop, a bug in the model gets caught in real time, when the live error rate spikes, but by then it is already affecting users. The blast radius of a bad model deploy is much larger, and the rollback story has to be cleaner.

Every one of these shifts is uncomfortable for an engineering organization that grew up in the batch world. The teams that successfully make the shift tend to have a small number of engineers who came from the real-time-systems world — trading, ad-tech, real-time games, telecom — who bring the latency-first instinct with them. The teams that struggle tend to be web shops that bolted ML onto a CRUD app and now want continuous behavior without rebuilding the substrate.

When You Should Make the Switch

Not every ML system needs continuous updates. Most production recommenders are fine on the batch substrate. Spam classifiers, fraud screens, marketing segmentation, content moderation — batch is fine.

The signal that you should be on the continuous substrate is when the user-facing product promises responsiveness to the current interaction, but the architecture can't deliver it. The symptoms are recognizable:

The product team wants "real-time adaptation" but engineering ships a nightly batch job and hopes nobody notices.
The latency budget for the "live" experience keeps creeping up. First it was 100 milliseconds; then 500 milliseconds; now 2 seconds.
The engineering team is layering caches on top of the batch results to fake responsiveness, and the caches are constantly stale.
The product is described as "smart" or "adaptive" but the user experience doesn't actually feel that way.

If those symptoms are present, the architecture is wrong. The fix is structural, not incremental. Layering more caching on top of a batch system doesn't make it continuous; it just makes the staleness less visible. The actual fix is to rebuild the inference substrate to support in-process stateful updates, route requests stickily, and measure success in milliseconds instead of overnight job completion.

That's a multi-quarter engineering effort, not a sprint. The teams that do it correctly come out the other side with a product capability that the batch competitors can't match. The teams that try to retrofit usually give up and ship "we improved the batch frequency to hourly" as a substitute, which fools nobody who's paying attention.

What This Has to Do With Adaptive Learning

The adaptive-learning category is in the middle of this exact transition. Most existing platforms are batch shops with quiz engines on top. A few are starting to ship continuous-update systems. The user experience gap between the two is going to define the next five years of the category.

The same pattern is going to play out across every other ML-driven category over the next decade: real-time conversational interfaces, real-time game AI, real-time content personalization, real-time anything where the user's last action should matter for the next decision. The platforms built on the right substrate from day one will eat the platforms built on the wrong one. The retrofit is hard enough that most batch-architecture incumbents will lose, slowly at first and then all at once.

The choice between batch and continuous is one of those architectural decisions that gets made before anyone realizes how consequential it is. If you're building a system where the user's actions should causally affect what the system does next, make the choice deliberately and early. Retrofitting is harder than building right.

Part of a series on production ML systems and adaptive learning architecture.