I am presenting a new paper at ARCS 2026 (39th GI/ITG International Conference on Architecture of Computing Systems) in Mainz, Germany, March 24 to 26.
Rebooting Microreboot: Architectural Support for Safe, Parallel Recovery in Microservice Systems
I spent a lot of time in this space before I even started my PhD. In 2012, Jean-Philippe Martin-Flatin and I co-founded LakeMind, a startup that tackled cross-provider cloud diagnosis. The company was too early for the market, but the problem stuck with me. Over the years I kept coming back to it, through many conversations with colleagues at MIT and MPI-SWS, well before agents became mainstream. The idea for this paper came together last year while working on SQL-of-Thought, our natural-language-to-SQL agent framework. I realized that some of the principles we were exploring there (constraining what an agent can express, validating before executing) applied directly to infrastructure remediation.
It feels like I have come full circle. Back then, we lacked the tools to do what we wanted. The landscape looks very different now: LLM-based agents can diagnose incidents, reason about dependencies, and propose fixes. But that power comes with a real risk. Give an agent raw access to kubectl or cloud APIs and it can easily make a bad situation worse.
Most work in this area lets agents loose on production infrastructure. We take a different approach. Instead of giving agents free rein, we decouple planning from actuation and constrain what the agent can even express. The agent proposes remediation plans using a typed instruction set of seven actions (Restart, Drain, RestoreTraffic, CircuitBreak, RateLimit, Scale, RollbackConfig), each with explicit rollback semantics. A small microkernel validates every plan against the current dependency graph and executes it transactionally. If the plan violates scope or safety constraints, the microkernel rejects it before anything touches production.
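To make the split between planning and actuation concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the implementation from the paper: only the seven action names come from the text above, while `Step`, `PlanRejected`, `validate`, `execute`, and the `apply`/`undo` callbacks are hypothetical names. The dependency graph is assumed to be a plain dict of service names, and the allowed scope a set of services.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    """The seven typed actions named in the post."""
    RESTART = "Restart"
    DRAIN = "Drain"
    RESTORE_TRAFFIC = "RestoreTraffic"
    CIRCUIT_BREAK = "CircuitBreak"
    RATE_LIMIT = "RateLimit"
    SCALE = "Scale"
    ROLLBACK_CONFIG = "RollbackConfig"


@dataclass(frozen=True)
class Step:
    """One plan step (hypothetical shape): a typed action on one service."""
    action: Action
    target: str


class PlanRejected(Exception):
    """Raised when a plan fails validation, before anything is executed."""


def validate(plan, scope, graph):
    """Check every step against the dependency graph and the allowed scope."""
    for step in plan:
        if step.target not in graph:
            raise PlanRejected(f"unknown service: {step.target}")
        if step.target not in scope:
            raise PlanRejected(f"out of scope: {step.target}")


def execute(plan, scope, graph, apply, undo):
    """Validate first, then apply steps in order; undo completed steps on failure."""
    validate(plan, scope, graph)  # rejection happens before any side effect
    done = []
    try:
        for step in plan:
            apply(step)
            done.append(step)
    except Exception:
        # Roll back in reverse order so the system returns to its prior state.
        for step in reversed(done):
            undo(step)
        raise
    return done
```

In this sketch the agent can only build `Step` values from the fixed `Action` enum, so anything outside the instruction set is simply not expressible, and a plan that touches a service outside `scope` raises `PlanRejected` before `apply` is ever called.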
Targeted recovery itself is not a new idea. The original microreboot work by Candea et al. showed twenty years ago that restarting just the failing component is fast and effective. What has changed is that modern microservices have dense, dynamic dependency graphs where a naive restart can cascade failures to dozens of other services. And now we have autonomous agents firing remediation commands at machine speed. The combination requires architectural guardrails.
We infer recovery boundaries online from distributed traces, so the system reflects current workload conditions rather than stale deploy-time configuration. The microkernel enforces these boundaries transactionally. Agents remain useful for diagnosis and planning. They just cannot bypass the safety layer.
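As a rough illustration of what deriving a boundary from traces can look like, the sketch below builds a call graph from observed caller/callee pairs and collects every service that transitively calls a failing one. This is an assumption made for the example, not the inference algorithm from the paper: the span format (a list of `(caller, callee)` tuples), the function names, and the choice of "transitive callers" as the group definition are all hypothetical.

```python
from collections import defaultdict, deque


def build_callers(spans):
    """Map each callee to the set of services observed calling it."""
    callers = defaultdict(set)
    for caller, callee in spans:
        callers[callee].add(caller)
    return callers


def recovery_group(spans, failing):
    """Return the failing service plus all of its transitive callers.

    Assumed definition for this sketch: these are the services whose
    requests could be disturbed if `failing` is restarted.
    """
    callers = build_callers(spans)
    group = {failing}
    queue = deque([failing])
    while queue:  # breadth-first walk up the call graph
        svc = queue.popleft()
        for upstream in callers.get(svc, ()):
            if upstream not in group:
                group.add(upstream)
                queue.append(upstream)
    return group


# Example with made-up service names, fed from a recent window of traces.
spans = [("frontend", "cart"), ("cart", "db"), ("search", "index")]
print(recovery_group(spans, "db"))
```

Because the graph is rebuilt from recent spans rather than read from deploy-time configuration, the group changes as call patterns change, which is the property the paragraph above describes.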
The numbers back this up. On industrial traces from Alibaba (5,459 services) and Meta (392 services), recovery-group inference runs in 21 ms at P99. In simulation, typed actuation cuts agent-caused harm from 77% down to 4%. In online experiments with fault injection on DeathStarBench, we observe 0% agent-caused harm across five fault types, compared to 90% for unconstrained agents. The point is not speed (LLM overhead means the agent path does not always beat simple auto-restart) but safety: the agent helps without making things worse.
The paper received strong reviews, and I am glad it found a good home. Springer will publish the proceedings in the Lecture Notes in Computer Science (LNCS) series.
