I am excited to announce that our paper "Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference" has been accepted at EMNLP 2025!
EMNLP (Empirical Methods in Natural Language Processing) is one of the premier venues for NLP research, taking place November 4-9, 2025 in Suzhou, China.
The Problem
Inference accounts for the majority of costs over the lifecycle of a large language model. While numerous LLM inference engines focus on low-level optimizations, non-intrusive client-side frameworks that perform high-level optimizations remain scarce.
Our Solution
Cache Saver is a modular, plug-and-play, asynchronous framework for high-level inference optimizations. Its key novelty is a namespace-aware, list-valued cache that preserves the statistical integrity of LLM responses: responses within a namespace remain i.i.d., and cached results also make experiments reproducible.
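To make the caching idea concrete, here is a minimal Python sketch of a namespace-aware, list-valued cache. This is an illustration of the concept only, not the actual Cache Saver API; the names (`ListCache`, `get_response`, `sample_llm`) and the bookkeeping details are my own assumptions. The idea: each prompt maps to a list of sampled responses; within a namespace, repeated requests consume distinct list entries, so the samples a namespace sees stay i.i.d.; across namespaces, previously generated samples can be reused, which is where the savings come from.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

class ListCache:
    """Illustrative namespace-aware list-valued cache (not the real Cache Saver API).

    Each prompt maps to a *list* of sampled responses. Within a namespace,
    repeated requests for the same prompt consume distinct list entries, so
    the responses that namespace sees remain i.i.d. draws from the model.
    Across namespaces, previously generated samples are reused.
    """

    def __init__(self, sample_llm: Callable[[str], str]):
        self.sample_llm = sample_llm                               # queries the LLM once per call
        self.responses: Dict[str, List[str]] = defaultdict(list)   # prompt -> list of samples
        self.used: Dict[Tuple[str, str], int] = defaultdict(int)   # (namespace, prompt) -> samples consumed

    def get_response(self, namespace: str, prompt: str) -> str:
        """Return the next unused cached sample for this namespace, generating one only if needed."""
        idx = self.used[(namespace, prompt)]
        samples = self.responses[prompt]
        if idx >= len(samples):
            # This namespace has exhausted the cached samples: pay for one more model call.
            samples.append(self.sample_llm(prompt))
        self.used[(namespace, prompt)] += 1
        return samples[idx]

# Toy usage with a fake "LLM" that returns a random number as text.
if __name__ == "__main__":
    cache = ListCache(sample_llm=lambda prompt: f"{prompt} -> {random.random():.4f}")

    # Run A draws three i.i.d. samples for the same prompt (three real model calls).
    run_a = [cache.get_response("run-A", "Solve task X") for _ in range(3)]

    # Run B asks for the same prompt twice and reuses run A's first two samples
    # instead of calling the model again, while still seeing i.i.d. samples.
    run_b = [cache.get_response("run-B", "Solve task X") for _ in range(2)]

    print(run_a)
    print(run_b)  # equals run_a[:2], at no extra model cost
```

Storing a list rather than a single cached value is what lets methods that rely on repeated sampling keep their statistical assumptions intact while still benefiting from caching; a plain key-value cache would silently return the same response over and over within one run.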
Key Results
- Averaged across all methods, tasks, and LLMs, Cache Saver reduces cost by ~25% and CO2 emissions by ~35%
- In practical ML scenarios such as benchmarking or ablation analysis, we achieve ~60% cost and carbon reduction
- Supports both local and online models without requiring changes to end-user application logic
The source code is available on GitHub.
