I am excited to announce that our paper "Cache Saver: A Modular Framework for Efficient, Affordable, and Reproducible LLM Inference" has been accepted at EMNLP 2025!
EMNLP (Empirical Methods in Natural Language Processing) is one of the premier venues for NLP research, taking place November 4-9, 2025 in Suzhou, China.
The Problem
Inference accounts for the majority of costs over the lifecycle of a large language model. While numerous LLM inference engines focus on low-level optimizations, non-intrusive client-side frameworks that perform high-level optimizations remain scarce.
Our Solution
Cache Saver is a modular, plug-and-play, asynchronous framework for high-level inference optimizations. Its key novelty is a namespace-aware, list-valued cache that preserves the statistical integrity of LLM responses: responses within a namespace remain i.i.d., and cached results also make experiments reproducible.
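To make the caching idea concrete, here is a minimal Python sketch of a namespace-aware, list-valued cache. This is an illustration of the concept only, not the actual Cache Saver API; the names (`ListCache`, `get_response`, `sample_llm`) and the bookkeeping details are my own assumptions. The idea: each prompt maps to a list of sampled responses; within a namespace, repeated requests consume distinct list entries, so the samples a namespace sees stay i.i.d.; across namespaces, previously generated samples can be reused, which is where the savings come from.

```python
import random
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

class ListCache:
    """Illustrative namespace-aware list-valued cache (not the real Cache Saver API).

    Each prompt maps to a *list* of sampled responses. Within a namespace,
    repeated requests for the same prompt consume distinct list entries, so
    the responses that namespace sees remain i.i.d. draws from the model.
    Across namespaces, previously generated samples are reused.
    """

    def __init__(self, sample_llm: Callable[[str], str]):
        self.sample_llm = sample_llm                               # queries the LLM once per call
        self.responses: Dict[str, List[str]] = defaultdict(list)   # prompt -> list of samples
        self.used: Dict[Tuple[str, str], int] = defaultdict(int)   # (namespace, prompt) -> samples consumed

    def get_response(self, namespace: str, prompt: str) -> str:
        """Return the next unused cached sample for this namespace, generating one only if needed."""
        idx = self.used[(namespace, prompt)]
        samples = self.responses[prompt]
        if idx >= len(samples):
            # This namespace has exhausted the cached samples: pay for one more model call.
            samples.append(self.sample_llm(prompt))
        self.used[(namespace, prompt)] += 1
        return samples[idx]

# Toy usage with a fake "LLM" that returns a random number as text.
if __name__ == "__main__":
    cache = ListCache(sample_llm=lambda prompt: f"{prompt} -> {random.random():.4f}")

    # Run A draws three i.i.d. samples for the same prompt (three real model calls).
    run_a = [cache.get_response("run-A", "Solve task X") for _ in range(3)]

    # Run B asks for the same prompt twice and reuses run A's first two samples
    # instead of calling the model again, while still seeing i.i.d. samples.
    run_b = [cache.get_response("run-B", "Solve task X") for _ in range(2)]

    print(run_a)
    print(run_b)  # equals run_a[:2], at no extra model cost
```

Storing a list rather than a single cached value is what lets methods that rely on repeated sampling keep their statistical assumptions intact while still benefiting from caching; a plain key-value cache would silently return the same response over and over within one run.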
Key Results
- Averaged across all methods, tasks, and LLMs, Cache Saver reduces cost by ~25% and CO2 emissions by ~35%
- In practical ML scenarios such as benchmarking or ablation analysis, we achieve ~60% cost and carbon reduction
- Supports both local and online models without requiring changes to end-user application logic
The source code is available on GitHub.
