What is LiquidCache? – Xiangpeng’s blog

Acknowledgments

This work is supported by funding from:
1. InfluxData, Bauplan, and SpiralDB.
2. The taxpayers of the State of Wisconsin and the federal government.

Your support for science is greatly appreciated!

LiquidCache is a caching layer that unifies the design goals of compute and storage¹.

¹ Check out our research paper (VLDB 2025) for more technical details.

It accelerates query performance without needing to leave Parquet.

It addresses this fundamental tension:

Storage systems optimize for ecosystem compatibility and long-term, stable, open governance; the industry has thus gravitated towards Parquet as the de facto columnar format.
Query engines want flexible data layouts, rapid evolution, and performance-first optimizations.

Instead of squeezing the last bits of performance from Parquet², or trying to create future-proof file formats³, LiquidCache addresses this problem through a new abstraction: the caching layer.

² Great paper on Parquet selection pushdown: Selection Pushdown in Column Stores using Bit Manipulation Instructions.

³ AnyBlox: A Framework for Self-Decoding Datasets, and F3: The Open-Source Data File Format for the Future

At a glance, LiquidCache is a distributed caching service: it supports all object storage backends (S3, GCS, Azure Blob Storage, etc.), and serves all kinds of applications (knowledge bases, dashboards, etc.) deployed on all kinds of compute (Kubernetes, Lambda, etc.).
Under the hood, LiquidCache caches Parquet as liquid data, which is ultra-optimized for compute pushdown, compressed execution, modern storage, and network‑efficient data transfer.

LiquidCache overview. It caches different object store sources and serves different analytical applications.

It is built on open standards: Parquet for data storage, DataFusion as the query engine, and Arrow Flight for data transfer. This makes LiquidCache highly composable – you can easily integrate it into your existing analytics stack.

Why LiquidCache?

We like S3

Simple durability: 11 nines (99.999999999%) of durability, so data loss is rarely a practical concern.
Simple scalability: virtually unlimited space and throughput.

But S3 is slow and expensive

≈100 ms first‑byte latency plus transfer latency; this quickly adds up when multiple round‑trips are needed to fetch data.⁴
Storage, request, and data‑transfer/egress costs; prices have remained largely unchanged for a decade even as underlying hardware has become ~20× cheaper.
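
To make the round-trip point concrete, here is a back-of-the-envelope sketch. It assumes the ≈100 ms first-byte latency quoted above; the three dependent requests (footer, page index, data pages), the 1 MB request size, and the 100 MB/s bandwidth are illustrative assumptions, not measurements.

```python
def sequential_fetch_latency_ms(
    round_trips: int,
    first_byte_ms: float = 100.0,
    bytes_per_trip: int = 0,
    bandwidth_mb_per_s: float = 100.0,
) -> float:
    """Total latency of dependent requests that cannot be overlapped."""
    # Time to move the payload of one request, in milliseconds.
    transfer_ms = bytes_per_trip / (bandwidth_mb_per_s * 1e6) * 1e3
    # Each request pays the first-byte latency plus its own transfer time.
    return round_trips * (first_byte_ms + transfer_ms)


# Hypothetical Parquet read: footer, then page index, then data pages.
cold_read_ms = sequential_fetch_latency_ms(round_trips=3)

# Same read with 1 MB transferred per request at an assumed 100 MB/s.
cold_read_with_transfer_ms = sequential_fetch_latency_ms(
    round_trips=3, bytes_per_trip=1_000_000
)

print(cold_read_ms, cold_read_with_transfer_ms)
```

Under these assumptions, three dependent requests cost about 300 ms in first-byte latency alone, before any bytes are decoded. A cache close to compute removes that per-request penalty.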

⁴ Exploiting Cloud Object Storage for High-Performance Analytics

S3 prices have barely changed for a decade despite ~20× reductions in underlying hardware costs. Credit: Andrew Lamb.

LiquidCache: foundation of diskless architectures

Caches are everywhere⁵: compute‑local caches (e.g., Snowflake/Databricks local NVMe, Spark host caches)⁶, shared‑nothing caches, and cache services⁷.
DLC trilemma: among durability, low latency, and low cost, you can only choose two⁸.

⁵ The Five-Minute Rule for the Cloud: Caching in Analytics Systems

⁶ DuckDB’s external file cache

⁷ ClickHouse’s distributed cache for S3

⁸ The Cloud Storage Triad: Latency, Cost, Durability

How LiquidCache Works

We like Parquet

All major query engines support it (DataFusion, Spark, Trino, DuckDB, Snowflake, BigQuery, and more).
It is battle‑tested and keeps evolving (e.g., page indexes, new encodings).
It is under open, stable governance (Apache Software Foundation), so your data is in good hands.

But sometimes we want more aggressive performance

There are better encodings and compression schemes out there.
Parquet is critical data infrastructure: it evolves cautiously to keep your data safe and stable—it can’t try new research today and abandon your data tomorrow.

LiquidCache: cache-only, pushdown-optimized data representation

LiquidCache uses state‑of‑the‑art encodings and compression chosen by the workload.⁹
Liquid data is invisible to the rest of the ecosystem: it is cache‑only. This means it can freely change its layout, adding or removing encodings without breaking any user code.
LiquidCache transparently, progressively, and selectively transcodes Parquet data to the liquid format.
Liquid data is designed for efficient pushdown to save both compute and network resources.
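
As a rough illustration of why an encoded representation helps pushdown, the sketch below uses plain dictionary encoding. This is a generic textbook technique, not the actual liquid format; the function names and sample column are hypothetical. The filter is evaluated once per distinct value, and rows are then selected by comparing small integer codes, without decoding the column.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Encode a column as (distinct values, one integer code per row)."""
    dictionary: List[str] = []
    code_of: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def filter_encoded(
    dictionary: Sequence[str],
    codes: Sequence[int],
    predicate: Callable[[str], bool],
) -> List[int]:
    """Return matching row indices without decoding the column."""
    # Evaluate the predicate once per distinct value, not once per row.
    matching = {code for code, value in enumerate(dictionary) if predicate(value)}
    # Row selection then only compares integer codes.
    return [row for row, code in enumerate(codes) if code in matching]


# Hypothetical column of region names.
regions = ["us", "eu", "us", "ap", "eu", "us"]
dictionary, codes = dictionary_encode(regions)
selected_rows = filter_encoded(dictionary, codes, lambda region: region == "us")
print(dictionary, codes, selected_rows)
```

When a filter like this runs inside the cache, only the selected rows need to cross the network, which is where the compute and network savings come from.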

⁹ The liquid format is heavily inspired by Vortex. We plan to support a Vortex backend in the future.

Without any changes to Parquet, LiquidCache takes care of the performance optimizations.

Conclusions

LiquidCache is the one‑stop shop for diskless, serverless, and pushdown‑native analytics.

It is built on open standards (Parquet, Arrow Flight, DataFusion) for easy integration and stable governance.

LiquidCache caches Parquet as liquid data, which is ultra-optimized for compute pushdown, compressed execution, modern storage, and network‑efficient data transfer.

Who are we?

LiquidCache started as a research project led by Xiangpeng Hao at UW‑Madison ADSL.
It was made possible by a research gift from InfluxData. One year later, SpiralDB and Bauplan also joined the journey.¹⁰
LiquidCache will remain a public‑benefit project in appreciation of the support from taxpayers, research gifts, and the open‑source community.

¹⁰ Support our research here!