karloscodes

Aggregate at write time

· 4 min read

ClickHouse is short for Clickstream Data Warehouse. It was built at Yandex to power Metrica, their web analytics platform, which grew from 200 million events a day to more than 25 billion by 2016. They open sourced it, it’s a marvel, and it has become the reflex: need analytics, need reports, store the raw stream and query it later.

Here’s the detail buried in their own history. Metrica ran for years on pre-built aggregates. Yandex moved to querying raw events because, at their scale, the combinations of reports people wanted exploded past what rollups could hold. ClickHouse exists because Yandex outgrew aggregation. The threshold is measured in billions of events a day.

At that scale, the warehouse makes sense. At your scale, most reports your product produces never need one. The move that replaces it: aggregate at write time. Keep one row per metric per hour, and when an event arrives, bump its counter instead of storing the event.

INSERT INTO hourly_views (hour, views)
VALUES (date_trunc('hour', now()), 1)
ON CONFLICT (hour) DO UPDATE SET views = views + 1;

That’s the whole trick. The report isn’t computed when someone opens it. It was built all along, one increment per event, by the same write that recorded it. The raw event is gone, and the answer is already sitting on disk.

That skips the part the pitch leans on hardest. The warehouse doesn’t just store your data, it promises a generic engine: pipe every raw event in, often through something like Kafka, and query it any way you want later. That power is the real product, and it’s the deeper trap. Generic querying is for when you don’t know what you’ll ask. You do. The reports your product has to produce, a usage summary for you or a stats page for your customers, were designed up front and rarely change. You’re not running a research lab, you’re rendering the same reports on a schedule. Renting a generic engine to answer fixed questions is paying twice, the pipeline that feeds it and the engine that serves it, for a number a pre-computed total hands you with one SELECT.

So compute the answer once, because you can. You know your questions.

Now count the rows. Aggregated by day, a year of any report is 365 rows. By hour, 8,760. Track ten reports at hourly grain and a full year is under 90,000 tiny rows, fewer than one busy afternoon of raw events.

Look at the size of it. A healthy small product does 10,000 tracked events a day. Stored raw, that’s 3.65 million rows a year, most of a gigabyte, growing forever, re-scanned every time a report runs. Aggregated hourly, it’s 8,760 rows. About a megabyte. A thousand times less storage. Four hundred times fewer rows. The same reports, instant. The work happens once, when the event lands, and never again.

I run my analytics this way: hourly rollups in SQLite. Six months of production traffic came to 14 megabytes. Every chart is a read of pre-computed rows, instant on a $5 VPS.

There’s a second prize. Raw events stored forever are pure liability: a bill with a breach radius. Aggregates are small by construction and anonymous by arithmetic. The privacy policy writes itself.

You are not Yandex. Be glad. They need the generic engine: their questions are unknowable and their data has nine zeros. Yours fit on a phone. So skip the machine they’re stuck running and put the saved time and money into the product.

That’s the decision, and it isn’t close. Aggregated reports are a competitive advantage, and you get it for being small and refusing to go generic. Simple wins.

Store answers, not questions.

New posts in your inbox. Unsubscribe anytime.

Or grab the RSS feed.