2026-04-22
I Built a Database Benchmarking Tool from Scratch. Here's What Happened.
I am finishing my internship at Cape, and I wanted to write down what the project actually was, how it went and what I took away from it. Partly because the project is worth documenting on its own, and partly because the experience taught me things I do not think I would have learned from another university assignment.
The problem
Cape works with an energy company that collects smart meter readings every 15 minutes from thousands of meters, each measured in two directions: consumption (E17) and injection (E18). That sounds manageable until you do the math. At around 1.9 million new rows per day, two years of history lands you somewhere around 1.4 billion rows. All of it was sitting in a PostgreSQL database hosted on Azure, and query performance was degrading badly.
The most visible sign was that Cape had to upgrade the database from 16 GB to 32 GB of RAM just weeks before we started, roughly doubling their database cost, because it was running out of memory during normal operations (running the queries behind their dashboard). The database was 174 GB on disk, and the dataset only keeps growing with every passing day.
The question we were brought in to answer was: should they migrate to a time-series database, and if so, which one?
Why we needed a custom tool
The original plan was straightforward enough: do a literature review, compare the four candidate databases (TimescaleDB, InfluxDB 2, QuestDB, ClickHouse) on a standard set of criteria and write a recommendation. That is what most research projects at university look like.
The problem with that approach is that generic benchmarks like TPC-H or TSBS would only tell us how these databases perform on generic workloads. To recommend the best possible option, we needed to know the real-world performance of these databases on Cape's actual data. Ingest speed also mattered little to them: the data arrives daily in large batches, and there is more than enough time to load it before it has to be available to the client. The queries that they actually run aggregate at multiple levels of a supplier hierarchy, join time-series data against a metadata table with time-validity windows, and need to answer both point queries on a single meter and full-dataset aggregations across all thousands of meters. None of the existing benchmarks test this exact use case. A real benchmark on real(ish) data would be far more useful to them than a glorified summary of existing research.
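To make that query shape concrete, here is a minimal sketch of one such hierarchy aggregation. Everything in it (readings, meter_hierarchy and all the column names) is an illustrative stand-in, not the client's actual schema:

```python
# Illustrative sketch only: readings, meter_hierarchy and the column
# names below are made-up stand-ins for the client's real schema.
import psycopg2

HIERARCHY_QUERY = """
SELECT h.supplier,
       date_trunc('day', r.ts) AS day,
       SUM(r.value)            AS total_consumption
FROM   readings r
JOIN   meter_hierarchy h
  ON   h.ean = r.ean
 AND   r.ts >= h.valid_from   -- the meter belonged to this supplier...
 AND   r.ts <  h.valid_to     -- ...only inside this validity window
WHERE  r.direction = 'E17'    -- consumption only
GROUP  BY h.supplier, day
ORDER  BY h.supplier, day;
"""

with psycopg2.connect("postgresql://localhost:5432/meters") as conn:
    with conn.cursor() as cur:
        cur.execute(HIERARCHY_QUERY)
        for supplier, day, total in cur.fetchall():
            print(supplier, day, total)
```

The validity-window join is what makes this workload awkward for pure time-series engines: it is not a tag lookup, it is a range condition against a second table.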
So I built a benchmarking tool from scratch. This was not part of the original plan; I took it on because I kept feeling that whatever we wrote in the recommendation section would not be something I could actually stand behind if it was based purely on the literature. The tool itself was never meant to be handed over to the client. It was purely a means to get results I personally trusted.
What I built
The tool has three parts.
Web UI. A FastAPI application running at localhost:8000. The entire frontend is a single embedded HTML endpoint with no build step and no static file server. I exposed it through a Cloudflare tunnel so the whole team could access it from home and from the office without opening firewall ports. The UI has tabs for running individual queries, visualising results as charts, running the full automated benchmark, and monitoring data loading.
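The pattern is small enough to sketch. A minimal version, with the /api/run path and the stubbed timing as my own illustrative assumptions, looks roughly like this:

```python
# Minimal sketch of the "no build step" frontend: the entire UI is one
# HTML string served from a single FastAPI endpoint. Paths are illustrative.
from fastapi import FastAPI
from fastapi.responses import HTMLResponse

app = FastAPI()

PAGE = """
<!doctype html>
<html>
  <body>
    <h1>Benchmark runner</h1>
    <button onclick="fetch('/api/run/1').then(r => r.json()).then(console.log)">
      Run query #1
    </button>
  </body>
</html>
"""

@app.get("/", response_class=HTMLResponse)
def index() -> str:
    return PAGE

@app.get("/api/run/{query_id}")
def run_query(query_id: int) -> dict:
    # The real tool dispatches the query to the selected database and
    # times it; stubbed out here.
    return {"query": query_id, "elapsed_ms": 0}
```

Run it with uvicorn (assuming the file is named main.py): `uvicorn main:app --port 8000`.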
CLI runner. A script that runs all 17 queries five times per database via the web application's own API, stores the results as JSON, and gives you a comparison table with best run, cold run and median times.
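A hedged sketch of that loop, with the endpoint path and the JSON layout as assumptions rather than the tool's real API:

```python
# Hypothetical sketch of the CLI runner: call the web app's own API for
# every (database, query) pair, five runs each, record cold/best/median.
import json
import statistics
import time
import urllib.request

DATABASES = ["postgresql", "timescaledb", "clickhouse", "questdb", "influxdb2"]
QUERY_IDS = range(1, 18)  # queries #1..#17
RUNS = 5

results = {}
for db in DATABASES:
    for qid in QUERY_IDS:
        times = []
        for _ in range(RUNS):
            start = time.perf_counter()
            urllib.request.urlopen(
                f"http://localhost:8000/api/run/{qid}?db={db}"
            ).read()
            times.append(time.perf_counter() - start)
        results[f"{db}/q{qid}"] = {
            "cold_s": times[0],                   # first access, no cache
            "best_s": min(times),                 # peak capability
            "median_s": statistics.median(times),
        }

with open("benchmark_results.json", "w") as f:
    json.dump(results, f, indent=2)
```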
Setup scripts. For converting the raw source data, generating the full dataset, creating the hierarchy table, and loading everything into each database.
The data pipeline is worth explaining. The client gave us a 4 GB JSON export of their production data. It crashed my text editor the first time I opened it and hit Ctrl+A to check how many characters and lines it contained. Since the file was unworkable by hand, I wrote a script to convert it into Parquet, a format built to handle data at this scale. That 4 GB of JSON became a 28 MB Parquet seed file covering 15 days of energy readings. To simulate two years of production history, a second script replays the seed 48 times with shifted timestamps and plus/minus 15% random variance on each block, producing 1.3 GB of Parquet files and around 1.4 billion rows in total. All five databases were loaded with the same dataset independently, and each load was verified via row counts.
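Conceptually the replay is a short loop. A sketch, assuming the seed has a timestamp column ts and a numeric value column (the real column names and the exact variance scheme may differ):

```python
# Sketch of the replay step: shift the 15-day seed forward 48 times and
# scale each block by a random factor so the copies are not identical.
import numpy as np
import pandas as pd

SEED_DAYS = 15
REPLAYS = 48        # 48 x 15 days is roughly two years
VARIANCE = 0.15     # plus/minus 15% per block

seed = pd.read_parquet("seed.parquet")
rng = np.random.default_rng(42)

for i in range(REPLAYS):
    block = seed.copy()
    # Shift the whole 15-day block forward in time...
    block["ts"] = block["ts"] + pd.Timedelta(days=i * SEED_DAYS)
    # ...and scale it so seasonal copies are not byte-identical.
    block["value"] = block["value"] * rng.uniform(1 - VARIANCE, 1 + VARIANCE)
    block.to_parquet(f"generated/block_{i:02d}.parquet", index=False)
```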
One technical problem that deserved its own post
The one I keep coming back to is InfluxDB 3.
We initially wanted to include it. InfluxDB 3 is a complete rewrite of InfluxDB in Rust; it uses Apache Arrow and Parquet internally and supports SQL. On paper it looked like a more modern InfluxDB 2 with proper SQL support. In practice it was inconsistent and caused so many issues that in the end I decided to exclude it from the benchmarks.
First, I hit a file limit: the free Core edition caps how many Parquet files a single query can scan, and after loading 1.4 billion rows there were 432 of them. Every query failed. So I switched to the Enterprise trial, which lifts the cap. The trial only lasts 30 days, so there was now also a hard deadline for getting everything to work.
Then the ingest crashed at 266 million rows: OOM kill, container dead, start over. It happened multiple times. After repeatedly reconfiguring memory limits, WAL flush intervals, compaction parameters and a Parquet cache flag, I eventually got the load through. It ended up running at 20,000 rows per second instead of the 95,000 it had managed before, which meant the ingest had to run overnight.
I ran a verification pass, one run per query. Some errored, but enough passed that I thought I might actually be able to benchmark it fully. The next day, when I planned to run the full benchmark, I ran `docker compose up -d --force-recreate` to apply a config change. That command created a new container instance. When a new InfluxDB 3 instance starts with an existing node-id, it writes a blank compaction summary at the active pointer position. It silently overwrote 1,882 compaction cycles, and the 3,337 already-compacted Parquet files became permanently invisible to the query engine. There was no warning and no recovery path. The last resort was a full data reload, again. Maybe there was an update, maybe a config change, but the benchmark never ran successfully after that. It may well have been my fault, but I know I tried everything I could to fix it. With the deadline near, excluding InfluxDB 3 was the right call. I had also noticed that InfluxDB 2 and InfluxDB 3 ran at roughly equivalent speeds, which made skipping v3 a little less painful.
I believe the root cause may be a design flaw: the server accepts any node-id on startup and unconditionally reinitialises the compactor index. There is no guard that says "this node-id was last used by a different instance, are you sure?" There is no repair tool in the CLI, and the documentation does not mention the behaviour. The problem only surfaces once queries start OOM-crashing on the simplest possible SELECT.
What the numbers actually said
I ran each of the 17 queries five times per database, recording the best run (peak capability) and the cold run (first access, no cache). All databases were given 20 GB of RAM in isolated Docker containers on the same physical machine and were benchmarked one at a time.
PostgreSQL is fast on point lookups thanks to its B-tree index, but full-dataset aggregations are brutal at this scale. With 174 GB on disk and only 20 GB of RAM, almost nothing fits in the page cache.
TimescaleDB compresses 174 GB down to 8 GB, a 21x ratio, which makes full-scan aggregations faster than plain PostgreSQL. But decompression overhead flips the result on point queries and on any aggregation that forces simultaneous chunk decompression, where it ends up slower.
ClickHouse is the most consistent performer on analytical queries. It dominates the hierarchy join tiers and is competitive across range and full-scan queries. Its only real weakness is point queries, where the lack of a per-EAN index forces a full scan every time. Those queries still complete in under a second, but that is many times slower than the competition.
QuestDB is the fastest SQL database on point queries (2ms) and posts the fastest Q5 time of any database (887ms), but it degrades on the all-time hierarchy joins (96 to 101 seconds).
InfluxDB 2 is the fastest overall on Tiers 1 through 3 but is effectively unusable for join queries. Flux's join implementation materialises both sides of the join in memory, and at 1.4 billion rows that took around 3,000 seconds per query.
Best run
Tier 1: Point queries (single meter, time-windowed)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #1 Single meter: 1-day hourly | 6ms | 355ms | 936ms | 2ms | 5ms |
| #2 Single meter: 1-month daily | 8ms | 351ms | 962ms | 8ms | 6ms |
| #3 Single meter: 1-year monthly | 27ms | 374ms | 949ms | 67ms | 9ms |
Tier 2: Range aggregations (all meters, time-bucketed)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #4 Monthly E17 vs E18 balance | 102.73s | 57.04s | 6.97s | 12.41s | 2.96s |
| #5 Hourly aggregation (first 3 months) | 33.71s | 45.06s | 2.66s | 887ms | 27.57s |
| #6 Peak consumption hours | 77.02s | 52.42s | 3.30s | 10.01s | 2.48s |
| #7 Daily aggregation by direction | 100.05s | 55.78s | 7.38s | 15.02s | 21.72s |
Tier 3: Full-scan aggregations (no time filter)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #8 Top 20 meters by total consumption | 61.80s | 4.33s | 3.47s | 4.53s | 930ms |
| #9 Net energy balance per meter | 74.08s | 65.62s | 3.85s | 13.31s | 2.33s |
| #10 Prosumer detection (E18/E17 ratio) | 74.31s | 65.81s | 3.84s | 14.23s | 2.28s |
| #11 Active meters per day | 379.95s | 552.85s | 6.62s | 11.94s | 15.32s |
Tier 4: Hierarchy join queries (with time window)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #12 Supplier total daily | 158.26s | 160.26s | 13.33s | 49.31s | 3296.26s* |
| #13 By category (PRF/SMA) daily | 167.55s | 174.53s | 13.44s | 46.79s | 3296.90s* |
| #14 Sub-category monthly (PRF) | 130.27s | 127.46s | 11.24s | 35.44s | 2928.04s* |
Tier 5: Hierarchy join queries (all time, no window)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #15 Supplier total (all time) | 119.38s | 109.27s | 15.40s | 97.93s | 3243.34s* |
| #16 By category all time (PRF/SMA) | 132.16s | 128.63s | 15.37s | 101.27s | 3258.31s* |
| #17 Sub-category all time (PRF only) | 106.55s | 97.26s | 15.08s | 95.83s | 3240.96s* |
Cold run
Tier 1: Point queries
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #1 Single meter: 1-day hourly | 23ms | 409ms | 936ms | 9ms | 11ms |
| #2 Single meter: 1-month daily | 16ms | 366ms | 968ms | 37ms | 19ms |
| #3 Single meter: 1-year monthly | 90ms | 386ms | 949ms | 4.22s | 91ms |
Tier 2: Range aggregations
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #4 Monthly E17 vs E18 balance | 120.13s | 60.94s | 6.97s | 12.41s | 2.99s |
| #5 Hourly aggregation (first 3 months) | 57.55s | 45.30s | 2.71s | 2.20s | 27.81s |
| #6 Peak consumption hours | 78.07s | 52.59s | 3.33s | 11.42s | 2.65s |
| #7 Daily aggregation by direction | 100.05s | 56.64s | 7.38s | 15.93s | 22.16s |
Tier 3: Full-scan aggregations
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #8 Top 20 meters by total consumption | 104.60s | 4.35s | 3.51s | 9.55s | 947ms |
| #9 Net energy balance per meter | 75.43s | 65.62s | 3.87s | 13.31s | 2.44s |
| #10 Prosumer detection (E18/E17 ratio) | 74.44s | 66.27s | 3.87s | 14.79s | 2.38s |
| #11 Active meters per day | 389.15s | 552.85s | 6.66s | 11.94s | 15.47s |
Tier 4: Hierarchy join queries (with time window)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #12 Supplier total daily | 168.36s | 160.59s | 13.47s | 50.36s | 3296.26s* |
| #13 By category (PRF/SMA) daily | 167.55s | 176.31s | 13.52s | 46.79s | 3296.90s* |
| #14 Sub-category monthly (PRF) | 130.93s | 127.80s | 11.24s | 35.70s | 2928.90s* |
Tier 5: Hierarchy join queries (all time, no window)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #15 Supplier total (all time) | 168.36s | 160.59s | 13.47s | 50.36s | 3243.34s* |
| #16 By category all time (PRF/SMA) | 167.55s | 176.31s | 13.52s | 46.79s | 3258.31s* |
| #17 Sub-category all time (PRF only) | 130.93s | 127.80s | 11.24s | 35.70s | 3240.96s* |
* Tier 4 and 5 InfluxDB 2 queries were run once due to extreme execution time; the same result appears in both tables.
Why we didn't recommend the fastest option
InfluxDB 2 had the best raw performance on most individual query types. We did not recommend it as the primary migration target.
The client's dashboard is built around hierarchy aggregations. They need to aggregate meter data by supplier, category and sub-category, joined against a validity-window table that tracks which meters belong to which supplier at any given point in time. InfluxDB 2 cannot do that efficiently because of Flux's join implementation. Recommending the fastest option when that option is fundamentally incapable of the most important query type would have been the wrong call.
The final recommendation was ClickHouse for teams where analytical throughput is the primary concern, and TimescaleDB for teams who want the lowest migration risk. TimescaleDB is a PostgreSQL extension: same driver, same SQL, same Azure managed service. Continuous aggregates can pre-compute the slow queries down to millisecond response times for known query patterns, which makes the decompression overhead a non-issue for anything that goes through a materialized view.
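To give a sense of what that looks like, here is a hedged sketch of a continuous aggregate for a daily per-meter rollup, using illustrative schema names rather than the real ones:

```python
# Hedged sketch: a TimescaleDB continuous aggregate pre-computing daily
# per-meter totals, plus a refresh policy. Table/column names illustrative.
import psycopg2

DDL = """
CREATE MATERIALIZED VIEW daily_per_meter
WITH (timescaledb.continuous) AS
SELECT ean,
       direction,
       time_bucket('1 day', ts) AS day,
       SUM(value)               AS total
FROM readings
GROUP BY ean, direction, day;
"""

POLICY = """
SELECT add_continuous_aggregate_policy('daily_per_meter',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 day');
"""

conn = psycopg2.connect("postgresql://localhost:5432/meters")
conn.autocommit = True  # this DDL cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(POLICY)
conn.close()
```

Dashboard queries then read from the materialized view instead of the raw hypertable, which is where the millisecond response times come from.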
What the internship taught me
The benchmarking tool taught me that being technically capable is only part of the job. The more important part is making sure you understand what you are actually trying to produce and for whom.
The other thing that stuck with me is how much the real stakes change how seriously you take the work. University assignments have fixed scope and clear deadlines. At Cape, the scope kept shifting as we learned more about the client's actual schema, their infrastructure constraints and their cost expectations. It also meant that the recommendation we produced was going to be acted on by a real team making a real infrastructure decision. That is a different kind of pressure than getting a grade.
I have also now learnt in practice that the gap between what looks nearly done and what is actually done is almost always bigger than it appears. The first 90 percent of the work takes 90 percent of the time. The remaining 10 percent takes another 90 percent.
In the end I feel like I am leaving the internship with a clearer sense of what kind of work I want to do: work where the output matters beyond the submission date and where I am surrounded by people who can tell me what I am getting wrong.
The queries and the benchmarking tool used for the project can be found here.