2026-04-22
I Built a Database Benchmarking Tool from Scratch. Here's What Happened.
I am finishing my internship at Cape, and I wanted to write down what the project actually was, how it went and what I took away from it. Partly because the project is worth documenting on its own, and partly because the experience taught me things I do not think I would have learned from another university assignment.
The problem
Cape works with an energy company that collects smart meter readings every 15 minutes from thousands of meters, each measured in two directions: consumption (E17) and injection (E18). That sounds manageable until you do the math. At around 1.9 million new rows per day, two years of history lands you somewhere around 1.4 billion rows. All of it was sitting in a PostgreSQL database hosted on Azure, and query performance was degrading badly.
The most visible sign was that Cape had to upgrade the database from 16 GB to 32 GB of RAM just weeks before we started, roughly doubling their database cost, because it was running out of memory during normal operations (running the queries behind their dashboard). The database was 174 GB on disk, and the dataset only keeps growing with every passing day.
The question we were brought in to answer was: should they migrate to a time-series database, and if so, which one?
Why we needed a custom tool
The original plan was straightforward enough: do a literature review, compare the four candidate databases (TimescaleDB, InfluxDB 2, QuestDB, ClickHouse) on a standard set of criteria and write a recommendation. That is what most research projects at university look like.
The problem with that approach is that generic benchmarks like TPC-H or TSBS would only tell us how these databases perform on generic workloads. To recommend the best possible option, we needed to know the real-world performance of these databases on Cape's actual data. Ingest speed also mattered little to them: the data arrives daily in large batches, and there is more than enough time to load it before it has to be available to the client. The queries that they actually run aggregate at multiple levels of a supplier hierarchy, join time-series data against a metadata table with time-validity windows, and need to answer both point queries on a single meter and full-dataset aggregations across all thousands of meters. None of the existing benchmarks test this exact use case. A real benchmark on real(ish) data would be far more useful to them than a glorified summary of existing research.
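To make that query shape concrete, here is a minimal sketch of one such hierarchy aggregation. Everything in it (readings, meter_hierarchy and all the column names) is an illustrative stand-in, not the client's actual schema:

```python
# Illustrative sketch only: readings, meter_hierarchy and the column
# names below are made-up stand-ins for the client's real schema.
import psycopg2

HIERARCHY_QUERY = """
SELECT h.supplier,
       date_trunc('day', r.ts) AS day,
       SUM(r.value)            AS total_consumption
FROM   readings r
JOIN   meter_hierarchy h
  ON   h.ean = r.ean
 AND   r.ts >= h.valid_from   -- the meter belonged to this supplier...
 AND   r.ts <  h.valid_to     -- ...only inside this validity window
WHERE  r.direction = 'E17'    -- consumption only
GROUP  BY h.supplier, day
ORDER  BY h.supplier, day;
"""

with psycopg2.connect("postgresql://localhost:5432/meters") as conn:
    with conn.cursor() as cur:
        cur.execute(HIERARCHY_QUERY)
        for supplier, day, total in cur.fetchall():
            print(supplier, day, total)
```

The validity-window join is what makes this workload awkward for pure time-series engines: it is not a tag lookup, it is a range condition against a second table.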
So I built a benchmarking tool from scratch. This was not part of the original plan; I took it on because I kept feeling that whatever we wrote in the recommendation section would not be something I could actually stand behind if it was based purely on the literature. The tool itself was never meant to be handed over to the client. It was purely a means to get results I personally trusted.
What I built
The tool has three parts.
Web UI. A FastAPI application running at localhost:8000. The entire frontend is a single embedded HTML endpoint with no build step and no static file server. I exposed it through a Cloudflare tunnel so the whole team could access it from home and from the office without opening firewall ports. The UI has tabs for running individual queries, visualising results as charts, running the full automated benchmark, and monitoring data loading.
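The pattern is small enough to sketch. A minimal version, with the /api/run path and the stubbed timing as my own illustrative assumptions, looks roughly like this:

```python
# Minimal sketch of the "no build step" frontend: the entire UI is one
# HTML string served from a single FastAPI endpoint. Paths are illustrative.
from fastapi import FastAPI
from fastapi.responses import HTMLResponse

app = FastAPI()

PAGE = """
<!doctype html>
<html>
  <body>
    <h1>Benchmark runner</h1>
    <button onclick="fetch('/api/run/1').then(r => r.json()).then(console.log)">
      Run query #1
    </button>
  </body>
</html>
"""

@app.get("/", response_class=HTMLResponse)
def index() -> str:
    return PAGE

@app.get("/api/run/{query_id}")
def run_query(query_id: int) -> dict:
    # The real tool dispatches the query to the selected database and
    # times it; stubbed out here.
    return {"query": query_id, "elapsed_ms": 0}
```

Run it with uvicorn (assuming the file is named main.py): `uvicorn main:app --port 8000`.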
CLI runner. A script that runs all 17 queries five times per database via the web application's own API, stores the results as JSON, and gives you a comparison table with best run, cold run and median times.
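A hedged sketch of that loop, with the endpoint path and the JSON layout as assumptions rather than the tool's real API:

```python
# Hypothetical sketch of the CLI runner: call the web app's own API for
# every (database, query) pair, five runs each, record cold/best/median.
import json
import statistics
import time
import urllib.request

DATABASES = ["postgresql", "timescaledb", "clickhouse", "questdb", "influxdb2"]
QUERY_IDS = range(1, 18)  # queries #1..#17
RUNS = 5

results = {}
for db in DATABASES:
    for qid in QUERY_IDS:
        times = []
        for _ in range(RUNS):
            start = time.perf_counter()
            urllib.request.urlopen(
                f"http://localhost:8000/api/run/{qid}?db={db}"
            ).read()
            times.append(time.perf_counter() - start)
        results[f"{db}/q{qid}"] = {
            "cold_s": times[0],                   # first access, no cache
            "best_s": min(times),                 # peak capability
            "median_s": statistics.median(times),
        }

with open("benchmark_results.json", "w") as f:
    json.dump(results, f, indent=2)
```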
Setup scripts. For converting the raw source data, generating the full dataset, creating the hierarchy table, and loading everything into each database.
The data pipeline is worth explaining. The client gave us a 4 GB JSON export of their production data. It crashed my text editor the first time I opened it and hit Ctrl+A to check how many characters and lines it contained. Since the file was unworkable by hand, I wrote a script to convert it into Parquet, a format built to handle data at this scale. That 4 GB of JSON became a 28 MB Parquet seed file covering 15 days of energy readings. To simulate two years of production history, a second script replays the seed 48 times with shifted timestamps and plus/minus 15% random variance on each block, producing 1.3 GB of Parquet files and around 1.4 billion rows in total. All five databases were loaded with the same dataset independently, and each load was verified via row counts.
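Conceptually the replay is a short loop. A sketch, assuming the seed has a timestamp column ts and a numeric value column (the real column names and the exact variance scheme may differ):

```python
# Sketch of the replay step: shift the 15-day seed forward 48 times and
# scale each block by a random factor so the copies are not identical.
import numpy as np
import pandas as pd

SEED_DAYS = 15
REPLAYS = 48        # 48 x 15 days is roughly two years
VARIANCE = 0.15     # plus/minus 15% per block

seed = pd.read_parquet("seed.parquet")
rng = np.random.default_rng(42)

for i in range(REPLAYS):
    block = seed.copy()
    # Shift the whole 15-day block forward in time...
    block["ts"] = block["ts"] + pd.Timedelta(days=i * SEED_DAYS)
    # ...and scale it so seasonal copies are not byte-identical.
    block["value"] = block["value"] * rng.uniform(1 - VARIANCE, 1 + VARIANCE)
    block.to_parquet(f"generated/block_{i:02d}.parquet", index=False)
```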
One technical problem that deserved its own post
The one I keep coming back to is InfluxDB 3.
We initially wanted to include it. InfluxDB 3 is a complete rewrite of InfluxDB in Rust; it uses Apache Arrow and Parquet internally and supports SQL. On paper it looked like a more modern InfluxDB 2 with proper SQL support. In practice it was inconsistent and caused so many issues that in the end I decided to exclude it from the benchmarks.
First, I hit a file limit: the free Core edition caps how many Parquet files a single query can scan, and after loading 1.4 billion rows there were 432 of them. Every query failed. So I switched to the Enterprise trial, which lifts the cap. The trial only lasts 30 days, so there was now also a hard deadline for getting everything to work.
Then the ingest crashed at 266 million rows: OOM kill, container dead, start over. It happened multiple times. After repeatedly reconfiguring memory limits, WAL flush intervals, compaction parameters and a Parquet cache flag, I eventually got the load through. It ended up running at 20,000 rows per second instead of the 95,000 it had managed before, which meant the ingest had to run overnight.
I ran a verification pass, one run per query. Some errored, but enough passed that I thought I might actually be able to benchmark it fully. The next day, when I planned to run the full benchmark, I ran `docker compose up -d --force-recreate` to apply a config change. That command created a new container instance. When a new InfluxDB 3 instance starts with an existing node-id, it writes a blank compaction summary at the active pointer position. It silently overwrote 1,882 compaction cycles, and the 3,337 already-compacted Parquet files became permanently invisible to the query engine. There was no warning and no recovery path. The last resort was a full data reload, again. Maybe there was an update, maybe a config change, but the benchmark never ran successfully after that. It may well have been my fault, but I know I tried everything I could to fix it. With the deadline near, excluding InfluxDB 3 was the right call. I had also noticed that InfluxDB 2 and InfluxDB 3 ran at roughly equivalent speeds, which made skipping v3 a little less painful.
I believe the root cause may be a design flaw: the server accepts any node-id on startup and unconditionally reinitialises the compactor index. There is no guard that says "this node-id was last used by a different instance, are you sure?" There is no repair tool in the CLI, and the documentation does not mention the behaviour. The problem only surfaces once queries start OOM-crashing on the simplest possible SELECT.
What the numbers actually said
I ran each of the 17 queries five times per database, recording the best run (peak capability) and the cold run (first access, no cache). All databases were given 20 GB of RAM in isolated Docker containers on the same physical machine and were benchmarked one at a time.
PostgreSQL is fast on point lookups thanks to its B-tree index, but full-dataset aggregations are brutal at this scale. With 174 GB on disk and only 20 GB of RAM, almost nothing fits in the page cache.
TimescaleDB compresses 174 GB down to 8 GB, a 21x ratio, which makes full-scan aggregations faster than plain PostgreSQL. But decompression overhead flips the result on point queries and on any aggregation that forces simultaneous chunk decompression, where it ends up slower.
ClickHouse is the most consistent performer on analytical queries. It dominates the hierarchy join tiers and is competitive across range and full-scan queries. Its only real weakness is point queries, where the lack of a per-EAN index forces a full scan every time. Those queries still complete in under a second, but that is many times slower than the competition.
QuestDB is the fastest SQL database on point queries (2ms) and posts the fastest Q5 time of any database (887ms), but it degrades on the all-time hierarchy joins (96 to 101 seconds).
InfluxDB 2 is the fastest overall on Tiers 1 through 3 but is effectively unusable for join queries. Flux's join implementation materialises both sides of the join in memory, and at 1.4 billion rows that took around 3,000 seconds per query.
Best run
Tier 1: Point queries (single meter, time-windowed)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #1 Single meter: 1-day hourly | 6ms | 355ms | 936ms | 2ms | 5ms |
| #2 Single meter: 1-month daily | 8ms | 351ms | 962ms | 8ms | 6ms |
| #3 Single meter: 1-year monthly | 27ms | 374ms | 949ms | 67ms | 9ms |
Tier 2: Range aggregations (all meters, time-bucketed)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #4 Monthly E17 vs E18 balance | 102.73s | 57.04s | 6.97s | 12.41s | 2.96s |
| #5 Hourly aggregation (first 3 months) | 33.71s | 45.06s | 2.66s | 887ms | 27.57s |
| #6 Peak consumption hours | 77.02s | 52.42s | 3.30s | 10.01s | 2.48s |
| #7 Daily aggregation by direction | 100.05s | 55.78s | 7.38s | 15.02s | 21.72s |
Tier 3: Full-scan aggregations (no time filter)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #8 Top 20 meters by total consumption | 61.80s | 4.33s | 3.47s | 4.53s | 930ms |
| #9 Net energy balance per meter | 74.08s | 65.62s | 3.85s | 13.31s | 2.33s |
| #10 Prosumer detection (E18/E17 ratio) | 74.31s | 65.81s | 3.84s | 14.23s | 2.28s |
| #11 Active meters per day | 379.95s | 552.85s | 6.62s | 11.94s | 15.32s |
Tier 4: Hierarchy join queries (with time window)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #12 Supplier total daily | 158.26s | 160.26s | 13.33s | 49.31s | 3296.26s* |
| #13 By category (PRF/SMA) daily | 167.55s | 174.53s | 13.44s | 46.79s | 3296.90s* |
| #14 Sub-category monthly (PRF) | 130.27s | 127.46s | 11.24s | 35.44s | 2928.04s* |
Tier 5: Hierarchy join queries (all time, no window)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #15 Supplier total (all time) | 119.38s | 109.27s | 15.40s | 97.93s | 3243.34s* |
| #16 By category all time (PRF/SMA) | 132.16s | 128.63s | 15.37s | 101.27s | 3258.31s* |
| #17 Sub-category all time (PRF only) | 106.55s | 97.26s | 15.08s | 95.83s | 3240.96s* |
Cold run
Tier 1: Point queries
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #1 Single meter: 1-day hourly | 23ms | 409ms | 936ms | 9ms | 11ms |
| #2 Single meter: 1-month daily | 16ms | 366ms | 968ms | 37ms | 19ms |
| #3 Single meter: 1-year monthly | 90ms | 386ms | 949ms | 4.22s | 91ms |
Tier 2: Range aggregations
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #4 Monthly E17 vs E18 balance | 120.13s | 60.94s | 6.97s | 12.41s | 2.99s |
| #5 Hourly aggregation (first 3 months) | 57.55s | 45.30s | 2.71s | 2.20s | 27.81s |
| #6 Peak consumption hours | 78.07s | 52.59s | 3.33s | 11.42s | 2.65s |
| #7 Daily aggregation by direction | 100.05s | 56.64s | 7.38s | 15.93s | 22.16s |
Tier 3: Full-scan aggregations
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #8 Top 20 meters by total consumption | 104.60s | 4.35s | 3.51s | 9.55s | 947ms |
| #9 Net energy balance per meter | 75.43s | 65.62s | 3.87s | 13.31s | 2.44s |
| #10 Prosumer detection (E18/E17 ratio) | 74.44s | 66.27s | 3.87s | 14.79s | 2.38s |
| #11 Active meters per day | 389.15s | 552.85s | 6.66s | 11.94s | 15.47s |
Tier 4: Hierarchy join queries (with time window)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #12 Supplier total daily | 168.36s | 160.59s | 13.47s | 50.36s | 3296.26s* |
| #13 By category (PRF/SMA) daily | 167.55s | 176.31s | 13.52s | 46.79s | 3296.90s* |
| #14 Sub-category monthly (PRF) | 130.93s | 127.80s | 11.24s | 35.70s | 2928.90s* |
Tier 5: Hierarchy join queries (all time, no window)
| Query | PostgreSQL | TimescaleDB | ClickHouse | QuestDB | InfluxDB 2 |
|---|---|---|---|---|---|
| #15 Supplier total (all time) | 168.36s | 160.59s | 13.47s | 50.36s | 3243.34s* |
| #16 By category all time (PRF/SMA) | 167.55s | 176.31s | 13.52s | 46.79s | 3258.31s* |
| #17 Sub-category all time (PRF only) | 130.93s | 127.80s | 11.24s | 35.70s | 3240.96s* |
* Tier 4 and 5 InfluxDB 2 queries were run once due to extreme execution time; the same result appears in both tables.
Why we didn't recommend the fastest option
InfluxDB 2 had the best raw performance on most individual query types. We did not recommend it as the primary migration target.
The client's dashboard is built around hierarchy aggregations. They need to aggregate meter data by supplier, category and sub-category, joined against a validity-window table that tracks which meters belong to which supplier at any given point in time. InfluxDB 2 cannot do that efficiently because of Flux's join implementation. Recommending the fastest option when that option is fundamentally incapable of the most important query type would have been the wrong call.
The final recommendation was ClickHouse for teams where analytical throughput is the primary concern, and TimescaleDB for teams who want the lowest migration risk. TimescaleDB is a PostgreSQL extension: same driver, same SQL, same Azure managed service. Continuous aggregates can pre-compute the slow queries down to millisecond response times for known query patterns, which makes the decompression overhead a non-issue for anything that goes through a materialized view.
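To give a sense of what that looks like, here is a hedged sketch of a continuous aggregate for a daily per-meter rollup, using illustrative schema names rather than the real ones:

```python
# Hedged sketch: a TimescaleDB continuous aggregate pre-computing daily
# per-meter totals, plus a refresh policy. Table/column names illustrative.
import psycopg2

DDL = """
CREATE MATERIALIZED VIEW daily_per_meter
WITH (timescaledb.continuous) AS
SELECT ean,
       direction,
       time_bucket('1 day', ts) AS day,
       SUM(value)               AS total
FROM readings
GROUP BY ean, direction, day;
"""

POLICY = """
SELECT add_continuous_aggregate_policy('daily_per_meter',
    start_offset      => INTERVAL '3 days',
    end_offset        => INTERVAL '1 hour',
    schedule_interval => INTERVAL '1 day');
"""

conn = psycopg2.connect("postgresql://localhost:5432/meters")
conn.autocommit = True  # this DDL cannot run inside a transaction block
with conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(POLICY)
conn.close()
```

Dashboard queries then read from the materialized view instead of the raw hypertable, which is where the millisecond response times come from.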
What the internship taught me
The benchmarking tool taught me that being technically capable is only part of the job. The more important part is making sure you understand what you are actually trying to produce and for whom.
The other thing that stuck with me is how much the real stakes change how seriously you take the work. University assignments have fixed scope and clear deadlines. At Cape, the scope kept shifting as we learned more about the client's actual schema, their infrastructure constraints and their cost expectations. It also meant that the recommendation we produced was going to be acted on by a real team making a real infrastructure decision. That is a different kind of pressure than getting a grade.
I have also now learnt in practice that the gap between what looks nearly done and what is actually done is almost always bigger than it appears. The first 90 percent of the work takes 90 percent of the time. The remaining 10 percent takes another 90 percent.
In the end I feel like I am leaving the internship with a clearer sense of what kind of work I want to do: work where the output matters beyond the submission date and where I am surrounded by people who can tell me what I am getting wrong.
The queries and the benchmarking tool used for the project can be found here.