July 7, 2026

Kafka

Apache Kafka performance #1 - linger.ms

This is the first in an ongoing ad-hoc series of posts on Apache Kafka performance. I have no general direction; I’ll just share interesting insights from the performance testing I do on Apache Kafka.

Recently I was curious to see if there was any general performance improvement since Kafka 3.X. So I ran a suite of benchmarks with Dimster against 3.7.2 and 4.3.0. I saw two common patterns:

Pattern 1: Low-load benchmarks showed that end-to-end latency was higher with Kafka 4.3 than with 3.7.2. The following is a 45-minute no-record-key workload of 5,000 records/s, 20 topics (120 partitions), fan-out 2 (240 consumers), full TLS, on 3 brokers each allocated 8 SMT CPUs in k8s (on my Threadripper 9980X).

Jack Vanlightly

July 1, 2026

Data

1BRC on a Threadripper 9980X

Yesterday I published some benchmarks of Hardwood 1.0 on my Threadripper. Someone suggested I run the One Billion Row Challenge too, to see how it does, so here it is!

Gunnar Morling ran the original benchmarks on an EPYC 7502P, Zen 2, 32 cores with 128 GB of RAM. The official challenge was on 8 cores (sequentially chosen) plus a bonus of all 32 cores.

I chose to run the benchmark using 9 contenders from the published 8 and 32 core results. The 9 contenders I ran were thomaswue, artsiomkorzun, jerrinot, serkan-ozal, abeobk, stephenvonworley, royvanrijn, mtopolnik, yavuztas.

Jack Vanlightly

June 30, 2026

Data

Benchmarking Hardwood 1.0 on a Threadripper 9980X

Hardwood is a minimal-dependency Java library for reading Parquet files. It currently has row-reader and columnar-reader APIs, with Parquet writing planned for the future.

Gunnar Morling, Hardwood’s author, published some initial benchmarks in the v1.0 announcement, comparing Hardwood’s row and column readers against Parquet Java. Those benchmarks measured read speed against already-downloaded Parquet files.

Gunnar’s benchmarks ran on an m7i.2xlarge, with 8 vCPUs / 4 physical cores. Each test used three variants:

Hardwood with decoder threads = Runtime.getRuntime().availableProcessors(), which equals 8
Hardwood pinned to one CPU thread with taskset
Parquet Java, single-threaded

I was curious how the same benchmarks would look on my Threadripper 9980X: 64 cores / 128 threads, with 256 GB ECC DDR5. I modified Gunnar’s benchmark code to also test Hardwood with fixed decoder-thread counts: 1, 4, and 8.

Jack Vanlightly

June 24, 2026

Kafka

Kafka Share Groups - Pathological fetch waits with record_limit

In this post we’re going to see how share.acquire.mode=record_limit combined with fewer consumers than partitions and various cases of “partition skew” can result in subpar performance with share groups.

I stumbled on these issues when running large sets of dimensional tests with Dimster’s explore-limits mode, which finds the highest sustainable throughput that stays within a target end-to-end latency. There was a specific subset of the tests that explore-limits mode would consistently fail to complete, and they all happened to use record_limit with a consumer count lower than the partition count. In this post, we’ll understand why Dimster had such a hard time with this combination.

Jack Vanlightly

June 22, 2026

Distributed Systems, Data

Can We Agree on a Storage/Workload Architecture Taxonomy?

The lines between transactional systems, analytical systems, hybrid systems, and shared storage architectures are getting blurry. This post proposes a small taxonomy for describing the different ways systems, workloads, storage tiers, visibility, and durable copies relate to each other.

OLTP, OLAP, HTAP, and now LTAP.

We can think of the first two as workload types, each with specialized query engines and storage systems to support it. OLTP systems, such as RDBMSs like Postgres and MySQL, use row-based storage engines. OLAP systems, such as Clickhouse, cloud data warehouses, and the lakehouse, use column-based storage.

HTAP is a hybrid workload system: one system serving both transactional and analytical workloads. The HTAP system therefore has specialized storage and a specialized query engine to stitch together the row-based and columnar data.

So far, we’re dealing with a single system. A Postgres (OLTP), a Clickhouse (OLAP), a SingleStore or TiDB (HTAP).

So what is LTAP?

Jack Vanlightly

June 19, 2026

AI

Raise the ambition threshold

“Perfection is finally attained not when there is no longer anything to add, but when there is no longer anything to take away.” — Antoine de Saint-Exupéry

AI gives us an unprecedented ability to add. The danger is that we begin to mistake accumulation for value.

Delivery is only the beginning (or be mindful of catabolic collapse)

Every new system and feature adds obligations: it must be operated, secured, monitored, documented, integrated, upgraded and eventually replaced or retired. Hackers love a juicy target, even if it’s that half-forgotten service that nobody is sure is safe to turn off. If we respond to “cheaper” software creation by producing far more software, we may accumulate obligations faster than we acquire the capacity to discharge them. Under the weight of this proliferation of software, the organization starts to sacrifice its ability to build what it will need next in order to react effectively to changing market conditions and opportunities.

This is the dynamic described by catabolic collapse.

Jack Vanlightly

June 10, 2026

Kafka

Kafka Share Groups and Parallelizing Consumption - Part 3: Client-local parallelism

In the last post, Broker-Visible vs Client-Local Parallelism, we looked at two ways of scaling Kafka consumption. The final unit of parallelism can be visible to the broker, as consumers, or it can be local to the client, as threads, virtual threads, async tasks, or some other execution mechanism hidden behind a smaller number of consumers.

Broker-visible parallelism is simple to reason about: if each consumer processes records serially, we add more consumers to increase parallelism. But each consumer adds overhead to the brokers: broker-side protocol state, TCP connections, group membership, fetch state, and participation in the consumer or share group protocol. With long processing times and/or high throughput, the required number of parallel workers can easily exceed what is practical to model as broker-visible consumers.
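As a rough sizing sketch, the required number of parallel workers follows from Little's law: work in flight is roughly the arrival rate multiplied by the time each record spends being processed. The function name and the workload numbers below are illustrative assumptions, not figures from the benchmarks.

```python
def required_workers(records_per_second: int, processing_ms: int) -> int:
    """Little's law: in-flight records = arrival rate x processing time.

    Uses integer arithmetic (ceiling division) to avoid float rounding.
    """
    in_flight_times_1000 = records_per_second * processing_ms
    return -(-in_flight_times_1000 // 1000)  # ceiling division by 1000


# Hypothetical workload: 5,000 records/s with 50 ms of processing per record.
print(required_workers(5_000, 50))  # prints 250
```

With serial consumers, that hypothetical workload would need around 250 broker-visible consumers just to keep up, before allowing any headroom for bursts.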

That is where client-local parallelism becomes important. Instead of scaling by adding more consumers, each consumer application can poll records and process them concurrently inside the client. This allows a smaller number of Kafka consumers to drive a much larger amount of parallel work.
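A minimal sketch of the idea, using only the Python standard library rather than a real Kafka client. `poll_batch` and `process` are hypothetical stand-ins for a consumer poll and for per-record work, and this variant waits for each polled batch to finish before polling again, which is only one of several possible designs.

```python
from concurrent.futures import ThreadPoolExecutor


def poll_batch(source, max_records):
    """Hypothetical stand-in for a consumer poll: take up to max_records."""
    batch = []
    for record in source:
        batch.append(record)
        if len(batch) == max_records:
            break
    return batch


def process(record):
    """Placeholder for per-record work (an RPC, a database write, ...)."""
    return record * 2


def run(records, max_records=4, workers=8):
    source = iter(records)
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while True:
            batch = poll_batch(source, max_records)
            if not batch:
                break
            # One "consumer" fans the polled batch out to local worker threads.
            # pool.map returns results in input order; consuming it here waits
            # for the whole batch before the next poll.
            results.extend(pool.map(process, batch))
    return results


print(run(range(10)))
```

The broker would only see the single polling loop; the worker threads are invisible to it.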

In this post, we’ll compare client-local parallelism with consumer groups and share groups using the Apache Kafka clients, by way of Dimster, the benchmarking tool used throughout this series. Dimster uses the official Apache Kafka clients under the hood. The main comparison is between two styles of client-local parallelism: blocking and continuous.

Jack Vanlightly

June 4, 2026

Messaging Systems, Kafka

Broker-Visible vs Client-Local Parallelism

This post is a little side-quest from my “Kafka Share Groups and Parallelizing Consumption” series.

My “Kafka Share Groups and Parallelizing Consumption” series (part 1, part 2) has been laser-focused on how different configurations and behaviors affect parallel consumption in share groups. So far I’ve shown that you most definitely can hold share groups wrong. You could quite easily and inadvertently create a work queue and, with the right combination of things going against you, see a small number of consumers dominate, leaving most consumers starved of messages. All the while lag builds and builds. You need to know the settings and what they do.

But it’s worth asking the question: is parallelizing consumption what share groups are for?

Jack Vanlightly

May 27, 2026

Kafka

Kafka Share Groups and Parallelizing Consumption - Part 2: Producer Batches and share.acquire.mode

In the last post we used simulated consumer processing time to reveal how important it is to set an appropriate value for max.poll.records. The rule of thumb was a value somewhat lower than:

group.share.partition.max.record.locks / number of consumers per partition
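To make the arithmetic concrete, here is a small sketch of that rule of thumb. The lock limit, the consumer count, and the 80% headroom factor used for "somewhat lower" are hypothetical values chosen for illustration; they are not recommendations or Kafka defaults.

```python
def max_poll_records_ceiling(max_record_locks: int, consumers_per_partition: int) -> int:
    """Upper bound: the record locks available to each consumer on a partition."""
    return max_record_locks // consumers_per_partition


def suggested_max_poll_records(max_record_locks: int,
                               consumers_per_partition: int,
                               headroom_pct: int = 80) -> int:
    """'Somewhat lower' than the ceiling; headroom_pct is an assumed factor."""
    ceiling = max_poll_records_ceiling(max_record_locks, consumers_per_partition)
    return ceiling * headroom_pct // 100


# Hypothetical setup: 2,000 record locks per partition, 10 consumers per partition.
print(max_poll_records_ceiling(2_000, 10))    # prints 200
print(suggested_max_poll_records(2_000, 10))  # prints 160
```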

But there’s more to parallel consumption than max.poll.records. The size of producer batches also plays a role when using the default share.acquire.mode (batch_optimized).

Jack Vanlightly

May 25, 2026

Kafka

Kafka Share Groups and Parallelizing Consumption — Part 1: Tuning max.poll.records

All tests were executed against Kafka 4.2.0 using Dimster.

In the last post we measured the overhead that the mechanics of share groups add, and saw that it is pretty small. We also saw that raw throughput was comparable to consumer groups, and that it even exceeded consumer group throughput in one test.

In this post we’re going to simulate processing time in the consumers to make these benchmarks more realistic and show the utility of share groups (namely the ability to parallelize processing beyond the partition count).

We’ll see how the following two configurations play an important role in parallelizing consumption with share groups:

max.poll.records (consumer config)
group.share.partition.max.record.locks (broker-side config)