Kafka On the Shore: My Experiences Benchmarking Apache Kafka Part II

This is part II of a series on benchmarking Kafka. Part I can be found here:

In the first part we used Spark to blast a 2 GB CSV file of 10 million rows into a three-machine Kafka cluster. We got the time down to about 30 seconds, which means it would take about 4 hours to blast a terabyte. That is fast, but not blazing fast.

Any number I put here will become obsolete within a year, perhaps sooner. Nevertheless, I’ll put myself out there. If on modest hardware we could achieve 1 terabyte in 40 minutes, which is about 400 MB/s, that would be enough, I think, to impress some people.
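The arithmetic behind these figures is easy to check with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decimal units (1 GB = 1000 MB) and a constant rate, and the function names are mine, not part of any benchmark tooling.

```python
# Back-of-the-envelope throughput arithmetic (decimal units: 1 GB = 1000 MB).
MB_PER_GB = 1_000
MB_PER_TB = 1_000_000


def throughput_mb_per_s(size_mb: float, seconds: float) -> float:
    """Average throughput in MB/s for moving size_mb in the given time."""
    return size_mb / seconds


def hours_per_terabyte(rate_mb_per_s: float) -> float:
    """Hours needed to move one terabyte at a constant rate."""
    return MB_PER_TB / rate_mb_per_s / 3600


# Part I result: a 2 GB file in about 30 seconds.
part_one_rate = throughput_mb_per_s(2 * MB_PER_GB, 30)
# Target: 1 TB in 40 minutes.
target_rate = throughput_mb_per_s(MB_PER_TB, 40 * 60)

print(f"Part I rate: {part_one_rate:.0f} MB/s")
print(f"Part I time per TB: {hours_per_terabyte(part_one_rate):.1f} hours")
print(f"Target rate: {target_rate:.0f} MB/s")
```

That works out to roughly 67 MB/s and 4.2 hours per terabyte for Part I, against a target of roughly 417 MB/s, so the goal is about a sixfold speedup.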

Now again, because of Kafka’s memory flush cycle, we can only sustain the speed we want for up to 8 GB per machine. Really less, because some RAM is used by the OS itself and by any other applications running on those machines, including, in my case, Spark. So conservatively we can aim for 4 GB per machine, at 400 MB/s for two minutes straight.

Using some tricks, this kind of throughput can be achieved on pretty modest hardware:

no replication
partitioning
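As a rough illustration of those two tricks, here is how a topic with a single replica and many partitions could be created from Python. This is a sketch, not the setup used for these benchmarks: it assumes the kafka-python package, and the helper names, the topic name, and the broker address in the usage note are all hypothetical.

```python
def fast_topic_spec(name: str, num_partitions: int) -> dict:
    """Settings for a throughput-oriented topic: many partitions, a single replica."""
    if num_partitions < 1:
        raise ValueError("a topic needs at least one partition")
    return {
        "name": name,
        "num_partitions": num_partitions,
        "replication_factor": 1,  # 1 = a single copy, i.e. no replication
    }


def create_fast_topic(bootstrap_servers: str, name: str, num_partitions: int) -> None:
    """Create the topic on a live cluster (requires the kafka-python package)."""
    # Imported lazily so the spec helper above works without kafka-python installed.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        admin.create_topics([NewTopic(**fast_topic_spec(name, num_partitions))])
    finally:
        admin.close()
```

A call such as `create_fast_topic("broker1:9092", "benchmark", 12)` would then create the topic. Skipping replication means a broker failure loses data, which is acceptable for a benchmark but rarely for production.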

Now the hard part is finding a machine gun that can fire those messages that fast. A distributed solution seems like the best move, and it mirrors real-world messaging: many sources, each blasting away messages.

So I fire up a Spark instance and load in a large CSV file of 10 million rows (~1.8 GB). I repartition the data set to take advantage of the number of cores available to me. Then I run the mapPartitions function, which lets each partition blast Kafka with all of its messages independently of the others, eliminating much of the overhead.
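The shape of that job can be sketched in PySpark roughly as follows. This is not the original benchmark code: it assumes the kafka-python package on every executor, reads the CSV as plain text lines (so a header row would be sent as a message too), and uses a hypothetical topic name and broker address. Setting `acks=0` is my assumption for a fire-and-forget benchmark.

```python
from typing import Callable, Iterable, Iterator

TOPIC = "benchmark"        # hypothetical topic name
BROKERS = "broker1:9092"   # hypothetical broker address


def make_producer():
    """Build one producer per partition (requires the kafka-python package)."""
    from kafka import KafkaProducer  # imported lazily, on the executor

    # acks=0 is an assumption: fire-and-forget, trading delivery guarantees for speed.
    return KafkaProducer(bootstrap_servers=BROKERS, acks=0)


def send_partition(rows: Iterable[str],
                   producer_factory: Callable = make_producer) -> Iterator[int]:
    """Send every row of one partition through a single producer; yield the count sent."""
    producer = producer_factory()
    sent = 0
    try:
        for row in rows:
            producer.send(TOPIC, row.encode("utf-8"))
            sent += 1
        producer.flush()  # wait until the buffered messages have actually been sent
    finally:
        producer.close()
    yield sent


def run(csv_path: str, num_partitions: int) -> int:
    """Load the file with Spark, repartition it, and blast each partition independently."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-blast").getOrCreate()
    lines = spark.sparkContext.textFile(csv_path).repartition(num_partitions)
    return lines.mapPartitions(send_partition).sum()
```

The point of the design is that the producer is created once per partition rather than once per message, so connection setup is paid only as many times as there are partitions.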

I then get a sustained message blast of about 200 MB/s without the machine falling over.

[My interest in benchmarking Kafka has been temporarily put to rest. I no longer have a Kafka cluster at my disposal and am looking at more local messaging solutions; both are brokerless:

ZeroMQ

Aeron]


iabdb

It's All Been Done Before.
