This is part II of a series on benchmarking Kafka. Part I can be found here:
In the first part we used Spark to blast a 2 GB CSV file of 10 million rows into a three-machine Kafka cluster. We got the time down to about 30 seconds, which means it would take about 4 hours to blast a terabyte. That is fast, but not blazing fast.
Any number I put here will become obsolete within a year, perhaps sooner. Nevertheless, I'll put myself out there. If we could push 1 terabyte in 40 minutes on modest hardware, which works out to a bit over 400 MB/s, I think that would be enough to impress some people.
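To make the arithmetic behind these figures explicit, here is a small sketch. It assumes decimal units (1 GB = 1,000 MB, 1 TB = 1,000,000 MB), and the helper names are mine, purely for illustration.

```python
# Back-of-the-envelope throughput math, assuming decimal units.
MB_PER_GB = 1_000
MB_PER_TB = 1_000_000


def throughput_mb_s(size_mb: float, seconds: float) -> float:
    """Average throughput in MB/s for a transfer of size_mb taking `seconds`."""
    return size_mb / seconds


def hours_to_send(size_mb: float, rate_mb_s: float) -> float:
    """Hours needed to send size_mb at a steady rate_mb_s."""
    return size_mb / rate_mb_s / 3600


# Part I result: roughly 2 GB in roughly 30 seconds.
part_one_rate = throughput_mb_s(2 * MB_PER_GB, 30)
hours_per_tb = hours_to_send(MB_PER_TB, part_one_rate)

# Target: 1 TB in 40 minutes.
target_rate = throughput_mb_s(MB_PER_TB, 40 * 60)

print(f"part I rate: {part_one_rate:.1f} MB/s")
print(f"1 TB at that rate: {hours_per_tb:.2f} hours")
print(f"target rate: {target_rate:.0f} MB/s")
```

So the target is roughly six times the rate measured in part I.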
Now again, because of Kafka's memory flush cycle, we can only get the speed we want for up to 8 GB per machine. Really less, because some RAM is used by the OS itself and by any other applications running on those machines, including in my case Spark. So conservatively we can aim for 4 GB per machine, at 400 MB/s for two minutes straight.
Using some tricks, this kind of throughput can be accomplished on pretty modest hardware.
- no replication
Now the hard part is finding a machine gun that can fire messages that fast. A distributed solution seems like the best move, and it replicates real-world messaging: many sources, each blasting away messages.
So I fire up a Spark instance and load in a large CSV file of 10 million rows (~1.8 GB). I repartition the data set to take advantage of the number of cores available to me. Then I run the mapPartitions function, which lets each partition blast Kafka with all of its messages independently of the others, eliminating much of the overhead.
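The per-partition approach described above might look roughly like the following in PySpark. This is a sketch under assumptions, not the code from the benchmark: it assumes the kafka-python client (`KafkaProducer`) is installed on every executor, and the names `send_partition`, `kafka_producer_factory`, and `blast_csv`, along with the app name, are made up for illustration.

```python
from functools import partial
from typing import Callable, Iterable, Iterator


def send_partition(rows: Iterable[str],
                   make_producer: Callable[[], object],
                   topic: str) -> Iterator[int]:
    """Send every row of one partition through a single producer; yield the count."""
    producer = make_producer()  # one producer per partition, not one per message
    sent = 0
    try:
        for row in rows:
            producer.send(topic, value=row.encode("utf-8"))
            sent += 1
        producer.flush()  # block until buffered messages have been handed off
    finally:
        producer.close()
    yield sent


def kafka_producer_factory(bootstrap_servers: str):
    """Build a producer on the executor. Assumes kafka-python is installed there."""
    from kafka import KafkaProducer  # imported lazily so it resolves on the executor
    return KafkaProducer(bootstrap_servers=bootstrap_servers)


def blast_csv(path: str, topic: str, bootstrap_servers: str,
              num_partitions: int) -> int:
    """Read a CSV as raw lines, repartition, and send each partition independently."""
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("kafka-blast").getOrCreate()
    lines = spark.sparkContext.textFile(path).repartition(num_partitions)
    make_producer = partial(kafka_producer_factory, bootstrap_servers)
    # mapPartitions is lazy; sum() is the action that triggers the sends
    # and adds up the per-partition counts.
    return lines.mapPartitions(
        lambda rows: send_partition(rows, make_producer, topic)
    ).sum()
```

Calling `blast_csv` with a file path, topic name, broker address, and a partition count matched to the available cores would return the total number of messages sent. Creating the producer once per partition is what removes most of the per-message overhead.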
I then get a sustained message blast of about 200 MB/s without the machine falling over.
[My interest in benchmarking Kafka has been temporarily put to rest. I no longer have a Kafka cluster at my disposal and am looking at more local messaging solutions; both are zero broker: