UPD-2022: A remake of this blog post was published in 2022. UPD: Recent benchmark charts are here.
It’s common to use JSON as the main format for serialized data. It’s very convenient to use on both the client and the server. However, it’s not the best choice in terms of either data size or performance.
This article mainly focuses on the data size and performance of binary serialization libraries for Scala. The Java versions are included only for comparison.
TL;DR
Protobuf is small and fast. Scala’s implementation, ScalaPB, is robust and convenient, both for many small messages and for big ones.
Use cases
Many performance tests suffer from the synthetic nature of their data. This test is no exception, but here I test data models that are close to production ones (simplified a bit, of course). One case is a rich DTO (data transfer object); the other is a list of small events from which a rich DTO can be reconstructed.
Tested libraries
I chose several libraries to test:
- ScalaPB. A Google Protocol Buffers compiler implementation for Scala. All data objects are case classes, all protobuf features are supported.
- Pickling. Meant to be a Scala alternative to default Java serialization. Uses macros to generate serializers/parsers (picklers/unpicklers in its terminology).
- BooPickle. A custom binary format without backward compatibility. Also uses macros.
- Chill. Twitter’s extension for Kryo.
- Scrooge. A Thrift compiler implementation for Scala.
For Protobuf and Thrift, I also used the Java implementations to check compatibility and to compare performance with the originals. I also added default Java Serialization (the Serializable interface + ObjectInputStream) to the comparison.
I compared all libraries against Jackson, a well-known JSON serialization library.
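For reference, the default Java Serialization baseline boils down to a round trip through ObjectOutputStream and ObjectInputStream. The following is a minimal sketch, not the benchmark code itself; `DomainEntry` and `JavaSerializationBaseline` are hypothetical names introduced only for illustration.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-in for a domain class; Scala case classes are Serializable by default.
final case class DomainEntry(domain: String, primary: Boolean)

object JavaSerializationBaseline {
  // Serialize any Serializable value to a byte array.
  def serialize(value: java.io.Serializable): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    buffer.toByteArray
  }

  // Read the value back; the caller states the expected type.
  def deserialize[T](bytes: Array[Byte]): T = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject().asInstanceOf[T] finally in.close()
  }
}
```

The output includes class metadata (class names, field descriptors) alongside the values, which is one reason this format tends to be large for small objects.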
It’s worth mentioning a couple of libraries that I decided not to include in the comparison:
- Protostuff. An interesting library that supports protobuf and its own binary format, but it’s hard to support Scala collections there (impossible without changes inside the library).
- MsgPack. The Scala implementation of this library doesn’t support case classes. The idea is interesting (replace quotes, colons, etc.), but the gain is only ~25%, which doesn’t seem to be a big deal.
I also want to point to an interesting piece of research, JVM Serializers. It’s a good starting point for picking your serialization library.
Test data models
I tested libraries against two types of data objects:
- Site — a rich data transfer object with fields, lists etc.
- Events — a list of simple flat objects with at most 4 flat fields.
Site
A rich DTO consists of several fields with simple data types (UUID, dates, strings, etc.) and nested lists of other rich objects (without back-references). There are also a couple of rich objects with subclasses (see EntryPoint for an example).
The Site represents a website: it contains information about pages, meta tags, entry points and other meta information.
case class Site(id: UUID,
                ownerId: UUID,
                revision: Long,
                siteType: SiteType,
                flags: Seq[SiteFlag],
                name: String,
                description: String,
                domains: Seq[Domain],
                defaultMetaTags: Seq[MetaTag],
                pages: Seq[Page],
                entryPoints: Seq[EntryPoint],
                published: Boolean,
                dateCreated: Instant,
                dateUpdated: Instant)

sealed trait EntryPoint

final case class DomainEntryPoint(domain: String,
                                  primary: Boolean)
  extends EntryPoint

final case class FreeEntryPoint(userName: String,
                                siteName: String,
                                primary: Boolean)
  extends EntryPoint
Events
I’ve implemented the simplest possible event model for building a Site snapshot (as if it were event sourced). All events are very small, and each represents a single, very granular change in the model:
sealed trait SiteEvent
case class SiteCreated(id: UUID, ownerId: UUID, siteType: SiteType) extends SiteEvent
case class SiteNameSet(name: String) extends SiteEvent
case class SiteDescriptionSet(description: String) extends SiteEvent
case class SiteRevisionSet(revision: Long) extends SiteEvent
case class SitePublished() extends SiteEvent
// etc
Events come as a sequence, ordered by timestamp. For example, in MySQL they could be stored in a table like this:
CREATE TABLE site_events (
  id BINARY(16) NOT NULL,
  timestamp BIGINT NOT NULL,
  event_type INT NOT NULL,
  event_digest MEDIUMBLOB NOT NULL,
  PRIMARY KEY (id)
) ENGINE=InnoDB;
An example of how to reconstruct a Site from events (EventProcessor):
def apply(list: Seq[SiteEventData]): Site = {
  list.foldLeft(emptySite) {
    case (s, SiteEventData(_, SiteRevisionSet(rev), dateUpdated)) =>
      s.copy(revision = rev, dateUpdated = dateUpdated)
    case (s, SiteEventData(_, SiteNameSet(name), _)) =>
      s.copy(name = name)
    ...
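To make the fold above easier to try out, here is a self-contained sketch of the same idea on a reduced model. It is an illustration under assumptions, not the original EventProcessor: `MiniSite` and `MiniEventProcessor` are hypothetical names, only four of the events are modelled, and the shape of `SiteEventData` (id, event, timestamp) is inferred from the pattern matches in the fragment.

```scala
import java.time.Instant
import java.util.UUID

// Simplified, hypothetical model: only a few of the fields from the article's Site.
final case class MiniSite(name: String, description: String, revision: Long,
                          published: Boolean, dateUpdated: Instant)

sealed trait SiteEvent
final case class SiteNameSet(name: String) extends SiteEvent
final case class SiteDescriptionSet(description: String) extends SiteEvent
final case class SiteRevisionSet(revision: Long) extends SiteEvent
final case class SitePublished() extends SiteEvent

// Assumed envelope shape: (event id, payload, timestamp).
final case class SiteEventData(id: UUID, event: SiteEvent, timestamp: Instant)

object MiniEventProcessor {
  val emptySite: MiniSite = MiniSite("", "", 0L, false, Instant.EPOCH)

  // Apply events in timestamp order; each event changes one aspect of the snapshot.
  def apply(events: Seq[SiteEventData]): MiniSite =
    events.sortBy(_.timestamp.toEpochMilli).foldLeft(emptySite) {
      case (s, SiteEventData(_, SiteRevisionSet(rev), ts)) =>
        s.copy(revision = rev, dateUpdated = ts)
      case (s, SiteEventData(_, SiteNameSet(name), _)) =>
        s.copy(name = name)
      case (s, SiteEventData(_, SiteDescriptionSet(text), _)) =>
        s.copy(description = text)
      case (s, SiteEventData(_, SitePublished(), _)) =>
        s.copy(published = true)
    }
}
```

Because `SiteEvent` is sealed, the compiler can warn when a new event type is added but not handled in the fold.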
Tests
There are 5 Site objects to test: 1k, 2k, 4k, 8k, and 64k. The names refer to JSON size: 1k is the object whose JSON representation is approximately 1 kilobyte, 2k is ~2 kilobytes of JSON, and so on.
Events are produced from these 5 objects: 1k events are the events from which the 1k object can be reconstructed, and so on.
Data size
Sizes (in bytes) for Site (rich DTO):
Converter 1k 2k 4k 8k 64k
JSON 1060 2076 4043 8173 65835
Boopickle 544 1130 1855 2882 16290
Protobuf 554 1175 1930 3058 27111
Thrift 712 1441 2499 4315 38289
Chill 908 1695 2507 3643 26261
Java 2207 3311 4549 6615 43168
Pickling 1628 2883 5576 11762 97997
BooPickle is the leader, which is understandable: the library doesn’t support backward compatibility, so it doesn’t need to store field tags. Chill does relatively better on the very big object, where it edges out Protobuf (26261 vs. 27111 bytes). Thrift is not as good (perhaps because of how optional fields are implemented).
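The cost of field tags can be illustrated with a small sketch. It encodes the same integer fields twice: once in the protobuf style, where each value is preceded by a varint tag of the form `(fieldNumber << 3) | wireType`, and once as bare varints in a fixed order. This is a simplification written for this comparison (only wire type 0 is handled, and `FieldTagOverhead` is a hypothetical name); it is not BooPickle’s actual format.

```scala
import java.io.ByteArrayOutputStream

object FieldTagOverhead {
  // Base-128 varint, as used by the protobuf wire format.
  def writeVarint(out: ByteArrayOutputStream, value: Long): Unit = {
    var v = value
    while ((v & ~0x7FL) != 0L) {
      out.write(((v & 0x7F) | 0x80).toInt) // low 7 bits plus continuation bit
      v >>>= 7
    }
    out.write(v.toInt)
  }

  // Tagged encoding: every field carries (fieldNumber << 3) | wireType, wire type 0 = varint.
  def tagged(fields: Seq[(Int, Long)]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    fields.foreach { case (number, value) =>
      writeVarint(out, number.toLong << 3)
      writeVarint(out, value)
    }
    out.toByteArray
  }

  // Untagged encoding: values only, in an order both sides must agree on.
  def untagged(values: Seq[Long]): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    values.foreach(writeVarint(out, _))
    out.toByteArray
  }
}
```

For field numbers below 16 the tag costs one extra byte per field. That is the price of being able to skip unknown fields and add new ones later.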
Sizes (in bytes) for events (sum for all events in list):
Converter 1k 2k 4k 8k 64k
JSON 1277 2499 5119 10961 109539
Boopickle 593 1220 2117 3655 42150
Protobuf 578 1192 2076 3604 42455
Thrift 700 1430 2639 4911 57029
Chill 588 1260 2397 3981 47048
Java 2716 5078 11538 26228 240267
Pickling 1565 3023 6284 13462 128797
Now protobuf looks even better.
Protobuf, Thrift, Chill and BooPickle are roughly 2 to 3 times more compact than JSON on events (Thrift is at the lower end of that range). For the rich DTO, Java Serialization produces smaller output than Pickling on big objects, and the reverse holds for small objects.
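The compactness ratios can be recomputed directly from the events table. The snippet below is just that arithmetic: the numbers are copied from the table, and `EventSizeRatios` is a hypothetical helper name.

```scala
object EventSizeRatios {
  // Total event sizes in bytes, copied from the events table (columns: 1k, 2k, 4k, 8k, 64k).
  val json: Seq[Int] = Seq(1277, 2499, 5119, 10961, 109539)
  val binary: Map[String, Seq[Int]] = Map(
    "Boopickle" -> Seq(593, 1220, 2117, 3655, 42150),
    "Protobuf"  -> Seq(578, 1192, 2076, 3604, 42455),
    "Thrift"    -> Seq(700, 1430, 2639, 4911, 57029),
    "Chill"     -> Seq(588, 1260, 2397, 3981, 47048)
  )

  // JSON size divided by the binary size, per column.
  def ratios(name: String): Seq[Double] =
    json.zip(binary(name)).map { case (j, b) => j.toDouble / b }

  // Print one row of ratios per library.
  def main(args: Array[String]): Unit =
    binary.keys.toSeq.sorted.foreach { name =>
      println(f"$name%-10s " + ratios(name).map(r => f"$r%.2f").mkString(" "))
    }
}
```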
Data size and compression
Another interesting aspect of data size is compression. Compression is widely used in modern systems, from databases (e.g. the “compressed” row format in MySQL) to networks (gzip over HTTP), so raw data size may matter less than it seems. The comparison table for gzipped and raw objects is pretty big, so I will show only a small part (the entire table is available here).
Converter site 2k events 2k site 8k events 8k
JSON (raw) 2076 2499 8173 10961
JSON (gz) 1137 2565 2677 11784
Protobuf (raw) 1175 1192 3058 3604
Protobuf (gz) 898 1463 2175 5552
Thrift (raw) 1441 1430 4315 4911
Thrift (gz) 966 1669 2256 6673
Important note: events are gzipped one by one, not together. If they were gzipped together, the resulting size would be much smaller. This shows the importance of choosing the right storage/transfer mechanism.
On small objects (up to about 2k), raw protobuf is almost the same size as gzipped JSON (1175 vs. 1137 bytes for the 2k Site), so we can skip compression and save some CPU cycles.
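The effect of gzipping events one by one versus together can be reproduced with the JDK alone. The sketch below uses made-up JSON-like payloads as stand-ins for serialized events (they are not the benchmark data), and `GzipBatching` is a hypothetical name.

```scala
import java.io.ByteArrayOutputStream
import java.nio.charset.StandardCharsets
import java.util.zip.GZIPOutputStream

object GzipBatching {
  // Gzip a byte array in memory.
  def gzip(bytes: Array[Byte]): Array[Byte] = {
    val buffer = new ByteArrayOutputStream()
    val out = new GZIPOutputStream(buffer)
    try out.write(bytes) finally out.close()
    buffer.toByteArray
  }

  // Hypothetical small payloads standing in for serialized events.
  val events: Seq[Array[Byte]] = (1 to 100).map { i =>
    s"""{"type":"SiteRevisionSet","revision":$i}""".getBytes(StandardCharsets.UTF_8)
  }

  val rawTotal: Int = events.map(_.length).sum
  // Each event compressed separately: every result carries its own gzip header and trailer.
  val gzippedOneByOne: Int = events.map(e => gzip(e).length).sum
  // All events concatenated and compressed once: repetition across events is exploited.
  val gzippedTogether: Int = gzip(Array.concat(events: _*)).length
}
```

With payloads this small, the per-message gzip framing outweighs any savings, which matches the table: the gzipped events are larger than the raw ones.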
Performance
The next important thing is the performance of serialization and deserialization (parsing). The tests convert between raw bytes and Scala objects. To simplify the testing, I converted generated classes to “domain” classes, so for protobuf and Thrift the measurements also include this object conversion (I don’t think its effect is significant).
I excluded Java Serialization and Pickling from this chart (and from the other charts) because both of them are very slow. I cover them separately below.
The code for the performance tests is in BasePerfTest.scala.
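As a rough idea of what a System.nanoTime-based measurement can look like, here is a minimal sketch. It is not the contents of BasePerfTest.scala; `NanoTimer` and its parameters are hypothetical, and a dedicated harness such as JMH would give more reliable numbers.

```scala
object NanoTimer {
  // Average wall-clock nanoseconds per call of `body`, measured after a warm-up phase.
  def averageNanos(warmup: Int, iterations: Int)(body: => Any): Double = {
    var i = 0
    while (i < warmup) { body; i += 1 } // give the JIT a chance to compile the hot path
    val start = System.nanoTime()
    i = 0
    while (i < iterations) { body; i += 1 }
    (System.nanoTime() - start).toDouble / iterations
  }
}
```

A call could look like `NanoTimer.averageNanos(warmup = 10000, iterations = 100000)(payload.getBytes("UTF-8"))`, where `payload` is any string to be encoded.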
Serialization performance
The following chart shows the serialization times for the Site object (measured in nanoseconds with the System.nanoTime method).
ScalaPB is the clear winner, with Java protobuf and Thrift following. BooPickle and Chill are slightly slower for “small” objects and a bit better for bigger ones.
The next chart shows serialization times for events.
ScalaPB is still the winner. BooPickle is very slow in this contest; apparently, many small messages are not the right scenario for it.
In terms of numbers, ScalaPB is more than 2 times faster than JSON for a single rich DTO and more than 4 times faster for a list of small events.
Deserialization (parsing) performance
The next chart shows deserialization times for the Site object:
ScalaPB is the winner. BooPickle looks much better here; apparently, it performs many optimizations during serialization that cost a lot.
For events, the fastest library is Java Protobuf (I don’t know why, but this was confirmed after several runs).
ScalaPB is ~3 times faster than JSON for a rich DTO and ~3–4 times faster for a list of small events. The numbers are 2 microseconds vs. 7 microseconds for the 1k Site and 3 microseconds vs. 12 microseconds for 1k events.
Java Serialization and Pickling
Pickling performance is surprisingly bad. I may have missed something or done it wrong (though I followed all the tips from its manual), but Pickling is the slowest library in this test.
For comparison, here are the times in nanoseconds. Serialization of a rich DTO:
Converter 1k 2k 4k 8k 64k
JSON 4365 8437 16771 35164 270175
Serializable 13156 21203 36457 79045 652942
Pickling 53991 83601 220440 589888 4162785
Deserialization of a rich DTO:
Converter 1k 2k 4k 8k 64k
JSON 7670 12964 24804 51578 384623
Serializable 61455 84196 102870 126839 575232
Pickling 40337 63840 165109 446043 3201348
Conclusion
As expected, binary serialization is faster and produces less data. ScalaPB showed very good results (as did the protobuf format in general).
Nevertheless, performance and data size alone are not enough to justify moving from JSON to protobuf. Still, it’s important to know how much JSON costs.
Originally posted on Medium.