One of our services recently started to perform multiple inserts and deletes in MySQL database. To the point of a noticeable response time increase. Batching of SQL queries is nothing new, but I decided to wander around this topic a bit in Internet, and stumbled upon something I never heard before (or blissfully forgot). A rewriteBatchedStatements property of MySQL JDBC driver. Here I am, benchmarking this thing and checking other options…
Disclaimer 1. To be honest, this post doesn’t contain anything new or special in it. I’ve just discovered something interesting and would like to share it.
Disclaimer 2. This time I haven’t prepared standalone code example, as I did it for a specific task inside of our bazel monorepo using some of our internal tooling. Sorry. But there’s nothing special in those benchmarks.
What happened?
Initially we used queries like this (all examples are for scalikejdbc but it doesn’t really matter):
case class Entity(id: String, val1: String, val2: Boolean, val3: Array[Byte])
def add(entity: Entry): Unit = autoCommit { implicit session =>
sql"""
INSERT INTO tbl (id, col1, col2, col3)
VALUES (${entity.id}, ${entity.val1}, ${entity.val2}, ${entity.val3})
"""
.update()
.apply()
}
def remove(id: String): Unit = autoCommit { implicit session =>
sql"""DELETE FROM tbl WHERE id = $id"""
.update()
.apply()
}
As system evolved, we started calling those DAO methods in a loop (gotcha!). Eventually number of entities grew, and we got a problem :)
Batching
Batching is an attempt to reduce number of round-trips from application to a database. So, what driver does is — instead of sending a bunch of separate queries, it sends it this way:
INSERT INTO tbl (...) VALUES(...);
INSERT INTO tbl (...) VALUES(...);
Note the ;
at the end of the query. Those are still separate queries, but it’s sent as a single request. To code change
is not that big:
def add(entities: Seq[Entity]): Unit = localTx { implicit session =>
val params = entities.map { entity =>
Seq(
"id" -> entity.id,
"val1" -> entity.val1,
"val2" -> entity.val2,
"val3" -> entity.val3,
)
}
sql"""
INSERT INTO tbl (id, col1, col2, col3)
VALUES ({id}, {val1}, {val2}, {val3})
"""
.batchByName(params: _*)
.apply()
}
Similar thing we can do for DELETE:
def remove(ids: Seq[String]): Unit = localTx { implicit session =>
val params = ids.map(v => Seq(v))
sql"""DELETE FROM tbl WHERE id = ?"""
.batch(params: _*)
.apply()
}
Better DELETE (IN clause)
Actually, for delete we can do much more robust thing:
def remove(ids: Seq[String]): Unit = autoCommit { implicit session =>
val inClause = SQLSyntax.csv(ids.map(id => sqls"""$id"""): _*)
sql"""DELETE FROM tbl WHERE id IN ($inClause)"""
.update()
.apply()
}
rewriteBatchedStatements
I stumbled upon this parameter on StackOverflow and instantly decided to check how it works. Basically, it rewrites INSERT queries in multi-value queries:
INSERT INTO tbl (...)
VALUES (...), (...), (...)
Which makes it slightly more concise (less SQL) and something inside MySQL makes it faster.
Benchmarks
First, I verified that rewriteBatchedStatements
actually works by enabling profile logs
in driver, it showed that final queries. Second, I built a very simple benchmark that tests different flavors of batching
for INSERT
and for DELETE
with different number of entries. Each entry in my test is about few hundreds of bytes,
multiple columns, nothing fancy.
INSERT
As you may see in benchmark results below, multi-value INSERT is really the fastest.
oneByOne
is just a loop outside the DAO. Total time increases linearly (obviously). The problem here is simple — the
number of transactions in MySQL equals to number of rows that we’re inserting. And a transaction takes time.
Solution to this is to start transaction only once and then make queries in the loop within this transaction. This case
represented by oneByOneInTransaction
in the table. As you may see, its performance is comparable to a regular batching.
Benchmark (numberOfEntities) Mode Cnt Score Error Units
batch 1 avgt 2 12.797 ms/op
batchRewriting 1 avgt 2 11.769 ms/op
oneByOneInTransaction 1 avgt 2 12.624 ms/op
oneByOne 1 avgt 2 12.184 ms/op
batch 10 avgt 2 13.433 ms/op
batchRewriting 10 avgt 2 11.835 ms/op
oneByOneInTransaction 10 avgt 2 15.592 ms/op
oneByOne 10 avgt 2 125.161 ms/op
batch 100 avgt 2 29.763 ms/op
batchRewriting 100 avgt 2 22.480 ms/op
oneByOneInTransaction 100 avgt 2 35.664 ms/op
oneByOne 100 avgt 2 1281.417 ms/op
batch 1000 avgt 2 213.938 ms/op
batchRewriting 1000 avgt 2 148.009 ms/op
oneByOneInTransaction 1000 avgt 2 229.646 ms/op
batch 10000 avgt 2 2027.138 ms/op
batchRewriting 10000 avgt 2 1497.429 ms/op
oneByOneInTransaction 10000 avgt 2 2321.587 ms/op
DELETE
In case of DELETE the option rewriteBatchedStatements
shouldn’t affect anything. However, it’s still better than regular
batching.
As expected, by using IN
clause we get the best performance, as it’s a single query (just like in case of INSERT with
multi-values).
Benchmark (numberOfEntities) Mode Cnt Score Error Units
batch 1 avgt 2 21.636 ms/op
batchRewriting 1 avgt 2 15.237 ms/op
inClause 1 avgt 2 13.483 ms/op
oneByOneInTransaction 1 avgt 2 10.938 ms/op
oneByOne 1 avgt 2 12.273 ms/op
batch 10 avgt 2 16.328 ms/op
batchRewriting 10 avgt 2 14.396 ms/op
inClause 10 avgt 2 11.184 ms/op
oneByOneInTransaction 10 avgt 2 13.085 ms/op
oneByOne 10 avgt 2 124.575 ms/op
batch 100 avgt 2 21.893 ms/op
batchRewriting 100 avgt 2 17.696 ms/op
inClause 100 avgt 2 13.029 ms/op
oneByOneInTransaction 100 avgt 2 24.492 ms/op
oneByOne 100 avgt 2 1181.656 ms/op
batch 1000 avgt 2 104.244 ms/op
batchRewriting 1000 avgt 2 83.070 ms/op
inClause 1000 avgt 2 25.444 ms/op
oneByOneInTransaction 1000 avgt 2 130.383 ms/op
batch 10000 avgt 2 925.338 ms/op
batchRewriting 10000 avgt 2 854.990 ms/op
inClause 10000 avgt 2 167.237 ms/op
oneByOneInTransaction 10000 avgt 2 1254.676 ms/op
Conclusion
The way to achieve the best performance with a database is to use the least amount of queries. In case of INSERT
it’s
a multi-value query, in case of DELETE
it’s an IN
clause with multiple identifiers specified. For DELETE
queries
it’s on us to write it properly, and for INSERT
queries there is a very nice driver option that converts your batch
query into multi-value query and boosts the performance auto-magically!