tl;dr I wrote a JSON parser to improve performance for a very specific use case and found that it was also blazingly fast for the general case.
As is often the case, the really good solutions seem to come when you are solving a problem for yourself (my personal theory is that your own problems stay in the back of your mind at all times, whereas other people's problems are only there when you are actively thinking about solutions for them). In this case, I had been profiling the batch data ingestion code for our stream processing framework, StroomData. I noticed that we were spending a large amount of time splitting the batches into individual documents. I thought about it for several days and looked around for a faster JSON parser to use, when suddenly I got an idea.
Batches are delivered to StroomData as a string containing a JSON array:
We want each of those JSON objects extracted as strings that can be passed on separately to the storage engine. Unfortunately, there is no record separator in JSON that lets us quickly extract those objects as strings. The separating comma character can also appear inside string values and nested objects and arrays, so we have to parse the data to know which commas actually separate the top-level objects.
But what if, while we parse it, we make a note of the indexes into the source string for the start and end of each object? Then we could extract them as simple substrings of the original source string. This was the original idea that got me to start working on what would become LazyJSON.
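To make the index-tracking idea concrete, here is a minimal sketch. It is not the LazyJSON implementation; the class and method names are hypothetical. It assumes the input is valid JSON and that the top-level array directly contains objects. It walks the source once, tracks brace depth and whether it is inside a string, and remembers where each top-level object starts so the object can be cut out with a plain substring call.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of LazyJSON.
public class ArraySplitter {
    /** Returns each top-level object of a JSON array as a substring of the source. */
    public static List<String> splitObjects(String source) {
        List<String> objects = new ArrayList<>();
        int depth = 0;            // nesting depth counted in '{' and '}' only
        int start = -1;           // index of the '{' that opened the current top-level object
        boolean inString = false; // true while scanning inside a string literal
        for (int i = 0; i < source.length(); i++) {
            char c = source.charAt(i);
            if (inString) {
                if (c == '\\') {
                    i++;          // skip the character that follows a backslash
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                if (depth == 0) {
                    start = i;    // remember where this object begins
                }
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    // No tokens were built; the object is just a slice of the source.
                    objects.add(source.substring(start, i + 1));
                }
            }
        }
        return objects;
    }
}
```

Commas and braces inside string values are ignored because the scanner knows when it is inside a string, which is exactly the part a naive split on commas cannot get right.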
Parsing without actually extracting tokens is not a new idea. It has been used by other very fast parsers. However that data is usually either thrown away or not available to do the string extraction after parsing the source data. In LazyJSON, we keep the data and build exactly enough of an AST (Abstract Syntax Tree) from it that you can easily extract objects from an array. The following sample code shows the original use case for LazyJSON.
I have chosen to model the API after the json.org Java library, since I find it the most convenient to use for JSON data without a known structure. As long as the LazyJSON parser is competitive in speed, the overall speed should be faster than the competing libraries, since the serialization at the end is nothing more than a substring extraction from the original source string.
The following graph shows a box plot of this operation compared to a handful of other JSON libraries. The data being parsed is a 550 KB string of JSON data containing a single array with 1000 objects in it. These objects are modeled after the metric data we collect at DoubleDutch and feed through StroomData in production.
As can clearly be seen (even if you aren't familiar with a box plot), the performance advantage is massive! For more information about this benchmark have a look at the repo containing the full code and results for our benchmark tests of LazyJSON.
The big benefit demonstrated in the last section is not just performance—you could probably get close to the same levels using the Jackson tokenizer with a handbuilt minimal parser to find the right start and end tokens for the extraction. However, the performance comes along with a very simple-to-use API. After thinking about this for a while, I had the thought: what if I were to build something similar to the full API used to get data in the json.org API, but that would still operate in a lazy fashion, and would only parse the actual raw string data when it had to?
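As a rough illustration of what "lazy" means here, the sketch below records only the offsets of each field's key and value on the first pass, and converts a value to a Java type only when it is asked for. This is a simplified toy under stated assumptions, not LazyJSON's actual data structure or API: the class name `LazyView` and its methods are hypothetical, the input must be a valid JSON object, and keys and string values are assumed to contain no escape sequences that need decoding.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not LazyJSON's real classes or API.
public class LazyView {
    private final String source;
    // One entry per field: {keyStart, keyEnd, valueStart, valueEnd} offsets into source.
    private final List<int[]> fields = new ArrayList<>();

    public LazyView(String source) {
        this.source = source;
        int i = skipWhitespace(source.indexOf('{') + 1);
        while (i < source.length() && source.charAt(i) != '}') {
            int keyStart = i + 1;                 // just past the opening quote
            int keyEnd = endOfString(i);          // index of the closing quote
            i = skipWhitespace(keyEnd + 1);       // now at ':'
            int valueStart = skipWhitespace(i + 1);
            int valueEnd = endOfValue(valueStart);
            fields.add(new int[] {keyStart, keyEnd, valueStart, valueEnd});
            i = skipWhitespace(valueEnd);
            if (i < source.length() && source.charAt(i) == ',') {
                i = skipWhitespace(i + 1);
            }
        }
    }

    /** Returns the raw source text of a field's value without interpreting it. */
    public String getRaw(String key) {
        for (int[] f : fields) {
            if (f[1] - f[0] == key.length() && source.startsWith(key, f[0])) {
                return source.substring(f[2], f[3]);
            }
        }
        throw new IllegalArgumentException("no such field: " + key);
    }

    /** Parses the number only when the caller asks for it. */
    public int getInt(String key) {
        return Integer.parseInt(getRaw(key));
    }

    /** Strips the surrounding quotes; assumes the value contains no escapes. */
    public String getString(String key) {
        String raw = getRaw(key);
        return raw.substring(1, raw.length() - 1);
    }

    private int skipWhitespace(int i) {
        while (i < source.length() && Character.isWhitespace(source.charAt(i))) i++;
        return i;
    }

    // Given the index of an opening quote, returns the index of the closing quote.
    private int endOfString(int open) {
        int i = open + 1;
        while (source.charAt(i) != '"') {
            i += (source.charAt(i) == '\\') ? 2 : 1;
        }
        return i;
    }

    // Returns the index one past the end of the value that begins at start.
    private int endOfValue(int start) {
        char c = source.charAt(start);
        if (c == '"') return endOfString(start) + 1;
        if (c == '{' || c == '[') {
            int depth = 0;
            for (int i = start; i < source.length(); i++) {
                char d = source.charAt(i);
                if (d == '"') i = endOfString(i);
                else if (d == '{' || d == '[') depth++;
                else if ((d == '}' || d == ']') && --depth == 0) return i + 1;
            }
            throw new IllegalArgumentException("unterminated value");
        }
        int i = start;                             // number, true, false or null
        while (i < source.length() && ",}] \t\r\n".indexOf(source.charAt(i)) < 0) i++;
        return i;
    }
}
```

With this shape, `new LazyView(json).getString("type")` pays for one scan plus one substring, and nested objects or arrays come back through `getRaw` as untouched slices of the source.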
I expected that this approach would be very fast if you were looking to access just a single field or two on large complex JSON objects. This is in fact one of our use cases at DoubleDutch, where we inspect the metric type field on incoming data to decide how to process it further. I also expected that it would rapidly lose out to other parsers once we started pulling out more and more fields, since we would now be doing all the work we chose not to do when we first parsed the data. This would be where the lazy approach just wouldn’t cut it. Except that’s not really what happened when we started benchmarking it. Let’s have a look at a graph that shows the time spent parsing a batch of objects and accessing every one of their fields. This is once again a batch of 1000 objects, but this time much smaller objects with a field for each native scalar type in JSON. The total size of this dataset is 83 KB.
Unexpectedly, the lazy approach did not have the big penalty I anticipated. Once again, here is a link to the source code for these benchmarks.
Why is the lazy parser still able to compete here? My best theory is that the lazy parser is a very optimized parser—almost more of a tokenizer than a parser—that uses a very lightweight data structure giving it the overall savings to remain competitive even when all fields are accessed.
Does this mean that LazyJSON is the fastest parser out there and that everyone should use it for anything related to JSON in Java? No, far from it! There are some severe limitations to LazyJSON. First and foremost, it is purely a parser at this point: there is no way to generate data at all. Second, it requires that you keep the entire source string in memory. In fact, I suspect that Jackson would win many of these benchmarks if the source data were provided as a raw input stream, since it does its own very optimized character decoding.
However, let's have a quick look at the source code used for LazyJSON in this test.
Pretty simple and readable... data1 - 4 are lists used to capture the data to make sure the compiler doesn't remove the code since no one is using the values we extract! Now let's have a look at the code used for the Jackson JsonParser sample in this benchmark:
This is the real power of the encapsulation offered by LazyJSON. We are able to work our way through tokens almost as fast as shown in the above sample, but in a much more readable way!
When should you use LazyJSON?
Use it if your use case is similar to ours! Use it if your own benchmark tests using your own actual data show a meaningful performance improvement.
Do not use it if you are expecting it to validate your JSON data at the edge of your stack. We do intend for the parser to eventually fully validate the source given to it and we are pretty confident that it already does so, but, go look at the current set of unit tests. All of them are related to verifying that we get the correct output from valid JSON data—not a single one is verifying an error when given invalid data!
Where to go from here?
LazyJSON has been released as open source today under the Apache 2.0 license. We are going to continue using LazyJSON in production. We are very committed to immediately fixing any bugs filed on GitHub!
We have already ported this to C# as well as started testing it on Android. We will follow up with blog posts about each of these separately, but so far, the results look very good!
We would love it if you would try it out and send us all the use cases you find where it does not give you a meaningful improvement over your existing parser, as well as any bugs you find!
We have some neat ideas for how to turn it into a JSON generator, and we are going to start work on parsing JSON from a raw input stream. Follow this blog for more information about where this goes in the future and have a look at the StroomData project to see where all of this began!
Interested in solving engineering problems like this one? Check out our jobs board.