jstream: a faster, extensible gron
2024-04-29
gron is a popular tool for viewing the structure of large JSON documents. It works by traversing the JSON document and outputting a combination of a given JSON value and the path in the document that points to that value.
From gron's readme:
▶ gron testdata/two.json
json = {};
json.contact = {};
json.contact.email = "mail@tomnomnom.com";
json.contact.twitter = "@TomNomNom";
json.github = "https://github.com/tomnomnom/";
json.likes = [];
json.likes[0] = "code";
json.likes[1] = "cheese";
json.likes[2] = "meat";
json.name = "Tom";
gron
is a great idea for a tool whose main implementation is lacking in a few respects, which I've tried to remedy a few times. My first attempt was jindex, which actually works fine, with the exception of two main flaws: it keeps the entire parsed JSON document in memory for the duration of the program, and it makes a ton of tiny allocations (on the order of 1 per path). This means jindex, when run on large documents, has a large memory overhead, and its propensity for allocation slows execution.
jstream is my attempt solve these issues, to improve on both gron
and jindex
.
jstream
does the same as both, in this case emitting JSON paths rather than Javascript accessors:
$ echo '{
"a": 1,
"b": 2,
"c": ["x", "y", "z"],
"d": {"e": {"f": [{}, 9, "g"]}}
}' | jstream
/a 1
/b 2
/c/0 "x"
/c/1 "y"
/c/2 "z"
/d/e/f/1 9
/d/e/f/2 "g"
jstream
uses a different architecture than both gron
and jindex
. It uses a streaming JSON parser, namely the aws-smithy-json library. This addresses the deficiencies of jindex
by reading the JSON document incrementally, and only loading and emitting paths as it traverses them. It also avoids allocating for every path by reusing a single Vec
through the entire life of the program.
Because of this architecture, jstream
has a dramatically lower memory overhead than jindex
(I don't know about gron
's memory usage; I have not benchmarked it for memory). On a ~180MB document (citylots.json
), jstream
uses only about 1.8MB more memory than the document itself (191,725,568 bytes maximum resident set size vs. 189,778,220 bytes document size). For comparison, jindex
uses 1,139,638,272 bytes maximum resident set size, or ~1087MB.
jstream
is also faster than both jindex
and significantly faster than gron
.
Running against another large-ish document (32MB) generated with the Python script here, the results look like this for gron
(version 0.7.1 from homebrew, both sorted), jindex
, and jstream
:
$ hyperfine -w3 -r9 --output=null "gron big.json"
Benchmark 1: gron big.json
Time (mean ± σ): 7.723 s ± 0.043 s [User: 9.490 s, System: 1.437 s]
Range (min … max): 7.663 s … 7.782 s 9 runs
$ hyperfine -w3 -r9 --output=null "gron --no-sort big.json"
Benchmark 1: gron --no-sort big.json
Time (mean ± σ): 4.336 s ± 0.023 s [User: 6.135 s, System: 1.427 s]
Range (min … max): 4.301 s … 4.369 s 9 runs
$ hyperfine -w3 -r9 --output=null "jindex big.json"
Benchmark 1: jindex big.json
Time (mean ± σ): 483.6 ms ± 4.4 ms [User: 365.2 ms, System: 93.2 ms]
Range (min … max): 477.9 ms … 489.9 ms 9 runs
$ hyperfine -w3 -r9 --output=null "jstream big.json"
Benchmark 1: jstream big.json
Time (mean ± σ): 153.2 ms ± 0.5 ms [User: 142.2 ms, System: 6.8 ms]
Range (min … max): 152.5 ms … 154.1 ms 9 runs
jstream
is also extensible. The current implementation outputs JSON paths by default (like the example above), but jstream
can emit any path formatting you like, even like gron
. You only have to implement one method on one trait to tell jstream
how to print a given path and value combination:
pub trait PathValueWriter {
fn write_path_and_value(&mut self, path: Path, value: JsonAtom) -> std::io::Result<()>;
}
You can get jstream
here.