jindex - enumerating paths in large JSON documents

2020-04-08

A while back (a year ago?) I was debugging an issue on embedded hardware that dealt with processing large (megabytes) JSON documents. I was using jq to format these blobs, but it was difficult knowing where anything was, as the documents were so large and complex. I was tracing paths by hand to try to figure out how to access specific values in these huge documents, and it got tedious. I had the idea that you could compute the pathwise index for every value in a JSON document. A JSON document is a tree, so this is basically a tree traversal that shows you how to get from the root to every leaf. You could then grep for the values you were interested in, and the traversal paths would be shown to you for those specific values.

I tried and failed a few times to write this tool, but ultimately succeeded in writing jindex.

Using it looks like this:

$ echo '{
  "a": 1,
  "b": 2,
  "c": ["x", "y", "z"],
  "d": {"e": {"f": [{}, 9, "g"]}}
}' | jindex

["a"]   1
["b"]   2
["c", 0]        "x"
["c", 1]        "y"
["c", 2]        "z"
["d", "e", "f", 0]      {}
["d", "e", "f", 1]      9
["d", "e", "f", 2]      "g"

By default, the path (on the left) is separated from the value (on the right) by a tab character. This is configurable with a command line option. Each path/value combination is separated from subsequent pairs by a newline.

You can see that it does a breadth-first traversal, showing values at each level before moving deeper in the document.

Let's say I am only interested in accessing the value 9, deeply nested under the "d" key. Combining it with grep looks like this:

$ echo '{
  "a": 1,
  "b": 2,
  "c": ["x", "y", "z"],
  "d": {"e": {"f": [{}, 9, "g"]}}
}' | jindex | grep "9"
["d", "e", "f", 1]      9

Now you know that to access 9, you need to lookup ["d", "e", "f", 1] from the root object.

There are a few other options but that's basicallly all there is to it.

Internally, it makes use of a queue (rather than recursion) to perform the tree traversal, and reference-counted pointers to do structural sharing to minimize memory copying.

You can download and find the source here.