It seems that JSON has become the lingua franca of the Web 2.0 world. It’s simple, extensible, easily parsed by browsers, easily understood by humans, and so on. It’s no surprise, then, that a lot of our Big Data ETL tasks end up extracting JSON from some external system, aggregating and transforming it, and then outputting JSON to be consumed by various other systems.
Hive already has some builtin mechanisms for dealing with JSON, but honestly, I think they are somewhat awkward. The `get_json_object` UDF allows you to pull specific fields out of a JSON string, but requires you to specify them with a JsonPath-style expression, which can become hairy, and the output is always a string. There is also the `json_tuple` UDTF, but this is similar to performing multiple `get_json_object` calls at once. These UDFs assume that you know which fields exist in the JSON ahead of time, and don’t handle structures like arrays and maps very well.
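For comparison, here is what the builtin approach looks like (the table and field names here are hypothetical):

```sql
-- get_json_object pulls one field at a time, always as a string
SELECT get_json_object(json_col, '$.user.id')   AS user_id,
       get_json_object(json_col, '$.user.name') AS user_name
  FROM events;

-- json_tuple extracts several top-level keys at once
SELECT t.id, t.name
  FROM events
  LATERAL VIEW json_tuple(json_col, 'id', 'name') t AS id, name;
```

Note that `json_tuple` only reaches top-level keys, and both return strings that you then have to cast yourself.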
There are several JSON SerDes which attempt to simplify dealing with JSON, and they can sometimes help, but often they are not what we want. With a SerDe, you generally define a table with a single column containing the JSON string, but often that’s not our situation: we have JSON alongside other columns (perhaps an ID or a timestamp), or multiple JSON columns to deal with. SerDes also usually require that the JSON schema be known in advance, and can be awkward with recursive structures.
There also isn’t an easy way to generate JSON from Hive. The Hive code quickly becomes an ugly mess of `concat` functions, and with nested structures it is often not clear whether all the curly braces and brackets are balanced. It’s frustrating to waste hours of cluster time (and real time!) producing a large dataset which is ‘not quite’ valid JSON.
To make life easier, Brickhouse provides several UDFs for parsing and generating JSON to and from Hive structures. Once parsed into a Hive type, data can be manipulated as one normally would (with all the other UDFs in Brickhouse!) and then output back into JSON.
Let’s start with the simple, specific cases. Often we’ll have a string containing a JSON array or a JSON map, and we simply want to interpret it as a Hive list or map. That’s what `json_split` and `json_map` do. (It’s called `json_split` because it’s similar to the `split` method that turns a string into an array, except that we’re dealing with JSON rather than a plain delimiter.)
Without any extra arguments, these UDFs assume that the values you want are strings. However, any primitive Hive type is possible: pass the desired type as the second argument, and the returned value will be of that type (for example ‘bigint’ for `json_split`, or ‘string,double’ for `json_map`, which takes the key and value types together).
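A quick sketch of how these might be used (the results noted in comments are what I’d expect given the behavior described above):

```sql
-- Interpret a JSON array as a Hive array (values are strings by default)
SELECT json_split('["a","b","c"]');

-- Second argument gives the element type
SELECT json_split('[1,2,3]', 'bigint');        -- an array of bigints

-- For json_map, the key and value types are passed together
SELECT json_map('{"a":1.0,"b":2.5}', 'string,double');  -- a map<string,double>
```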
More often than not, however, we have something a little more complex. With `from_json` we are able to parse JSON of any arbitrary schema. To accomplish this, we need to pass in a description of the schema as an argument.
There are two ways to do this. One way is to pass, as the second argument, a template object of the type you would like returned. This is usually done using the `map()` or `array()` UDFs to create an empty object whose type can be interpreted.
This was a little awkward to explain to people, however, so we introduced a string argument where you can simply pass in the Hive type string of the returned type (e.g. ‘array<map<string,double>>’).
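Both styles might look something like this (the template-object literal is just one way to construct an empty object of the right type):

```sql
-- 1. Template object: an empty array of map<string,double>
SELECT from_json('[{"a":1.5},{"b":2.0}]', array(map('', 0.0)));

-- 2. Hive type string (added later; much easier to read and explain)
SELECT from_json('[{"a":1.5},{"b":2.0}]', 'array<map<string,double>>');
```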
JSON maps and named_struct
In JSON, maps can have values of multiple types: one value in the map could be a string, and another could be an array. To support this, sometimes we don’t want to interpret a JSON map as a Hive map, but rather as a ‘named_struct’. Using named_struct in this way allows us to map any arbitrary JSON schema onto a Hive type.
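For instance, a JSON map whose values are a string and an array could be read as a struct (the field names here are hypothetical):

```sql
-- Heterogeneous JSON map interpreted as a named_struct, not a Hive map
SELECT from_json(
         '{"name":"alice","scores":[10,20]}',
         'struct<name:string,scores:array<bigint>>'
       );
```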
For the round trip, we just need to convert back into valid JSON, which is what `to_json` is for. We simply pass in the object we want to output; its type is interpreted at query parse time to decide the format of the output.
Again, named_structs are useful for generating JSON maps with heterogeneous value types, and Hive’s builtin `named_struct` UDF is handy for constructing them.
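The round trip might look like this (assuming `to_json` serializes the struct as described above; field names are again hypothetical):

```sql
-- Build a named_struct and serialize it back to JSON
SELECT to_json( named_struct(
         'name',   'alice',
         'scores', array(10, 20)
       ) );
-- something like: {"name":"alice","scores":[10,20]}
```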
Converting to and from CamelCase and snake_case
JSON schemas in the wild, however, often use CamelCase for their keys. This wouldn’t be a problem, except that Hive types (and HiveQL in general) are case-insensitive. There is no difference to the Hive parser between ‘value’, ‘VALUE’, ‘Value’, ‘VaLuE’, and ‘vAlUe’, so we can never directly represent a CamelCase JSON schema as a Hive type.
As a cheap workaround, the `to_json` and `from_json` UDFs accept an optional boolean flag which converts keys between snake_case on the Hive side and CamelCase on the JSON side.
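Assuming the boolean flag behaves as I recall from the Brickhouse API (worth double-checking against the Brickhouse docs), usage would look like:

```sql
-- snake_case field names on the Hive side, CamelCase keys in the JSON output
SELECT to_json( named_struct('user_name', 'alice', 'num_scores', 2), true );
-- something like: {"userName":"alice","numScores":2}
```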
I hope that helps clear up some confusion about how to use the JSON UDFs in Brickhouse. In our organization, dealing with JSON in Hive was possible, but extremely ugly and very brittle in practice. With the JSON UDFs, I believe it’s gotten just a little better.