Hive and JSON made simple

It seems that JSON has become the lingua franca of the Web 2.0 world. It’s simple, extensible, easily parsed by browsers, easily understood by humans, and so on. It’s no surprise, then, that a lot of our Big Data ETL tasks end up extracting JSON from some external system, aggregating and transforming it, and then outputting JSON to be consumed by various other systems.

Hive already has some built-in mechanisms to deal with JSON, but honestly, I think they are somewhat awkward. The `get_json_object` UDF allows you to pull specific fields out of a JSON string, but requires you to specify the field with a JSONPath-style expression, which can get hairy, and the output is always a string. There is also the `json_tuple` UDTF, but this amounts to performing multiple `get_json_object` calls at once. These UDFs assume that you know what fields exist in the JSON ahead of time, and don’t handle structures like arrays and maps very well.
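
For reference, here’s a minimal sketch of that built-in route (the column `json_col` and the field names are just for illustration):

-- get_json_object: one path expression per call, and the result is always a string
select get_json_object(json_col, '$.user.id')   as user_id,
       get_json_object(json_col, '$.user.name') as user_name
from mytable;

-- json_tuple: several top-level keys at once, but still all strings
select jt.id, jt.name
from mytable
lateral view json_tuple(json_col, 'id', 'name') jt as id, name;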

There are several JSON SerDe’s which attempt to simplify dealing with JSON, and they can sometimes help, but often they are not what we want. With a SerDe you generally define a table whose single column is a JSON string, and often that’s not what our data looks like: we have JSON alongside other columns (perhaps an ID or timestamp), or multiple JSON columns to deal with. SerDe’s also usually require that the JSON spec be known in advance, and can be awkward with recursive structures.

There also isn’t an easy way to generate JSON from Hive. The Hive code quickly becomes an ugly mess of `concat` functions, and with nested structures it is often not clear whether all the curly braces and brackets are balanced. It’s frustrating to waste hours of cluster time (and real time!) producing a large data-set which is ‘not-quite’ valid JSON.
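
To picture the problem, here’s a sketch of the hand-rolled approach (the columns `id` and `score` are hypothetical); it holds together only until a value needs escaping or the structure nests another level:

-- every quote, comma and brace is managed by hand; one missed bracket and the output is invalid
select concat('{ "id":"', id, '", "score":', cast(score as string), ' }') as json_out
from mytable;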

To make life easier, Brickhouse provides several UDFs for parsing and generating JSON to and from Hive structures. Once parsed into a Hive type, data can be manipulated as one normally would (with all the other UDFs in Brickhouse!) and then output back into JSON.

json_split and json_map

Let’s start with the simple, specific cases. Often we’ll have a string containing a JSON array, or a JSON map, and we simply want to interpret it as a Hive list or map. That’s what `json_split` and `json_map` do. (It’s called `json_split` because it’s similar to the `split` method that turns a string into an array, except that we’re dealing with JSON rather than a plain delimiter.)


select json_val
from
mytable
lateral view explode(
json_split(' [ "val1", "val2", "val3" ] ')
) x as json_val;
Output is
val1
val2
val3


Without any extra arguments, it assumes that the values you want are strings. However, any primitive Hive type is possible. Pass the type you would like the values interpreted as in the second argument, and the returned values will be of that type (for example ‘bigint’ for `json_split`, or ‘string,double’ for `json_map`, which takes a key type and a value type).
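
For instance, a quick sketch of `json_split` with an explicit type argument, assuming the array holds integer values:

select json_val
from mytable
lateral view explode(
json_split(' [ 1, 2, 3 ] ', 'bigint')
) x as json_val;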


select
json_key,
json_val
from
mytable
lateral view explode_map(
json_map(' { "key1":1.1, "key2":34.5, "key3":6.7 }', 'string,double')
) x as json_key, json_val;
Output is:
key1 1.1
key2 34.5
key3 6.7


from_json

More often than not, however, we have something a little more complex. With `from_json` we can parse any arbitrary JSON schema, but to do so we need to pass in a description of that schema.

There are two ways to do this. One way is to pass, as the second argument, a template object of the type you would like returned. This is usually done using the `map()` or `array()` UDFs to create an empty object whose type can be inspected.


select
from_json(' { "key1":[0,1,2], "key2":[3,4,5,6], "key3":[7,8,9] } ',
map( "", array( cast(0 as int))))
from mytable;


This turned out to be a little awkward to explain to people, however, so we introduced a second form where you just pass in the Hive type string for the desired return type (e.g. ‘array<map<string,double>>’).


select
from_json(' { "key1":[0,1,2], "key2":[3,4,5,6], "key3":[7,8,9] } ',
'map<string,array<int>>')
from mytable;


JSON maps and named_structs

In JSON, one can have maps where the values are of different types: one value in the map could be a string, and another could be an array. To support this, sometimes we don’t want to interpret a JSON map as a Hive map, but rather as a Hive `named_struct`. Using `named_struct` in this way lets us map any arbitrary JSON schema onto a Hive type.


select
from_json(' { "key1":"string value", "key2":2.1, "key3":560000 }',
named_struct("key1", "", "key2", cast(0.0 as double), "key3", cast(1 as bigint) ) )
from mytable;


to_json

For the round trip, we just need to convert back into valid JSON. To do this, we simply pass in the object we want printed out; its type is inspected at query parse time to decide the format of the output.


select
to_json( array( 1, 2, 3, 4) )
from mytable;
Output is
" [ 1,2,3,4 ]"


Again, named_structs are useful for generating JSON maps with heterogeneous value types, and the Hive UDF `named_struct` makes it easy to construct them on the fly.


select
to_json( named_struct("key1", 0, "key2", "a string val", "key3", array( 4,5,6), "key4", map("a",2.3,"b",5.6) ))
from mytable;
Output is
' { "key1":0, "key2":"a string val", "key3":[4,5,6], "key4":{"a":2.3, "b":5.6} } '

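Putting the two halves together, here is a round-trip sketch (the column `json_col` and its fields are hypothetical): parse with `from_json`, reshape with ordinary Hive expressions, and serialize the result with `to_json`.

select to_json(
  named_struct("name", s.name,      -- carried through unchanged
               "total", s.a + s.b)  -- derived field computed in Hive
) as json_out
from (
  select from_json(json_col,
           named_struct("name", "", "a", cast(0 as bigint), "b", cast(0 as bigint))) as s
  from mytable
) parsed;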

Converting to and from CamelCase and snake_case

There can be some further frustrations, however. Consumers and producers of JSON are often of a Javascript ilk, and use a particular naming style called ‘CamelCase’ (where the upper-case letters in your identifiers add humps to the camel). Choosing between CamelCase and the alternative, underscore or ‘snake’ case (not sure where the snakes come from), is a religious debate which won’t be resolved in any of our lifetimes.

This wouldn’t be a problem, except that Hive types (and HiveQL) are case-insensitive. There is no difference to the Hive parser between ‘value’, ‘VALUE’, ‘Value’, ‘VaLuE’, and ‘vAlUe’. So we can never represent a CamelCase JSON schema directly as a Hive type.

As a cheap workaround, `to_json` and `from_json` take an extra boolean argument, which is a flag to convert to and from CamelCase. (A bit of a hack, perhaps, but let me know if you can think of something more elegant.) This lets us output in a format that those Javascript developers will love, and have it map cleanly onto our Hive types.


select
to_json(
named_struct("camel_case", 1,
"big_double", 9999.99,
"the_really_cool_value", "a string val",
"another_array", array( 4,5,6),
"random_map", map("a",2.3,"b",5.6) ), true)
from mytable;
Output:
{"camelCase":1,"bigDouble":9999.99,"theReallyCoolValue":"a string val","anotherArray":[4,5,6],"randomMap":{"b":5.6,"a":2.3}}


I hope that helps clear up some confusion about how to use the JSON UDFs in Brickhouse. In our organization, dealing with JSON in Hive was possible, but extremely ugly and very brittle in practice. With the JSON UDFs, I believe it’s now just a little better.


32 Responses to Hive and JSON made simple

  1. Costa says:

    Thanks for the examples.

    Would you be able to provide an example parsing an array of json objects? (They are all uniform.)
    i.e.

    {
    some_val: [{"key1":"object1", "key2":1.0}, {"key1":"object2", "key2":2.0}]
    }

    • jeromebanks says:

      You can either create a template object with the named_struct, array, and map UDF’s, or specify a Hive type string:
      select from_json(json, named_struct("some_val", array(named_struct("key1", "", "key2", cast(0.0 as double)))))
      from json_table;

      • Costa says:

        Thanks for the quick reply.

        It looks like the camel case example you have here is out of date. I get an error when trying to include the 3rd parameter. Attempting to use camel case results in a null result from the query. This is when using both 0.6.0 release and a 0.7.0-SNAPSHOT. Note: I had to do my own build of the snapshot with a fix on line 161 to prevent a NullPtrException.

        Here’s the error; the udf checks for only 2 arguments and spits this out.

        FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments ‘true’: from_json expects a JSON string and a template object.

      • jeromebanks says:

        Could you file a JIRA ticket against the github project? We’ll get it fixed in the next release.

  2. nick says:

    I have a json –
    original json snippet –
    {"key": "somehashvalue","columns": [["Event:2014-03-26 00\\:29\\:13+0200:json","{\"user\":{\"credType\":\"ADDRESS\",\"credValue\":\"01:AA:A4:G1:HH:UU\",\"cAgent\":null,\"cType\":\"ACE\"},\"timestamp\":1395786553,\"sessionId\":1395785353,\"className\":\"Event\",\"subtype\":\"CURRENTLYACTIVE\",\"vType\":\"TEST\",\"vId\":1235080,\"eType\":\"CURRENTLYACTIVE\",\"eData\":\"1\"}",1395786553381001],["Event:2014-03-26 00\\:29\\:13+0200:","",1395786553381001]]}

    I tried to use JSON SerDes to parse the above JSON into my Hive columns. However, 1395786553381001 above is not present in a format which a SerDe can map to a Hive column (since Hive understands JSON columns/values present after a ':').

    So instead I created a table –
    CREATE TABLE mytesttable (
    key string,
    columns array<array>
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

    LOAD DATA LOCAL INPATH '/home/user/testsample.json'
    OVERWRITE INTO TABLE mytesttable;

    select columns[0][1] from mytesttable;
    gives –
    {"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"}

    The above appears clean, but I also need columns[*][2], i.e. in a JSON Hive column, for further transformations.

    I wrote a regex hive query to cleanse the original Json (assume it is present in a table tablewithinputjson)

    CREATE TABLE internaltablewithcleanjson (
    columna string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

    INSERT OVERWRITE TABLE internaltablewithcleanjson
    SELECT
    REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(ij.columna, '["][{]', '{'), '[}]["]', '}'), '\\\\', '') AS columna
    FROM tablewithinputjson ij;

    {"key": "somehashvalue","columns": [["Event:2014-03-26 00:29:13+0200:json",{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"},1395786553381001],["Event:2014-03-26 00:29:13+0200:","",1395786553381001]]}

    But here again, 1395786553381001 cannot be mapped to a Hive column, since it appears after a ',' and not after a ':'.
    (I could add "test": before 1395786553381001, but I do not want to customize the input data, since a) too much customization is something I am not comfortable with, b) it does not seem to be a good solution, and c) it would unnecessarily waste my Hadoop cluster's space and time.)

    Not to confuse things any further: I am not able to come up with a Hive table format that fully parses and maps all the fields in the original JSON snippet.

    • jeromebanks says:

      You see, that is why you should use these UDF’s instead of the Hive JSON SerDe’s. The point of the article is that Brickhouse offers an alternative approach, which makes interpreting complex structures easier.

      For this particular issue, why don’t you post onto http://stackoverflow.com . You can get help with your particular problem there.

  3. Sean Roberts says:

    I’ve tested with your exact queries to rule out any other problems and am getting this error:

    java.lang.NoClassDefFoundError: org/joda/time/format/ISODateTimeFormat

    Any clues? I’ve attempted to manually ‘add jar joda-time-2.4.jar’ with no luck

    • jeromebanks says:

      There is a dependency on Joda-Time 2.3. Make sure that jar is in the same directory as the brickhouse jar.

      • Sean Roberts says:

        Odd. Still seeing issues:

        ```
        ADD JAR joda-time-2.3.jar;
        ADD JAR brickhouse-0.7.0.jar;
        CREATE TEMPORARY FUNCTION from_json AS 'brickhouse.udf.json.FromJsonUDF';

        select
        from_json(' { "key1":[0,1,2], "key2":[3,4,5,6], "key3":[7,8,9] } ',
        'map<string,array<int>>')
        from mytable;
        Exception in thread "main" java.lang.NoClassDefFoundError: org/joda/time/format/ISODateTimeFormat
        ```

      • jeromebanks says:

        Use the HIVE_AUX_JARS_PATH environment variable to point to the directory containing the jar, or set hive.aux.jars.path in your hive-site.xml file.

        If you can't set either of those, use the Hive 'set' command to set the hive.aux.jars.path variable from the Hive CLI.

  4. terry says:

    What is the first parameter to the from_json function? Could it be a Hive table name?

    I have already stored one JSON document in a Hive table and want to split the array into columns.

    select from_json(hive_table, named_struct("obj", array(named_struct("description", string, "timestamp" string, "prefix", string, "tags", "", "encodedRowkey", string)))) from hive_table;

    but it returns: FAILED: SemanticException Line 0:-1 Invalid table alias or column reference 'string': (possible column names are: json)

    • jeromebanks says:

      Terry, the first argument needs to be a string containing the JSON to be parsed. If your table is simply JSON text strings delimited by '\n', create an external table with only one column, which is your JSON string.

  5. Max says:

    first of all thanks for those UDFs!

    I have, hopefully, a simple question. I have a JSON like this:
    [{"a":{"key1":"value1","key2":"value2"}}]
    which seems to be valid JSON. Is it possible, using any of the UDFs above, to get this result:
    key1 | key2 — just columns, not a result row 🙂
    "value1" | "value2"

    thanks a lot!

    • jeromebanks says:

      Max,
      Something like
      SELECT map_keys( mp[0].a ), map_values( mp[0].a )
      FROM
      ( SELECT from_json( json, 'array<struct<a:map<string,string>>>') as mp
      FROM mutable
      ) jsp;
      should work if I understand your question. Just define a type for the JSON schema, and then access the hive struct elements.

      • Max says:

        thanks for such a quick reply! I guess I’ve got an idea, but something is weird… Just played a bit to trace what is wrong, and… when I run this script:

        SELECT
        from_json( json, 'array<struct<a:map<string,string>>>') as mp
        FROM

        (
        SELECT
        '[{"a":{"key1":"value1","key2":"value2"}}]' as json
        FROM logs
        ) mutable

        everything works well, I get the results… But when I change "a" to "Items", nothing works. I also see no meaningful exception message to explain what makes the difference (ERROR beeswax.BeeswaxServiceImpl: Caught BeeswaxException), so this script:

        SELECT
        from_json( json, 'array<struct<Items:map<string,string>>>') as mp
        FROM

        (
        SELECT
        '[{"Items":{"key1":"value1","key2":"value2"}}]' as json
        FROM logs
        ) mutable

        does not work anymore… Could it be a strange bug in the from_json UDF?

      • jeromebanks says:

        Try 'array<struct<items:map<string,string>>>'

  6. Max says:

    Regarding the "Items" issue: just changing "Items" to "items" in the type string does not help… but when I change "Items" to "items" in the JSON as well, then it starts working… Since I cannot fix the JSON to use lower case, I'll have to lowercase it in the Hive query… Thanks!

  7. Sri says:

    Can the UDF read a multiline JSON file, i.e. where each record spans multiple lines?

  8. andy says:

    Hello everybody,

    I have a JSON object as the following:

    '{"key1":"value1","key2":"value2","key3":"value3","key4":[["0","1","nnn"],["1","3","mmm"],["1","3","ggg"],["1","5","kkk"],["4","5","ppp"]]}'

    I would like to convert this into a map object using from_json. I tried the following:

    select from_json('{"key1":"value1","key2":"value2","key3":"value3","key4":[["0","1","nnn"],["1","3","mmm"],["1","3","ggg"],["1","5","kkk"],["4","5","ppp"]]}', 'map<string,array<array<string>>>') from my_table limit 1;

    which returns:
    {"key1":[],"key3":[],"key2":[],"key4":[["0","1","nnn"],["1","3","mmm"],["1","3","ggg"],["1","5","kkk"],["4","5","ppp"]]}

    As one can see, the function correctly converts the value of "key4", but it does not recognize the values of the other keys; they are all empty.

    If I try the following, the values of all the keys are recognized except for "key4":

    select from_json('{"key1":"value1","key2":"value2","key3":"value3","key4":[["0","1","nnn"],["1","3","mmm"],["1","3","ggg"],["1","5","kkk"],["4","5","ppp"]]}', 'map<string,string>') from my_table limit 1;
    {"key1":"value1","key3":"value3","key2":"value2","key4":null}

    What can I do so that my map object contains all the values from my json object?

    • jeromebanks says:

      Andy, please check out my response on Stack Overflow. Basically, you need to define the schema as a named_struct and specify the types of each of the keys independently.

  9. venkat says:

    This blog is very helpful to me, thank you, and I hope it helps people who don't know Hive JSON.

  10. Pingback: Hadoop Developer Interview Questions for Experienced – Part1 – BigData and Scala Learning Blog

  11. Pingback: Hadoop Interview Question by SudheerThulluri | Hadoop Reference Links

  12. longjiang Yang says:

    This blog is very helpful and the library of UDFs is very useful.
    One quick question/feedback on named_struct: would it be possible to have it self-discover the schema of the JSON structure? I have JSON documents where I may not know the available keys beforehand, and the values could be of different types (arrays, strings, maps), so I can't use from_json or named_struct.

  13. Suyash singh says:

    Hi,
    Thanks for this guide. Very helpful!

    I want to save the output of my to_json function into a JSON file. Could you please help me with this?

  14. VK says:

    Hi
    I am looking forward to using the UDFs you provide. I have an RDBMS table that I need to bring into Hadoop. The source table has one column that is JSON and other columns of primitive types. A sample of the data looks like this:
    {
    "col1" : "INCOMING",
    "col2" : "2",
    "col3" : "test",
    "col4" : "LBHON0540963-073"
    }
    ,12 , 20 , lncell ,test|12|go,1496729457607

    where the first column is a JSON string and the rest of the columns are of mixed types (string, int, etc.), the last column being the timestamp. With the other columns mixed in with the JSON string, I am facing a challenge getting it into Hadoop.
    Below are the things I have tried:
    1. Tried with just the JSON column by creating an Avro table in Hadoop, and it works fine, but since I need to Sqoop the records on a daily basis, I will have to use the timestamp column.
    2. Tried to use the JSON SerDe (https://github.com/rcongiu/Hive-JSON-Serde) but was getting NULLs.
    Would appreciate your guidance on handling this kind of scenario using your UDFs.
    Regards
    Vinay

    • jeromebanks says:

      Vinay,
      This is the case where a UDF makes more sense than a SerDe, because a SerDe must interpret the entire row. In your case, simply create a TEXT table in Hive, delimited by “,”. Then use “json_map”, or “from_json” UDF’s when selecting from the first column of the table.

      CREATE TABLE source(
      json STRING,
      x INT, y INT … )
      DELIMITED BY "," …

      SELECT json_map( json ) FROM source ….

      Something like that should work.

  15. VK says:

    Thanks so much Jerome for such a prompt reply 🙂
    Will try as you have suggested and update here.
    Thanks again
    Vinay
