Hive and JSON made simple

It seems that JSON has become the lingua franca of the Web 2.0 world. It’s simple, extensible, easily parsed by browsers, easily understood by humans, and so on. It’s no surprise, then, that a lot of our Big Data ETL tasks end up extracting JSON from some external system, aggregating and transforming it, and then outputting JSON to be consumed by various other systems.

Hive already has some built-in mechanisms to deal with JSON, but honestly, I think they are somewhat awkward. The `get_json_object` UDF allows you to pull specific fields out of a JSON string, but requires you to specify the field with a JSONPath-style expression, which can get hairy, and the output is always a string. There is also the `json_tuple` UDTF, but this amounts to performing multiple `get_json_object` calls at once. These UDFs assume that you know what fields exist in the JSON ahead of time, and don’t handle structures like arrays and maps very well.
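
For reference, here’s a minimal sketch of that built-in route (the column `json_col` and the field names are just for illustration):

-- get_json_object: one path expression per call, and the result is always a string
select get_json_object(json_col, '$.user.id')   as user_id,
       get_json_object(json_col, '$.user.name') as user_name
from mytable;

-- json_tuple: several top-level keys at once, but still all strings
select jt.id, jt.name
from mytable
lateral view json_tuple(json_col, 'id', 'name') jt as id, name;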

There are several JSON SerDe’s which attempt to simplify dealing with JSON, and they can sometimes help, but often they are not what we want. With a SerDe you generally define a table whose single column is a JSON string, and often that’s not what our data looks like: we have JSON alongside other columns (perhaps an ID or timestamp), or multiple JSON columns to deal with. SerDe’s also usually require that the JSON spec be known in advance, and can be awkward with recursive structures.

There also isn’t an easy way to generate JSON from Hive. The Hive code quickly becomes an ugly mess of `concat` functions, and with nested structures it is often not clear whether all the curly braces and brackets are balanced. It’s frustrating to waste hours of cluster time (and real time!) producing a large data-set which is ‘not-quite’ valid JSON.
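
To picture the problem, here’s a sketch of the hand-rolled approach (the columns `id` and `score` are hypothetical); it holds together only until a value needs escaping or the structure nests another level:

-- every quote, comma and brace is managed by hand; one missed bracket and the output is invalid
select concat('{ "id":"', id, '", "score":', cast(score as string), ' }') as json_out
from mytable;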

To make life easier, Brickhouse provides several UDFs for parsing and generating JSON to and from Hive structures. Once parsed into a Hive type, data can be manipulated as one normally would (with all the other UDFs in Brickhouse!) and then output back into JSON.

json_split and json_map

Let’s start with the simple, specific cases. Often we’ll have a string containing a JSON array, or a JSON map, and we simply want to interpret it as a Hive list or map. That’s what `json_split` and `json_map` do. (It’s called `json_split` because it’s similar to the `split` method that turns a string into an array, except that we’re dealing with JSON rather than a plain delimiter.)


select json_val
from
mytable
lateral view explode(
json_split(' [ "val1", "val2", "val3" ] ')
) x as json_val;
Output is
val1
val2
val3


Without any extra arguments, it assumes that the values you want are strings. However, any primitive Hive type is possible. Pass the type you would like the values interpreted as in the second argument, and the returned values will be of that type (for example ‘bigint’ for `json_split`, or ‘string,double’ for `json_map`, which takes a key type and a value type).
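
For instance, a quick sketch of `json_split` with an explicit type argument, assuming the array holds integer values:

select json_val
from mytable
lateral view explode(
json_split(' [ 1, 2, 3 ] ', 'bigint')
) x as json_val;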


select
json_key,
json_val
from
mytable
lateral view explode_map(
json_map(' { "key1":1.1, "key2":34.5, "key3":6.7 }', 'string,double')
) x as json_key, json_val;
Output is:
key1 1.1
key2 34.5
key3 6.7


from_json

More often than not, however, we have something a little more complex. With `from_json` we can parse any arbitrary JSON schema, but to do so we need to pass in a description of that schema.

There are two ways to do this. One way is to pass, as the second argument, a template object of the type you would like returned. This is usually done using the `map()` or `array()` UDFs to create an empty object whose type can be inspected.


select
from_json(' { "key1":[0,1,2], "key2":[3,4,5,6], "key3":[7,8,9] } ',
map( "", array( cast(0 as int))))
from mytable;


This turned out to be a little awkward to explain to people, however, so we introduced a second form where you just pass in the Hive type string for the desired return type (e.g. ‘array<map<string,double>>’).


select
from_json(' { "key1":[0,1,2], "key2":[3,4,5,6], "key3":[7,8,9] } ',
'map<string,array<int>>')
from mytable;


JSON maps and named_structs

In JSON, one can have maps where the values are of different types: one value in the map could be a string, and another could be an array. To support this, sometimes we don’t want to interpret a JSON map as a Hive map, but rather as a Hive `named_struct`. Using `named_struct` in this way lets us map any arbitrary JSON schema onto a Hive type.


select
from_json(' { "key1":"string value", "key2":2.1, "key3":560000 }',
named_struct("key1", "", "key2", cast(0.0 as double), "key3", cast(1 as bigint) ) )
from mytable;


to_json

For the round trip, we just need to convert back into valid JSON. To do this, we simply pass in the object we want printed out; its type is inspected at query parse time to decide the format of the output.


select
to_json( array( 1, 2, 3, 4) )
from mytable;
Output is
" [ 1,2,3,4 ]"


Again, named_structs are useful for generating JSON maps with heterogeneous value types, and the Hive UDF `named_struct` makes it easy to construct them on the fly.


select
to_json( named_struct("key1", 0, "key2", "a string val", "key3", array( 4,5,6), "key4", map("a",2.3,"b",5.6) ))
from mytable;
Output is
' { "key1":0, "key2":"a string val", "key3":[4,5,6], "key4":{"a":2.3, "b":5.6} } '

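Putting the two halves together, here is a round-trip sketch (the column `json_col` and its fields are hypothetical): parse with `from_json`, reshape with ordinary Hive expressions, and serialize the result with `to_json`.

select to_json(
  named_struct("name", s.name,      -- carried through unchanged
               "total", s.a + s.b)  -- derived field computed in Hive
) as json_out
from (
  select from_json(json_col,
           named_struct("name", "", "a", cast(0 as bigint), "b", cast(0 as bigint))) as s
  from mytable
) parsed;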

Converting to and from CamelCase and snake_case

There can be some further frustrations, however. Consumers and producers of JSON are often of a Javascript ilk, and use a particular naming style called ‘CamelCase’ (where the upper-case letters in your identifiers add humps to the camel). Choosing between CamelCase and the alternative, underscore or ‘snake’ case (not sure where the snakes come from), is a religious debate which won’t be resolved in any of our lifetimes.

This wouldn’t be a problem, except that Hive types (and HiveQL) are case-insensitive. There is no difference to the Hive parser between ‘value’, ‘VALUE’, ‘Value’, ‘VaLuE’, and ‘vAlUe’. So we can never represent a CamelCase JSON schema directly as a Hive type.

As a cheap workaround, `to_json` and `from_json` take an extra boolean argument, which is a flag to convert to and from CamelCase. (A bit of a hack, perhaps, but let me know if you can think of something more elegant.) This lets us output in a format that those Javascript developers will love, and have it map cleanly onto our Hive types.


select
to_json(
named_struct("camel_case", 1,
"big_double", 9999.99,
"the_really_cool_value", "a string val",
"another_array", array( 4,5,6),
"random_map", map("a",2.3,"b",5.6) ), true)
from mytable;
Output:
{"camelCase":1,"bigDouble":9999.99,"theReallyCoolValue":"a string val","anotherArray":[4,5,6],"randomMap":{"b":5.6,"a":2.3}}


I hope that helps clear up some confusion about how to use the JSON UDFs in Brickhouse. In our organization, dealing with JSON in Hive was possible, but extremely ugly and very brittle in practice. With the JSON UDFs, I believe it’s now just a little better.


32 Responses to Hive and JSON made simple

  1. Costa says:

    Thanks for the examples.

    Would you be able to provide an example parsing an array of json objects? (They are all uniform.)
    i.e.

    {
    some_val: [{"key1":"object1", "key2":1.0}, {"key1":"object2", "key2":2.0}]
    }

    • jeromebanks says:

      You can either create a template object with the named_struct, array, and map UDF’s, or specify a Hive type string:
      select from_json(json, named_struct("some_val", array(named_struct("key1", "", "key2", cast(0.0 as double)))))
      from json_table;

      • Costa says:

        Thanks for the quick reply.

        It looks like the camel case example you have here is out of date. I get an error when trying to include the 3rd parameter. Attempting to use camel case results in a null result from the query. This is when using both 0.6.0 release and a 0.7.0-SNAPSHOT. Note: I had to do my own build of the snapshot with a fix on line 161 to prevent a NullPtrException.

        Here’s the error; the udf checks for only 2 arguments and spits this out.

        FAILED: SemanticException [Error 10014]: Line 1:7 Wrong arguments ‘true’: from_json expects a JSON string and a template object.

      • jeromebanks says:

        Could you file a JIRA ticket against the github project? We’ll get it fixed in the next release.

  2. nick says:

    I have a json –
    original json snippet –
    {"key": "somehashvalue","columns": [["Event:2014-03-26 00\\:29\\:13+0200:json","{\"user\":{\"credType\":\"ADDRESS\",\"credValue\":\"01:AA:A4:G1:HH:UU\",\"cAgent\":null,\"cType\":\"ACE\"},\"timestamp\":1395786553,\"sessionId\":1395785353,\"className\":\"Event\",\"subtype\":\"CURRENTLYACTIVE\",\"vType\":\"TEST\",\"vId\":1235080,\"eType\":\"CURRENTLYACTIVE\",\"eData\":\"1\"}",1395786553381001],["Event:2014-03-26 00\\:29\\:13+0200:","",1395786553381001]]}

    I tried to use JSON SerDes to parse the above JSON into my Hive columns. However, 1395786553381001 above is not present in a format which a SerDe can map to a Hive column (since Hive understands JSON columns/values present after a ':').

    So instead I created a table –
    CREATE TABLE mytesttable (
    key string,
    columns array<array>
    )
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

    LOAD DATA LOCAL INPATH '/home/user/testsample.json'
    OVERWRITE INTO TABLE mytesttable;

    select columns[0][1] from mytesttable;
    gives –
    {"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"}

    The above appears clean, but I also need columns[*][2], i.e. in a JSON Hive column, for further transformations.

    I wrote a regex hive query to cleanse the original Json (assume it is present in a table tablewithinputjson)

    CREATE TABLE internaltablewithcleanjson (
    columna string)
    ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe';

    INSERT OVERWRITE TABLE internaltablewithcleanjson
    SELECT
    REGEXP_REPLACE(REGEXP_REPLACE(REGEXP_REPLACE(ij.columna, '["][{]', '{'), '[}]["]', '}'), '\\\\', '') AS columna
    FROM tablewithinputjson ij;

    {"key": "somehashvalue","columns": [["Event:2014-03-26 00:29:13+0200:json",{"user":{"credType":"ADDRESS","credValue":"01:AA:A4:G1:HH:UU","cAgent":null,"cType":"ACE"},"timestamp":1395786553,"sessionId":1395785353,"className":"Event","subtype":"CURRENTLYACTIVE","vType":"TEST","vId":1235080,"eType":"CURRENTLYACTIVE","eData":"1"},1395786553381001],["Event:2014-03-26 00:29:13+0200:","",1395786553381001]]}

    But here again, 1395786553381001 cannot be mapped to a Hive column, since it appears after a ',' and not after a ':'.
    (I could add "test": before 1395786553381001, but I do not want to customize the input data, since a) too much customization is something I am not comfortable with, b) it does not seem to be a good solution, and c) it would unnecessarily waste my Hadoop cluster's space and time.)

    Not to confuse things any further: I am not able to come up with a Hive table format that fully parses and maps all the fields in the original JSON snippet.

    • jeromebanks says:

      You see, that is why you should use these UDF’s instead of the Hive JSON SerDe’s. The point of the article is that Brickhouse offers an alternative approach, which makes interpreting complex structures easier.

      For this particular issue, why don’t you post onto http://stackoverflow.com . You can get help with your particular problem there.

  3. Sean Roberts says:

    I’ve tested with your exact queries to rule out any other problems and am getting this error:

    java.lang.NoClassDefFoundError: org/joda/time/format/ISODateTimeFormat

    Any clues? I’ve attempted to manually ‘add jar joda-time-2.4.jar’ with no luck

    • jeromebanks says:

      There is a dependency on Joda-Time 2.3. Make sure that jar is in the same directory as the brickhouse jar.

      • Sean Roberts says:

        Odd. Still seeing issues:

        ```
        ADD JAR joda-time-2.3.jar;
        ADD JAR brickhouse-0.7.0.jar;
        CREATE TEMPORARY FUNCTION from_json AS 'brickhouse.udf.json.FromJsonUDF';

        select
        from_json(' { "key1":[0,1,2], "key2":[3,4,5,6], "key3":[7,8,9] } ',
        'map<string,array<int>>')
        from mytable;
        Exception in thread "main" java.lang.NoClassDefFoundError: org/joda/time/format/ISODateTimeFormat
        ```

      • jeromebanks says:

        Use the HIVE_AUX_JARS_PATH environment variable to point to the directory containing the jar, or set hive.aux.jars.path in your hive-site.xml file.

        If you can't set either of those, use the Hive 'set' command to set the hive.aux.jars.path variable from the Hive CLI.

  4. terry says:

    What is the first parameter to the from_json function? Could it be a Hive table name?

    I have already stored one JSON document in a Hive table and want to split the array into columns.

    select from_json(hive_table, named_struct("obj", array(named_struct("description", string, "timestamp" string, "prefix", string, "tags", "", "encodedRowkey", string)))) from hive_table;

    but it returns: FAILED: SemanticException Line 0:-1 Invalid table alias or column reference 'string': (possible column names are: json)

    • jeromebanks says:

      Terry, the first argument needs to be a string containing the JSON to be parsed. If your table is simply JSON text strings delimited by '\n', create an external table with only one column, which is your JSON string.

  5. Max says:

    first of all thanks for those UDFs!

    I have, hopefully, a simple question. I have a JSON like this:
    [{"a":{"key1":"value1","key2":"value2"}}]
    which seems to be valid JSON. Is it possible, using any of the UDFs above, to get this result:
    key1 | key2 — just columns, not a result row 🙂
    "value1" | "value2"

    thanks a lot!

    • jeromebanks says:

      Max,
      Something like
      SELECT map_keys( mp[0].a ), map_values( mp[0].a )
      FROM
      ( SELECT from_json( json, 'array<struct<a:map<string,string>>>') as mp
      FROM mutable
      ) jsp;
      should work if I understand your question. Just define a type for the JSON schema, and then access the hive struct elements.

      • Max says:

        thanks for such a quick reply! I guess I’ve got an idea, but something is weird… Just played a bit to trace what is wrong, and… when I run this script:

        SELECT
        from_json( json, 'array<struct<a:map<string,string>>>') as mp
        FROM

        (
        SELECT
        '[{"a":{"key1":"value1","key2":"value2"}}]' as json
        FROM logs
        ) mutable

        everything works well, I get the results… But when I change "a" to "Items", nothing works. I also see no meaningful exception message to explain what makes the difference (ERROR beeswax.BeeswaxServiceImpl: Caught BeeswaxException), so this script:

        SELECT
        from_json( json, 'array<struct<Items:map<string,string>>>') as mp
        FROM

        (
        SELECT
        '[{"Items":{"key1":"value1","key2":"value2"}}]' as json
        FROM logs
        ) mutable

        does not work anymore… Could it be a strange bug in the from_json UDF?

      • jeromebanks says:

        Try 'array<struct<items:map<string,string>>>'

  6. Max says:

    Regarding the "Items" issue: just changing "Items" to "items" in the type string does not help… but when I change "Items" to "items" in the JSON as well, then it starts working… Since I cannot fix the JSON to use lower case, I'll have to lowercase it in the Hive query… Thanks!

  7. Sri says:

    Can the UDF read a multiline JSON file, i.e. where each record spans multiple lines?

  8. andy says:

    Hello everybody,

    I have a JSON object as the following:

    '{"key1":"value1","key2":"value2","key3":"value3","key4":[["0","1","nnn"],["1","3","mmm"],["1","3","ggg"],["1","5","kkk"],["4","5","ppp"]]}'

    I would like to convert this into a map object using from_json. I tried the following:

    select from_json('{"key1":"value1","key2":"value2","key3":"value3","key4":[["0","1","nnn"],["1","3","mmm"],["1","3","ggg"],["1","5","kkk"],["4","5","ppp"]]}', 'map<string,array<array<string>>>') from my_table limit 1;

    which returns:
    {"key1":[],"key3":[],"key2":[],"key4":[["0","1","nnn"],["1","3","mmm"],["1","3","ggg"],["1","5","kkk"],["4","5","ppp"]]}

    As one can see, the function correctly converts the value of "key4", but it does not recognize the values of the other keys; they are all empty.

    If I try the following, the values of all the keys are recognized except for "key4":

    select from_json('{"key1":"value1","key2":"value2","key3":"value3","key4":[["0","1","nnn"],["1","3","mmm"],["1","3","ggg"],["1","5","kkk"],["4","5","ppp"]]}', 'map<string,string>') from my_table limit 1;
    {"key1":"value1","key3":"value3","key2":"value2","key4":null}

    What can I do so that my map object contains all the values from my json object?

    • jeromebanks says:

      Andy, please check out my response on Stack Overflow. Basically, you need to define the schema as a named_struct and specify the types of each of the keys independently.

  9. venkat says:

    This blog is very helpful to me, thank you, and I hope it helps people who don't know Hive JSON.

  10. Pingback: Hadoop Developer Interview Questions for Experienced – Part1 – BigData and Scala Learning Blog

  11. Pingback: Hadoop Interview Question by SudheerThulluri | Hadoop Reference Links

  12. longjiang Yang says:

    This blog is very helpful and the library of UDFs is very useful.
    One quick question/feedback on named_struct: would it be possible to have it self-discover the schema of the JSON structure? I have JSON documents where I may not know the available keys beforehand, and the values could be of different types (arrays, strings, maps), so I can't use from_json or named_struct.

  13. Suyash singh says:

    Hi,
    Thanks for this guide. Very helpful!

    I want to save the output of my to_json function into a JSON file. Could you please help me with this?

  14. VK says:

    Hi
    I am looking forward to using the UDFs you provide. I have an RDBMS table that I need to bring into Hadoop. The source table has one column that is JSON and other columns of primitive types. A sample of the data looks like this:
    {
    "col1" : "INCOMING",
    "col2" : "2",
    "col3" : "test",
    "col4" : "LBHON0540963-073"
    }
    ,12 , 20 , lncell ,test|12|go,1496729457607

    where the first column is a JSON string and the rest of the columns are of mixed types (string, int, etc.), the last column being the timestamp. With the other columns mixed in with the JSON string, I am facing a challenge getting it into Hadoop.
    Below are the things I have tried:
    1. Tried with just the JSON column by creating an Avro table in Hadoop, and it works fine, but since I need to Sqoop the records on a daily basis, I will have to use the timestamp column.
    2. Tried to use the JSON SerDe (https://github.com/rcongiu/Hive-JSON-Serde) but was getting NULLs.
    Would appreciate your guidance on handling this kind of scenario using your UDFs.
    Regards
    Vinay

    • jeromebanks says:

      Vinay,
      This is the case where a UDF makes more sense than a SerDe, because a SerDe must interpret the entire row. In your case, simply create a TEXT table in Hive, delimited by “,”. Then use “json_map”, or “from_json” UDF’s when selecting from the first column of the table.

      CREATE TABLE source(
      json STRING,
      x INT, y INT … )
      DELIMITED BY "," …

      SELECT json_map( json ) FROM source ….

      Something like that should work.

  15. VK says:

    Thanks so much Jerome for such a prompt reply 🙂
    Will try as you have suggested and update here.
    Thanks again
    Vinay
