Exploding multiple arrays at the same time with numeric_range

Hive allows you to emit all the elements of an array into multiple rows using the explode UDTF, but there is no easy way to explode multiple arrays at the same time.

Say you have a table my_table which contains two array columns, both of the same size. (Say you had an ordered list of multiple values, possibly of different types). Then you need to have to explode the arrays, and have a row which contains the values from the two arrays.

This can be easily solved with Brickhouse’s numeric_range UDTF. It emits an integer according to a specified numeric range. It takes 1,2 or 3 arguments. For 1 argument, it emits an integer from 0 to n -1. For 2, it emits from the first argument to the second argument value – 1. For 3, it uses the third argument as an increment value.

The array_index UDF simply returns an array’s value at the i’th index. It is needed because currently Hive’s bracket [ ] operators support only constant values. ( See https://issues.apache.org/jira/browse/HIVE-1955 ). Brickhouse also contains a map_index UDF to return the value of a map for a particular key.

Advertisements
This entry was posted in Uncategorized. Bookmark the permalink.

7 Responses to Exploding multiple arrays at the same time with numeric_range

  1. SourabhP says:

    hi,
    Thanks for this post,
    Suppose I have complex XML which nested at multiple levels
    From this XML, I have extracted values using XPATH and stored those in a Hive view.

    For data exploding at level 1, the above method works perfectly.

    But suppose if we want to explode array at level 1 and level 2, how should we do that?
    (Since array size will be different in this case.)

    • jeromebanks says:

      Sourabh,
      I’m not sure I understand. I’m suggesting an approach where you need to explode multiple columns of array type in a table, and access the elements consistently.

      For nested structures, maybe you can just do multiple lateral views ? Hive should be able to do that. Could you post details to your question to stackoverflow, and we can see if Brickhouse can help out in that usecase ?

      • eljefewatson says:

        I’m interested in what I think is the same question, similar to what was answered on Stackoverflow (http://stackoverflow.com/questions/20667473/hive-explode-lateral-view-multiple-arrays?rq=1): I’m parsing XML and have , so I need to be able to access level1 name and the level2 names. The numeric_range() UDF using xpath(xmldata,’/tag/level1/level2/@name’) in a view works nicely giving all the level2 names, and I can use your FirstArrayUDF to get the “wow” from the 1st line. However, if the attribute values are missing, some of the arrays have varying lengths so I get the wrong attributes in the wrong rows. Also, if I have this:

        /

        It’s not obvious how to create the xpath and XML view query with a numeric_range() and array_index() UDFs to create the rows:

        one a
        one b
        two x
        two y

      • jeromebanks says:

        I’m not sure if I fully understand your question. I’m not very familiar with the xpath UDF ( or XPath in general, actually), but perhaps you need to do two different XPath queries; one for the level1 and one for the level2. Then you could do 2 lateral view explodes for each of the different arrays

        Something like

        SELECT array_index( level1, n1 ),
        array_index( level2, n2)
        FROM
        ( select xpath(<level1 queryy) as level1,
        xpath( ) as level2
        FROM mytable
        ) xp
        LATERAL VIEW numeric_range( size( level1 ) l1 AS n1
        LATERAL VIEW numeric_range( size(level2) l2 AS n2;

        Does that make sense ?

  2. KayPi' says:

    Please help with the following problem.

    myTable

    name id number
    Joe 20 72-11433.80-11427.14-6.66-R
    kay 115 3-547.86-547.86-0.00-R
    mary 85 12-0.49-0.49-0.00-R

    select * from myTable

    Joe 20 72-11433.80-11427.14-6.66-R
    kay 115 3-547.86-547.86-0.00-R
    mary 85 12-0.49-0.49-0.00-R

    select split(number,’\\-‘) from myTable

    [“72″,”11433.80″,”11427.14″,”6.66″,”R”]
    [“3″,”547.86″,”547.86″,”0.00″,”R”]
    [“12″,”0.49″,”0.49″,”0.00″,”R”]

    SELECT explode(split(number,’\\-‘)) FROM myTable
    72
    11433.80
    11427.14
    6.66
    R

    I want to know if there is a function that works like explode function
    but the desired output would be in columns NOT rows.
    such that:
    SELECT name, explode(split(number,’\\-‘)) FROM myTable

    Joe 72 11433.80 11427.14 6.66 R
    kay 3 547.86 547.86 0.00 R
    mary 12 0.49 0.49 0.00 R

    increase the number of columns NOT rows

    • jeromebanks says:

      Kay, Hive needs to know the size of the array ahead of time, because each row needs the same number of columns. With that constraint, you can use array_index or the [] syntax to access each value individually.

      SELECT array_index(split(…), 0), array_index(split(…), 1 ) , …etct..

  3. Pingback: Hive Explode / Lateral View multiple arrays | Zipkin Answers

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s