Let’s start off with collect()

One the most fundamental functions in Brickhouse is the collect UDF. Based on collect_set, and similar to methods in Ruby and other languages, collect aggregates records into either an array or map from multiple rows. It is the opposite of an explode(). Hive has a mechanism to generate multiple rows from a collection with UDTFs like explode. Collect() helps you do the reverse, and collapse several rows into one.

For a simple example, suppose you just want to organize records together, for easy access. Imagine you have a table describing social activity; call it the action table with three columns user_id,action,actor_id and timestamp. You can aggregate a user’s associated actors together with something like …

select user_id, collect( actor_id )
from action
group by user_id;
1234 [ 23,34,12,54 ]
4564 [ 98, 87, 987, 4435, 343, 34, 4435 ]

This returns your user_id along with the list. (This will create duplicates if there are multiple actors per user).

Now let’s try something a little more complicated. Say you want to associate the actors with the actions. We can run something like…

select user_id,
collect( action, actor_list ) action_map
from (
select user_id, action, collect( actor) actor_list
from action
group by user_id, action ) al
group by user_id;
1234 { "RETWEET":[23,34], "MENTION":[12,54] }
4564 { "RETWEET":[98,987,4435], "MENTION":[87,343,34,4435] }

In future posts, we’ll discuss how we can apply other Brickhouse UDFs to transform these arrays and maps into any form that you may need.

This entry was posted in Uncategorized and tagged , , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s