Use collect to avoid the self-join

Hive is the 5GL for MapReduce

One source of confusion in describing what Brickhouse is about is that Hive serves multiple purposes, for different people. It is analogous to SQL, but that is the trick. There are the traditional BI analysts, who want a simple syntax, easily understood by “non-technical” users doing ad-hoc queries based on their current needs. Then there are the data-developers who use Hive primarily as a DSL, to quickly write data-flows and ETL transformations. Brickhouse can help the former by providing some extra functionality not currently included in the Hive distribution. The primary audience, however, is the latter, who want to use Hive as a prototyping language for complex MapReduce dataflows. What was previously a difficult task, writing straight Hadoop MapReduce code, can now be developed with a handful of Hive queries.

Compare C to C++, and C++ to Java. On the surface they share a lot of syntax. But designing a real object-oriented C++ program is far different from writing traditional C code. And writing a garbage-collected Java program running in a JVM is far different from writing a C++ program.

Writing a Hive program isn’t writing SQL; it’s defining a Map-Reduce data pipeline. Optimizers will always continue to improve, but there will always be things which haven’t been optimized ‘yet’. Think about all the optimizations which were needed in the earlier days of Java to get adequate performance. Brickhouse is intended to help bridge that gap. It allows you to leverage all the Hive goodness and optimizations, and then handle all the edge-cases that we see in real-world production environments.

Collapse your rows

A common use-case I’ve noticed is that one has a table with one row per value of each type, and one wants to collapse multiple rows together, to get one row with multiple columns. Hive provides UDTFs to easily explode one row into many, but no good way to collapse many rows into one. The common way to solve this is with a self-join, as in the query below.
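The self-join query did not survive here, so as an illustration, here is a sketch of the pattern using a hypothetical prefs table with columns (user_id, pref_type, pref_value), where each preference is its own row:

```sql
-- Self-join approach: join the table to itself once per
-- column being collapsed, filtering each side by type.
SELECT a.user_id,
       a.pref_value AS email,
       b.pref_value AS phone
FROM prefs a
JOIN prefs b ON ( a.user_id = b.user_id )
WHERE a.pref_type = 'email'
  AND b.pref_type = 'phone';
```

Every additional column to collapse means another join against the same table.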

This makes sense to someone with a background in SQL and relational databases, but it doesn’t really make sense with MapReduce and big data. The join causes a sort (or multiple sorts) to occur, and effectively doubles the input to your job. For large tables, this can be very expensive.

Avoid the Join

A better way is to use collect() to aggregate the values into a map keyed by type, and then define specific columns as lookups into that map. Brickhouse provides a collect UDAF which returns either an array or a map, depending on whether one passes in one parameter or two.
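Continuing with the same hypothetical prefs table (user_id, pref_type, pref_value), the collect() approach might look like this — a single aggregation builds a map keyed by type, and map lookups become the columns:

```sql
-- collect() approach: one GROUP BY replaces the self-join.
SELECT user_id,
       pref_map['email'] AS email,
       pref_map['phone'] AS phone
FROM (
    SELECT user_id,
           collect( pref_type, pref_value ) AS pref_map
    FROM prefs
    GROUP BY user_id
) p;
```

The table is read and sorted once, no matter how many columns are collapsed.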

Yes, this requires an aggregation, but it avoids sorting the table more than once, which can be quite expensive, especially when many columns are collapsed. Map-side aggregation may also help reduce the size of the sort, especially if all rows for an id are located close together. Yes, Hive may eventually introduce an optimization so that a self-join is handled more efficiently, but that is exactly the point: Brickhouse provides the tools for a data-developer to reinforce the pipeline today, while leveraging all the power which Hive gives us.


Let’s start off with collect()

One of the most fundamental functions in Brickhouse is the collect UDAF. Based on Hive’s collect_set, and similar to methods in Ruby and other languages, collect aggregates multiple rows into either an array or a map. It is the opposite of explode(): Hive’s UDTFs like explode generate multiple rows from a collection, while collect() collapses several rows back into one.

For a simple example, suppose you just want to gather records together for easy access. Imagine you have a table describing social activity; call it the action table, with four columns: user_id, action, actor_id, and timestamp. You can aggregate a user’s associated actors together with something like …

select user_id, collect( actor_id )
from action
group by user_id;
....
1234 [ 23,34,12,54 ]
4564 [ 98, 87, 987, 4435, 343, 34, 4435 ]

This returns each user_id along with the list of its actors. (Unlike collect_set, collect keeps duplicates, so an actor appearing in multiple rows for a user will appear multiple times in the array, as 4435 does above.)

Now let’s try something a little more complicated. Say you want to associate the actors with the actions. We can run something like…

select user_id,
       collect( action, actor_list ) action_map
from (
    select user_id, action, collect( actor_id ) actor_list
    from action
    group by user_id, action
) al
group by user_id;
.....
1234 { "RETWEET":[23,34], "MENTION":[12,54] }
4564 { "RETWEET":[98,987,4435], "MENTION":[87,343,34,4435] }

In future posts, we’ll discuss how we can apply other Brickhouse UDFs to transform these arrays and maps into any form that you may need.


Welcome to Brickhouse Confessions

Hey, folks, welcome to the new blog “Brickhouse Confessions”. The purpose is mainly to provide guidance on using the new Brickhouse library of Hive extensions, but it will also be a forum to discuss big data and new technologies.
