Use collect to avoid the self-join

Hive is the 5GL for MapReduce

One of the confusions in describing what Brickhouse is about, is that Hive has multiple purposes, and different uses for different people. It is analogous to SQL, but that is the trick. There are the traditional BI analytics, where one wants a simple syntax, easily understood by “non-technical” analysts, who want to do ad-hoc queries based on their current needs. Then there are also data-developers who use Hive primarily as a DSL, to quickly write data-flows and ETL transformations. Brickhouse can help the former, by providing some extra functionality not currently included in the Hive distribution. The primary audience however is the later, who want to use Hive as prototyping language for complex MapReduce dataflows. What previously was a difficult task writing straight Hadoop MapReduce code, can be developed with a handful of Hive querys.

Compare to C and C++, and C++ to Java. On the surface they share a lot of syntactical . But designing a real object-oriented C++ program is far different from writing traditional C code. And writing a garbage-collected Java program running in a JVM is far different from writing a C++ program.

Writing a Hive program isn’t writing SQL; it’s defining a Map-Reduce data pipeline. Optimizers will always continue to improve, but there will always be things which haven’t been optimized ‘yet’. Think about all the optimizations which were needed in the earlier days of Java to get adequate performance. Brickhouse is intended to help bridge that gap. It allows you to leverage all the Hive goodness and optimizations, and then handle all the edge-cases that we see in real world production environments.

Collapse your rows

A common use-case I’ve noticed is that one has a table with a row for values of different types, and one wants to collapse multiple rows together, to get one row with multiple columns. Hive provides UDTF’s to easily explode multiple rows, but not really a good way to collapse them. The common way to solve this is by doing a self-join, as described in the query below.

This makes sense to someone with a background with SQL and relational databases, but doesn’t really make sense with MapReduce and BigData. The join causes a sort ( or multiple sorts ) to occur, and effectively doubles the input to your job. For large tables, this could be very expensive.

Avoid the Join

A better way to do this, is to use collect() to aggregate the values into a map keyed by type, and then define specific columns as indexes into the map. Brickhouse provides a collect UDAF which returns either an array or a map, depending upon if one passes in one paramater or two.

Yes, this requires aggregation, but avoids potentially resorting the table more than once, which can be quite expensive, especially if many columns are collapsed. Map-side aggregation may also help in reducing the size of the sort, especially if all rows for an id are located close together. Yes, we may eventually introduce an optimization in Hive, so that a self-join is handled more effectively, but that is exactly the point. Brickhouse provides the tools for a data-developer to re-inforce the pipeline, while leveraging all the power which Hive gives us.

This entry was posted in Uncategorized and tagged , , . Bookmark the permalink.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s