Author Archives: jeromebanks

Defeat the Titans with salt !!!

Last time, we discussed how you could use Brickhouse’s sketch set implementation to scalably handle counting uniques. Even with sketch sets, however, there are times when skewed or unbalanced datasets can wreak havoc with your jobs. Even when Hive uses … Continue reading

Posted in Uncategorized | Leave a comment

When you absolutely have to do a count distinct

Previously, I discussed how horrible it was to attempt to perform a count distinct in Hive; how it would cause you to sort the universe, and then wait until the end of time for a single reducer to complete. The … Continue reading

Posted in Uncategorized | 1 Comment

Hive and JSON made simple

It seems that JSON has become the lingua franca for the Web 2.0 world. It’s simple, extensible, easily parsed by browsers, easily understood by humans, and so on. It’s no surprise then that a lot of our Big Data ETL … Continue reading

Posted in Big Data, Hive | 32 Comments

Brickhouse version 0.6.0 released !!!

We are proud to announce the 0.6.0 release of Brickhouse. It is available for download via the Sonatype Maven repositories, or as a pre-built bundle on the Downloads page. This release is mostly a bug-fix release, with some fixes to … Continue reading

Posted in Uncategorized | 2 Comments

Using sketch_set for reach estimation

A common problem I’ve seen in MapReduce for advertising analytics is calculating the number of unique values in a large data set. Usually the unique value represents a viewer, or cookie, or user. From a business end, it matters a … Continue reading

Posted in Uncategorized | Leave a comment

Squash the Long Tail with Brickhouse’s HBase UDFs

A problem we often face in Data Science and Big Data is that we can express solutions in simple and elegant terms that work well with toy or limited datasets, but can never scale to … Continue reading

Posted in Uncategorized | 4 Comments

Exploding multiple arrays at the same time with numeric_range

Hive allows you to emit all the elements of an array into multiple rows using the explode UDTF, but there is no easy way to explode multiple arrays at the same time. Say you have a table my_table which contains … Continue reading

Posted in Uncategorized | 7 Comments