Last time, we discussed how you could use Brickhouse’s sketch set implementation to scalably handle counting uniques.
Even with sketch sets, however, there are times when skew or unbalanced datasets can reak havoc with your jobs. Even when Hive uses map-side aggregation to reduce output, there are cases where a few “titans” can have orders of magnitude more data than most individuals.
Consider, for example, use-cases in social media, (perhaps in social networks similar to Twitter), where a few very famous celebrities are associated with the bulk of the data, and most users don’t produce that match. Imagine you needed to count all the unique retweeters for all Twitter users. Certain big stars, like Lady Gaga, or Justin Bieber, can have many millions of retweeters. Most people, however, don’t have many at all. Using sketch sets, you could write a query like the following:
This works fine for many data-sets, but with Twitter, the output from the few titans causes problems. Even though we don’t need to resort, like we would with
count distinct, the job is still unbalanced, because one or two reducers will get this huge titanic output. These few reducers will be running long after the others have completed.
We can solve this problem, however, by taking advantage of the fact that we can merge sketches together. We can use a technique known as “salting” your query. We add a random number we call a “salt” to our grouping. Then we group our query by user_id and this newly random salt. This breaks up the work for these titans into different groupings, distributing the data across different reducers. Next we need to merge all those salted sketches to get the actual value. To do this, we use the
union_sketch UDAF in Brickhouse, which aggregates sketch_sets together:
Since a sketch set is the set of 5000 items with the lowest MD5 hash values, they can be easily merged. Just merge all the values, and take the lowest 5000 hashes of the result. This can be used as an estimate of the size of the union of the two underlying sketches. ( There is also a
combine_sketch UDF in Brickhouse, which will combine sketches in a non-aggregate manner.)
This technique isn’t just useful for sketch sets. It can be applied to almost any situation where you are doing an aggregation across an unbalanced dataset. We could have done something similar if we were just doing simple counts. You just need to be able to save intermediate results in a format which can then be merged together in a sec