A common problem I’ve seen in MapReduce for advertising analytics is calculating the number of unique values in a large data set. Usually the unique value represents a viewer, a cookie, or a user. From a business perspective, it matters a lot whether there is one visitor making a million impressions on your site, or a million visitors making one impression each. This is where the term “reach” comes from: how many viewers does your website/application/TV show actually “reach”?
For the sake of the examples, let’s imagine that we have a table called “weblogs”, with a column called “cookie”, for which we’re trying to find the number of uniques, or the “reach”.
The Traditional Approach
So, how would you do this normally? Someone from a straight SQL background would probably try something like the following.
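The count-distinct version, assuming the weblogs table described above, would look something like this:

```sql
SELECT COUNT(DISTINCT cookie) AS reach
  FROM weblogs;
```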
This may do the job, but let’s look at what it actually does.
It sorts the entire table by the key “cookie” and sends everything to a single reducer. To count uniques, the reducer scans records in sorted order and increments its count each time the value changes. On a large dataset, however, this becomes very inefficient: first, because you essentially have to re-sort all of your data, and second, because the job cannot utilize more than one reducer.
Enter the SketchSet
So how can the Brickhouse help us out here? The Brickhouse provides a simple reach estimator, called a sketch set. It is similar to min-hash approaches: by saving the K lowest hashes, we can estimate the size of the dataset. Here is the same query without distinct.
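With the Brickhouse jar added and the functions registered (via CREATE TEMPORARY FUNCTION), the sketch-set version of the query looks like:

```sql
SELECT estimated_reach(sketch_set(cookie)) AS reach
  FROM weblogs;
```

Because sketch_set is an aggregate, the partial sketches get built in parallel rather than funneling all rows through one sorted reduce.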
sketch_set is a UDAF which returns a representation of the data set in the form of an array&lt;string&gt; (which caps out at 5000 values; more on that later). The UDF estimated_reach then evaluates that representation, and returns an approximation which is close to the actual number of uniques. There is still an aggregation, but you don’t have to sort your entire data set.
Others can probably explain this better than I can, but essentially we are keeping track of the 5000 values which have the lowest MD5 hashes. This is generally known as a KMV (K minimum values) sketch. If we have fewer than 5000 values, we just keep an array containing all of the values. If we already have 5000 values and want to add another, we check its MD5 hash: if that hash is lower than the highest hash we are currently holding, we add the new value to the array and evict the value with the previously highest hash. The array stores the actual values, sorted by their MD5 hashes.
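The bookkeeping above can be sketched in a few lines of Python. This is just an illustration of the KMV idea, not Brickhouse’s actual Java implementation; the class name and internals are mine:

```python
import hashlib

K = 5000  # Brickhouse's default sketch size

class SketchSet:
    """Keep the values whose MD5 hashes are the k lowest seen (a KMV sketch)."""

    def __init__(self, k=K):
        self.k = k
        self.items = {}  # hash -> original value

    @staticmethod
    def _hash(value):
        # Interpret the first 8 bytes of the MD5 digest as a signed 64-bit
        # integer, mirroring the "Java 64 bit hash" described in the post.
        return int.from_bytes(hashlib.md5(value.encode()).digest()[:8],
                              'big', signed=True)

    def add(self, value):
        h = self._hash(value)
        if h in self.items:
            return  # duplicate value, already tracked
        if len(self.items) < self.k:
            self.items[h] = value
        else:
            worst = max(self.items)  # highest hash currently retained
            if h < worst:
                del self.items[worst]  # evict the previous worst
                self.items[h] = value
```

Below the cap, the sketch simply holds every distinct value, which is why low-reach counts come out exact.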
(I chose 5000 as the default size because it had the right trade-off of accuracy to space for my needs. Accuracy is generally more than 99% and rarely less than 97%. Also, for low-reach sets with fewer than 5000 elements, we have an exact number, because we have the elements themselves. Spoiler alert: Brickhouse now also contains experimental HyperLogLog UDFs, thanks to our friends.)
How does it work?
It’s pretty difficult to convince people that this approach will work, but I like to think of it the following way. A strong hash function like MD5 will distribute hashes evenly across the whole range of numbers, from -MAX_LONG to +MAX_LONG for a Java 64-bit hash, and will have very few hash collisions, where two values produce the same hash. As you add more and more values to your sketch, the 5000th lowest hash gets smaller and smaller, and the space of occupied hashes becomes denser and denser. If you had a set of 2*MAX_LONG unique elements, and your hash function were perfect, then the 5000th lowest hash would be -MAX_LONG + 5000.
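Following that density argument, the K-th lowest hash can be turned into an estimate: if the K retained hashes fill the interval from -MAX_LONG up to the K-th hash, then scaling K by the inverse of that interval’s share of the full 64-bit range gives the set size. A rough illustration of such an estimator (my own simplification, not Brickhouse’s exact formula):

```python
MAX_LONG = 2**63 - 1
HASH_SPACE = 2**64  # size of the signed 64-bit hash range

def estimate_reach(kth_lowest_hash, k=5000):
    # Fraction of the hash space covered by the k lowest hashes,
    # measured from the bottom of the range (-MAX_LONG).
    fraction = (kth_lowest_hash + MAX_LONG) / HASH_SPACE
    # k hashes occupy that fraction of the space, so scale up by its inverse.
    return int(k / fraction)
```

Plugging in the extreme case from the paragraph above, a 5000th-lowest hash of -MAX_LONG + 5000 yields an estimate of 2^64, the full hash space, which matches the perfect-hash thought experiment.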
Pre-aggregate your sketches
Another thing I like about sketch sets is that they can be easily combined. Imagine that you wanted to create a daily report which calculated your website’s reach for the past 90 days. Instead of scanning 90 days of logs every day, we can just create a daily sketch, and then merge the sketches over a rolling time window. The UDAF union_sketch takes pre-generated sketch sets (from the sketch_set UDAF or the convert_to_sketch UDF) and returns a sketch set equivalent to the union of all those sets. Here is how you could estimate your 90 day reach:
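Sketched out in HiveQL (the daily_sketch table, its sketch column, the ds partition column, and the dates are my own stand-ins), that might look like:

```sql
-- Once per day: aggregate that day's logs into a single stored sketch.
INSERT OVERWRITE TABLE daily_sketch PARTITION (ds='2013-04-01')
SELECT sketch_set(cookie)
  FROM weblogs
 WHERE ds = '2013-04-01';

-- For the report: merge the last 90 daily sketches and estimate reach.
SELECT estimated_reach(union_sketch(sketch))
  FROM daily_sketch
 WHERE ds > date_sub('2013-04-01', 90);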
If you get anything out of this posting, and out of Brickhouse, it is that you can easily avoid doing a count distinct to count unique values, and hopefully save countless machine cycles and countless hours waiting for results. There are other data structures which do similar things, but I like sketch sets because they are simple to implement, and for low reach you have the actual values. A sketch set caps the total amount of memory used, so your jobs are sized roughly the same no matter how many uniques are actually counted. Since it is a list of 5000 strings, however, it might not be the best choice where one is overly concerned with the amount of storage or sort space. There are also some more fun tricks you can do with sketch sets, which I will discuss in upcoming posts.