Previously, I discussed how horrible it was to attempt to perform a count distinct in Hive; how it would cause you to sort the universe, and then wait until the end of time for a single reducer to complete. The standard solution is to avoid doing an exact count, and to use a probabilistic data structure, like KMV sketches or HyperLogLogs, to produce a count estimate.
Sometimes, however, you really do need an exact count. For example, when doing data QA on your pipelines, you want to make sure that you haven’t accidentally dropped any records, and that no faulty logic has introduced extra ones. In this case, you want to confirm that the exact counts match between the data inputs and outputs.
How can you avoid the evils of count distinct? This installment’s guest blogger, Prantik Bhattaccharyya, discusses how you can use Brickhouse’s group_count UDF, along with a prudent distribute, to save the universe.
Read about it on his newly pressed blog.
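If you want the gist before clicking through, here is a rough Python sketch of the general idea as I understand it: spread the keys across many workers so that every copy of a key lands on the same worker, sort within each worker, count the runs of equal keys there, and add up the per-worker totals. This is only an illustration of the distribute-then-count pattern, not Brickhouse's actual group_count implementation or the query from the guest post; the function name exact_distinct_count and the num_partitions parameter are made up for this example, and it assumes all values are of one sortable type.

```python
from itertools import groupby
from zlib import crc32


def exact_distinct_count(values, num_partitions=4):
    """Exact distinct count without one global sort (illustrative only).

    Hypothetical helper: mimics a "distribute by key, sort within each
    partition, count groups" plan on a single machine.
    """
    # "Distribute": a deterministic hash sends every copy of a value
    # to the same partition, so no value is counted in two places.
    partitions = [[] for _ in range(num_partitions)]
    for v in values:
        bucket = crc32(str(v).encode("utf-8")) % num_partitions
        partitions[bucket].append(v)

    total = 0
    for part in partitions:
        # "Sort within the partition": equal values become adjacent.
        part.sort()
        # Each run of equal values is one distinct key in this partition.
        total += sum(1 for _key, _run in groupby(part))

    # Partitions hold disjoint sets of keys, so the sum is exact.
    return total


if __name__ == "__main__":
    user_ids = ["u1", "u2", "u1", "u3", "u2", "u1"]
    print(exact_distinct_count(user_ids))
```

The point of the sketch is that each partition only ever sorts its own slice of the data, which is what lets the work spread across many reducers instead of piling up on one.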