When you absolutely have to do a count distinct

Previously, I discussed how horrible it is to attempt a count distinct in Hive: it causes you to sort the universe, and then wait until the end of time for a single reducer to complete. The standard solution is to avoid an exact count and instead use a probabilistic data structure, like a KMV sketch or a HyperLogLog, to produce a count estimate.
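To make the idea of a count estimate concrete, here is a minimal Python sketch of a KMV ("k minimum values") estimator. It is only an illustration and not Brickhouse's implementation; the function names, the SHA-1 hash, and the default `k` are all assumptions chosen for this example.

```python
import hashlib


def _hash01(value):
    """Map a value to a pseudo-random float in [0, 1) (illustrative hash choice)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(values, k=256):
    """Estimate the number of distinct values by keeping the k smallest hashes."""
    smallest = set()
    for v in values:
        smallest.add(_hash01(v))
        if len(smallest) > k:
            # Drop the largest hash so that only the k smallest survive.
            smallest.remove(max(smallest))
    if len(smallest) < k:
        # Fewer than k distinct hashes seen: the sketch still holds an exact count.
        return len(smallest)
    # With k hashes kept, the k-th smallest hash tells us how densely
    # the distinct values fill the interval [0, 1).
    return (k - 1) / max(smallest)
```

The sketch only ever holds `k` numbers, no matter how many rows it sees, which is why the estimate is cheap. The price is that the answer is approximate once more than `k` distinct values have been seen.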

Sometimes, however, you really do need an exact count. For example, when doing data QA on your pipelines, you want to make sure that you haven’t accidentally dropped any records, and that no faulty logic has introduced extra ones. In this case, you want to verify that certain exact counts match between the data inputs and outputs.

How can you avoid the evils of count distinct? This installment’s guest blogger, Prantik Bhattaccharyya, discusses how you can use Brickhouse’s group_count UDF, along with a prudent distribute, to save the universe.
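The intuition behind a prudent distribute can be sketched in a few lines of Python. This is not the query from the guest post, and it does not reproduce how group_count works; it is a toy model, with a hypothetical function name and a made-up reducer count, of why distributing rows by the column being counted lets many reducers share the work while keeping the count exact.

```python
import zlib
from collections import defaultdict


def exact_distinct_count(values, num_reducers=4):
    """Exact distinct count of string values, computed partition by partition."""
    partitions = defaultdict(set)
    for v in values:
        # "Distribute by" the value itself: equal values always hash to the
        # same reducer, so no value is ever counted by two reducers.
        reducer = zlib.crc32(v.encode("utf-8")) % num_reducers
        partitions[reducer].add(v)
    # Each reducer counts its own distinct values; the partial counts
    # can simply be added because the partitions do not overlap.
    return sum(len(seen) for seen in partitions.values())
```

Because every distinct value lands in exactly one partition, the per-partition counts sum to the exact total, and no single reducer has to see the whole data set.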

Read about it on his newly pressed blog
