Sample a subset of lines by choosing a random set of values from key fields.
Distinct sampling is a streaming form of sampling, similar to Bernoulli sampling. However, instead of subjecting each line to an independent trial, lines are selected based on a key taken from each line. A portion of the keys is randomly selected for output, and every line containing a selected key is included in the output.
An example use-case is a query log containing <user, query, clicked-url> triples. It is often useful to sample records for a portion of the users while including all records for each selected user. Distinct sampling supports this by selecting the subset of users included in the output.
Distinct sampling is done by hashing the key and mapping the hash value into a number of buckets chosen to match the inclusion probability (for example, ten buckets for a probability of 0.1). Records whose key maps to bucket zero are output.
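The hashing scheme described above can be illustrated with a minimal Python sketch. It is not the tool's actual implementation: the function name `distinct_sample`, its parameters, the choice of BLAKE2b as the hash, and the use of a seed as the hash salt are all assumptions made for illustration. The number of buckets is taken to be the rounded inverse of the inclusion probability.

```python
import hashlib

def distinct_sample(lines, key_index, prob, seed=0, delimiter="\t"):
    """Yield every line whose key hashes to bucket zero.

    Hypothetical helper: names and hash choice are illustrative assumptions.
    """
    # Assumption: the bucket count is the rounded inverse of the probability,
    # so prob=0.1 gives 10 buckets and roughly one key in ten is selected.
    num_buckets = max(1, round(1 / prob))
    # The seed (a non-negative integer) salts the hash so that different
    # seeds select different subsets of keys.
    salt = seed.to_bytes(8, "little")
    for line in lines:
        key = line.rstrip("\n").split(delimiter)[key_index]
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, salt=salt).digest()
        bucket = int.from_bytes(digest, "little") % num_buckets
        if bucket == 0:
            yield line

# Usage: sample about 25% of users from a small query log,
# keeping all records for each selected user.
log = [
    "alice\tcheap flights\thttp://example.com/a",
    "bob\tweather\thttp://example.com/b",
    "alice\thotel deals\thttp://example.com/c",
    "carol\tnews\thttp://example.com/d",
]
for record in distinct_sample(log, key_index=0, prob=0.25, seed=42):
    print(record)
```

Because selection depends only on the key and the seed, the sketch needs no memory of previously seen keys, which is what makes the approach suitable for streaming. Note that the fraction of lines output can differ noticeably from the inclusion probability when a few keys account for many lines.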