Sample lines by choosing a random set of distinct keys formed from one or more
fields on each line.
Distinct sampling is a streaming form of sampling, similar to Bernoulli sampling.
However, instead of each line being subject to an independent trial, lines are
selected based on a key from each line. A portion of the keys is randomly selected for
output, and every line containing a selected key is included in the output.
An example use-case is a query log having <user, query, clicked-url> triples. It is
often useful to sample records for a portion of the users while including all records
for each selected user. Distinct sampling supports this by selecting a subset of
users to include in the output.
Distinct sampling is done by hashing the key and mapping the hash value into
equal-size buckets, with the bucket size chosen to hold the inclusion probability.
Records having a key mapping to bucket zero are output. Because the buckets are of
equal size, a bucket may cover a larger share of the keys than the requested
inclusion probability. (The other approach would be to have the caller specify the
number of buckets. More correct, but less convenient.)
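
The bucket scheme described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the function names, the use of BLAKE2b as the hash, the seed handling, and the choice of floor(1 / probability) as the bucket count are all assumptions made for the example. The floor is chosen only because it yields buckets that are never smaller than the requested probability, matching the behavior described above.

```python
import hashlib


def num_buckets(prob: float) -> int:
    """Number of equal-size buckets for an inclusion probability (assumed rule)."""
    if not 0.0 < prob <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    # floor(1 / prob): each bucket holds 1 / n of the keys, which is >= prob.
    return max(1, int(1.0 / prob))


def key_selected(key: str, prob: float, seed: int = 0) -> bool:
    """Hash the key and report whether it lands in bucket zero."""
    digest = hashlib.blake2b(
        key.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),  # hypothetical way to vary the sample
    ).digest()
    return int.from_bytes(digest, "little") % num_buckets(prob) == 0


def distinct_sample(lines, key_fields, prob, seed=0, delim="\t"):
    """Yield every line whose key (built from key_fields) is selected.

    Streaming: each line is decided on its own, with no state kept per key.
    """
    for line in lines:
        fields = line.rstrip("\n").split(delim)
        key = delim.join(fields[i] for i in key_fields)
        if key_selected(key, prob, seed):
            yield line


if __name__ == "__main__":
    log = [
        "u1\tcats\thttp://a.example",
        "u2\tdogs\thttp://b.example",
        "u1\tbirds\thttp://c.example",
    ]
    # Key on field 0 (the user): a user's records are kept or dropped together.
    for record in distinct_sample(log, key_fields=[0], prob=0.5):
        print(record)
```

Under this assumed rule, a requested probability of 0.3 gives 3 buckets, so about one third of the keys are selected rather than 30%. Letting the caller pass the bucket count directly would remove that discrepancy, at the cost of a less convenient interface.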