bernoulliSkipSampling is an implementation of Bernoulli sampling using skips.
Skip sampling works by skipping a random number of lines between selections. This
can be faster than assigning a random value to each line when the inclusion
probability is low, as it reduces the number of calls to the random number
generator. Both the random number generator and the log() function are called when
calculating the next skip size. These additional log() calls add up as the
inclusion probability increases.
Performance tests indicate the break-even point is about 4-5% (--prob 0.04) for
file-oriented line sampling. This is obviously environment specific. In the
environments this implementation has been tested in the performance improvements
remain small, less than 7%, even with an inclusion probability as low as 0.0001.
The algorithm does not assign random values to individual lines. This makes it
incompatible with random value printing. It is not suitable for compatibility mode
either. As an example, in compatibility mode a line selected with '--prob 0.2' should
also be selected with '--prob 0.3' (assuming the same random seed). Skip sampling
does not have this property.
The algorithm for calculating the skip size has been described by multiple sources.
There are two key variants depending on whether the total number of lines in the
data set is known in advance. (This implementation does not know the total.)
Useful references:
Jeffrey Scott Vitter, "An Efficient Algorithm for Sequential Random Sampling",
ACM Trans on Mathematical Software, 1987. On-line:
http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf
P.J. Haas, "Data-Stream Sampling: Basic Techniques and Results", from the book
"Data Stream Management", Springer-Verlag, 2016. On-line:
https://www.springer.com/cda/content/document/cda_downloaddocument/9783540286073-c2.pdf
Erik Erlandson, "Faster Random Samples With Gap Sampling", 2014. On-line:
http://erikerlandson.github.io/blog/2014/09/11/faster-random-samples-with-gap-sampling/
bernoulliSkipSampling is an implementation of Bernoulli sampling using skips.
Skip sampling works by skipping a random number of lines between selections. This can be faster than assigning a random value to each line when the inclusion probability is low, as it reduces the number of calls to the random number generator. Both the random number generator and the log() function are called when calculating the next skip size. These additional log() calls add up as the inclusion probability increases.
Performance tests indicate the break-even point is about 4-5% (--prob 0.04) for file-oriented line sampling. This is obviously environment specific. In the environments this implementation has been tested in the performance improvements remain small, less than 7%, even with an inclusion probability as low as 0.0001.
The algorithm does not assign random values to individual lines. This makes it incompatible with random value printing. It is not suitable for compatibility mode either. As an example, in compatibility mode a line selected with '--prob 0.2' should also be selected with '--prob 0.3' (assuming the same random seed). Skip sampling does not have this property.
The algorithm for calculating the skip size has been described by multiple sources. There are two key variants depending on whether the total number of lines in the data set is known in advance. (This implementation does not know the total.) Useful references: