Random sampling command handler. Invokes the appropriate sampling routine based on
the command line arguments.
Random sampling selects a fixed-size random sample from the input stream. Both
simple random sampling (equal likelihood) and weighted random sampling are
supported. Selected lines are output either in random order or original input order.
For weighted sampling the random order is the weighted selection order.
Two algorithms are used: reservoir sampling via a heap and reservoir sampling via
Algorithm R. This routine selects the appropriate reservoir sampling function and
template instantiation based on the command line arguments.
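
The selection this routine performs can be pictured as a small dispatch function. The sketch below is only an illustration, not the actual implementation: `choose_sampler`, its parameter names, and the returned labels are hypothetical, and the threshold is passed in by the caller because its real value is not stated here. Whether the cutoff comparison is strict is also an assumption.

```python
def choose_sampler(sample_size, threshold, weighted=False, compatibility_mode=False):
    """Return a label for the reservoir sampling routine to use.

    Hypothetical helper: names and labels are illustrative only.
    `threshold` is the unweighted cutoff; the real value is not given
    in this text, so the caller must supply one.
    """
    # Weighted sampling and compatibility mode both rely on per-line
    # random values, so they always take the heap-based routine.
    if weighted or compatibility_mode:
        return "heap"
    # Unweighted sampling: heap for small samples, Algorithm R for large ones.
    # Using a strict comparison at the cutoff is an assumption.
    return "heap" if sample_size < threshold else "algorithm_r"
```

For example, `choose_sampler(10, threshold=100)` selects the heap routine, while `choose_sampler(1000, threshold=100)` selects Algorithm R.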
Weighted sampling always uses the heap approach. Compatibility mode does as well,
since the heap approach is the one that assigns a random value to each line. In
compatibility mode, a larger sample includes every line selected for a smaller
sample, provided the same random seed is used.
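
To make the per-line random value idea concrete, here is a minimal sketch of heap-based reservoir sampling. It is not the actual routine: `heap_sample` and its parameters are hypothetical names, and the weighted key `u ** (1 / w)` (the Efraimidis-Spirakis scheme) is an assumption about how weights could be applied, with all weights assumed positive. Because each line gets exactly one random draw regardless of the sample size, the same seed yields the same keys, which is what makes a larger sample contain a smaller one.

```python
import heapq
import random


def heap_sample(lines, k, seed, weights=None):
    """Keep the k lines with the largest random keys, using a min-heap.

    Hypothetical sketch. `weights`, if given, must hold one positive
    number per line.
    """
    if k <= 0:
        return []
    rng = random.Random(seed)
    heap = []  # min-heap of (key, index, line); the smallest key sits at heap[0]
    for index, line in enumerate(lines):
        u = rng.random()  # exactly one draw per line, independent of k
        # Unweighted: the key is the uniform value itself.
        # Weighted (assumed scheme): key = u ** (1 / weight).
        key = u if weights is None else u ** (1.0 / weights[index])
        entry = (key, index, line)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # The new line beats the current weakest member; swap it in.
            heapq.heapreplace(heap, entry)
    # Largest key first gives the random (or weighted selection) order.
    return [line for _, _, line in sorted(heap, reverse=True)]
```

With a fixed seed, `heap_sample(lines, 25, seed=7)[:5]` equals `heap_sample(lines, 5, seed=7)`: the five largest keys are the same in both runs, so the smaller sample is a prefix of the larger one.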
For unweighted sampling there is a performance tradeoff between implementations.
Heap-based sampling is faster for small sample sizes. Algorithm R is faster for
large sample sizes. The threshold used was chosen based on performance tests. See
the reservoirSamplingAlgorithmR documentation for more information.
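
For comparison, the textbook form of Algorithm R can be sketched as follows. This is a generic illustration rather than the reservoirSamplingAlgorithmR function itself; `algorithm_r` and its parameters are hypothetical names. It does constant work per line after the reservoir fills and needs no per-line keys.

```python
import random


def algorithm_r(lines, k, seed=None):
    """Textbook Algorithm R: uniform sample of up to k lines in one pass.

    Hypothetical sketch, not the actual implementation.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            # Fill the reservoir with the first k lines.
            reservoir.append(line)
        else:
            # Pick a slot uniformly from [0, i]; the line is kept only
            # if that slot falls inside the reservoir.
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = line
    return reservoir
```

Note that the reservoir is not left in random order: an early line that survives stays in its original slot. Producing random output order from this sketch would take an additional shuffle, and producing input order would require tracking line numbers and sorting by them.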