Reservoir sampling using a heap. Both weighted and unweighted random sampling are supported.
The algorithm used here is based on the one-pass algorithm described by Pavlos Efraimidis and Paul Spirakis ("Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis, https://arxiv.org/abs/1012.0256). In the unweighted case weights are simply set to one.
The implementation uses a heap (priority queue) large enough to hold the desired number of lines. Input is read line-by-line, assigned a random value, and added to the heap. The role of the heap is to identify the lines with the highest assigned random values. Once the heap is full, adding a new line means dropping the line with the lowest score. A "min" heap used for this reason.
When done reading all lines, the "min" heap is in reverse of weighted selection order. Weighted selection order is obtained by removing each element one at at time from the heap. The underlying data store will have the elements in weighted selection order (largest weights first).
Generating output in weighted order is useful for several reasons: - For weighted sampling, it preserves the property that smaller valid subsets can be created by taking the first N lines. - For unweighted sampling, it ensures that all output permutations are possible, and are not influenced by input order or the heap data structure used. - Order consistency is maintained when making repeated use of the same random seed, but with different sample sizes.
The other choice is preserving input order. This is supporting by recording line numbers and sorting the selected sample.
There are use cases where only the selection set matters. For these some performance could be gained by skipping the reordering and simply printing the backing store array in-order. Performance tests indicate only a minor benefit, so this is not supported.