Reservoir sampling using a heap. Both weighted and unweighted random sampling are supported.
The algorithm used here is based on the one-pass algorithm described by Pavlos Efraimidis and Paul Spirakis ("Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis, https://arxiv.org/abs/1012.0256). In the unweighted case weights are simply set to one.
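As a minimal sketch of the key assignment step (function name and signature are illustrative, not from the implementation): each line gets a uniform random value raised to the power 1/weight, following the Efraimidis-Spirakis scheme. The unweighted case is simply weight = 1.

```python
import random

def sample_key(weight: float, rng: random.Random) -> float:
    # Efraimidis-Spirakis key: a uniform u in [0, 1) raised to 1/weight.
    # Larger weights tend to produce larger keys, so keeping the lines
    # with the largest keys yields a weighted sample. Unweighted
    # sampling is the special case weight == 1.0, where the key is u.
    u = rng.random()
    return u ** (1.0 / weight)
```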
The implementation uses a heap (priority queue) large enough to hold the desired number of lines. Input is read line-by-line, assigned a random value, and added to the heap. The role of the heap is to identify the lines with the highest assigned random values. Once the heap is full, a new line is kept only if its value exceeds the current minimum; the line with the lowest value is then dropped. A "min" heap is used for this purpose, as it keeps the lowest value at the root where it can be compared and replaced cheaply.
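A sketch of this heap-based selection loop, assuming unit weights when none are given (names and signature are hypothetical, not from the implementation):

```python
import heapq
import random

def reservoir_sample(lines, k, weights=None, seed=None):
    """Keep the k lines with the largest assigned random keys.

    A min-heap of size k holds (key, line) pairs. The smallest key sits
    at the root, so a new line either replaces the root (if its key is
    larger) or is discarded.
    """
    rng = random.Random(seed)
    heap = []  # min-heap of (key, line) tuples
    for i, line in enumerate(lines):
        w = 1.0 if weights is None else weights[i]
        key = rng.random() ** (1.0 / w)
        if len(heap) < k:
            heapq.heappush(heap, (key, line))
        elif key > heap[0][0]:
            # New line outscores the current minimum: swap it in.
            heapq.heapreplace(heap, (key, line))
    return heap
```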
When all lines have been read, the "min" heap is in the opposite of the order needed for output. The desired order is obtained by removing elements one at a time from the heap and writing them to the back of the backing array, so that when the heap is empty the array holds the elements in the correct order.
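The reordering step can be sketched as follows (a standalone illustration, not the implementation's own code): popping a min-heap yields ascending key order, so filling the result array from the back leaves it in descending order.

```python
import heapq

def heap_to_output_order(heap):
    # heapq.heappop always removes the smallest element, so writing
    # pops from the last slot toward the first leaves the array
    # ordered from highest key to lowest.
    out = [None] * len(heap)
    for i in range(len(heap) - 1, -1, -1):
        out[i] = heapq.heappop(heap)
    return out
```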
Generating output in weighted order matters for several reasons:
- For weighted sampling, it preserves the property that smaller valid subsets can be created by taking the first N lines.
- For unweighted sampling, it ensures that all output permutations are possible, and are not influenced by input order or the heap data structure used.
- Order consistency is maintained when making repeated use of the same random seed, but with different sample sizes.
There are use cases where only the selection set matters; for these, some performance could be gained by skipping the reordering and printing the backing-store array in its existing order. However, making this distinction seems an unnecessary complication.