reservoirSampling

An implementation of reservior sampling. Both weighted and uniform random sampling are supported.

Both weighted and uniform random sampling are implemented using the one-pass algorithm described by Pavlos Efraimidis and Paul Spirakis ("Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis, https://arxiv.org/abs/1012.0256). In the unweighted case weights are simply set to one.

The implementation uses a heap (priority queue) large enough to hold the desired number of lines. Input is read line-by-line, assigned a random value, and added to the heap. The role of the identify the lines with the highest assigned random values. Once the heap is full, adding a new line means dropping the line with the lowest score. A "min" heap used for this reason.

When done reading all lines, the "min" heap is in the opposite order needed for output. The desired order is obtained by removing each element one at at time from the heap. The underlying data store will have the elements in correct order.

Generating output in weighted order matters for several reasons: - For weighted sampling, it preserves the property that smaller valid subsets can be created by taking the first N lines. - For unweighted sampling, it ensures that all output permutations are possible, and are not influences by input order or the heap data structure used. - Order consistency when making repeated use of the same random seeds, but with different sample sizes.

There are use cases where only the selection set matters, for these some performance could be gained by skipping the reordering and simply printing the backing store array in-order, but making this distinction seems an unnecessary complication.

Notes: - In tsv-sample versions 1.2.1 and earlier this routine also supported randomization of all input lines. This was dropped in version 1.2.2 in favor of the approach used in randomizeLines. The latter has significant advantages given that all data data must be read into memory. - The unweighted case could be sped up by adopting what is commonly known as "Algorithm R" followed by a random walk on the resulting reservoir (e.g. std.random.randomCover in the D standard library). This is faster than reversing the heap prior to output. The downsides are that the result order would not be consistent with the other routines and that random number printing does not make sense. Order consistency matters only in the rare case when multiple randomizations are being done with the same static seed. For a description of Algorithm R see: https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R.

void

reservoirSampling

(

Flag!"isWeighted" isWeighted

OutputRange

)

(

TsvSampleOptions cmdopt

auto ref OutputRange outputStream

)

if (

isOutputRange!(OutputRange, char)

)

reservoirSampling

Meta

Source