reservoirSamplingAlgorithmR

Reservoir sampling via Algorithm R

This is an implementation of reservoir sampling using what is commonly known as "Algorithm R", credited to Alan Waterman by Donald Knuth in the "The Art of Computer Programming, Volume 2: Seminumerical Algorithms". More information about the algorithm can be found in Jeffrey Vitter's classic paper "Random Sampling with a Reservoir" (1985) as well as the Wikipedia article "Reservoir Sampling" (https://en.wikipedia.org/wiki/Reservoir_sampling#Algorithm_R).

Algorithm R is used for unweighted sampling without replacement. The heap-based algorithm in reservoirSamplingViaHeap is used for weighted sampling.

The classic algorithm stops after identifying the selected set of items. This implementation goes one step further and randomizes the order of the selected lines. This is consistent with shuffling (line order randomization), a primary tsv-sample use-case.

This algorithm is faster than reservoirSamplingViaHeap when the sample size (reservoir size) is large. Heap insertion is O(log k), where k is the sample size. Insertion in this algorithm is O(1). Similarly, generating the random order in the heap is O(k * log k), while in this algorithm the final randomization step is O(k).

This speed advantage may be offset a certain amount by using a more expensive random value generator. reservoirSamplingViaHeap generates values between zero and one, whereas reservoirSamplingAlgorithmR generates random integers over and ever growing interval. The latter is expected to be more expensive. This is consistent with performance tests indicating that reservoirSamplingViaHeap is faster when using small-to-medium size reservoirs and large input streams.

void

reservoirSamplingAlgorithmR

(

Flag!"preserveInputOrder" preserveInputOrder

OutputRange

)

(

ref TsvSampleOptions cmdopt

auto ref OutputRange outputStream

)

if (

isOutputRange!(OutputRange, char)

)

reservoirSamplingAlgorithmR

Meta

Source