Command line tool for shuffling or sampling lines from input streams. Several methods are available, including weighted and unweighted shuffling, simple and weighted random sampling, sampling with replacement, Bernoulli sampling, and distinct sampling.
Copyright (c) 2017-2021, eBay Inc. Initially written by Jon Degenhardt.
HasRandomValue is a boolean flag used at compile time by identifyInputLines to distinguish use cases needing random value assignments from those that don't.
Bernoulli sampling of lines from the input stream.
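As an illustration of the idea only (not the tool's own code), Bernoulli sampling can be sketched in Python as follows. The function name `bernoulli_sample` and its parameters are hypothetical; each line is kept independently with probability `prob`, and input order is preserved.

```python
import random


def bernoulli_sample(lines, prob, rng=None):
    """Yield each line independently with probability `prob` (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    for line in lines:
        # rng.random() is uniform on [0, 1), so prob=1.0 keeps every line
        # and prob=0.0 keeps none.
        if rng.random() < prob:
            yield line
```

The output size is random: for n input lines, the expected number of output lines is n times `prob`.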
Bernoulli sampling command handler. Invokes the appropriate Bernoulli sampling routine based on the command line arguments.
bernoulliSkipSampling is an implementation of Bernoulli sampling using skips.
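One common way to implement Bernoulli sampling with skips is to draw the number of lines to skip from a geometric distribution, so that a random number is generated once per selected line rather than once per input line. The Python sketch below assumes that approach; it is not taken from the tool's source, and `bernoulli_skip_sample` is a hypothetical name.

```python
import math
import random


def bernoulli_skip_sample(lines, prob, rng=None):
    """Bernoulli sampling via geometric skip counts (hypothetical helper)."""
    if rng is None:
        rng = random.Random()
    if prob <= 0.0:
        return
    if prob >= 1.0:
        yield from lines
        return

    log_reject = math.log1p(-prob)  # log(1 - prob), always negative here

    def next_skip():
        # u is uniform on (0, 1]; floor(log(u) / log(1 - prob)) is the number
        # of rejected lines before the next accepted one.
        u = 1.0 - rng.random()
        return int(math.log(u) / log_reject)

    remaining = next_skip()
    for line in lines:
        if remaining == 0:
            yield line
            remaining = next_skip()
        else:
            remaining -= 1
```

Skipping tends to pay off when `prob` is small, since most lines are passed over without touching the random number generator.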
Sample lines by choosing a random set of distinct keys formed from one or more fields on each line.
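A minimal Python sketch of distinct sampling is shown below. It assumes the key is hashed together with a seed and the hash decides whether the key is selected, so every line sharing a key receives the same decision. The hash function (BLAKE2b), the name `distinct_sample`, and all parameters are assumptions made for illustration, not the tool's actual choices.

```python
import hashlib


def distinct_sample(lines, key_fields, prob, seed=0, delim="\t"):
    """Keep all lines whose key falls in a pseudo-random subset of keys.

    `key_fields` holds zero-based field indices; `seed` must fit in 64 bits.
    """
    salt = seed.to_bytes(8, "little")
    threshold = prob * 2**64
    for line in lines:
        fields = line.split(delim)
        key = delim.join(fields[i] for i in key_fields)
        digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8, key=salt).digest()
        # The 64-bit hash is uniform on [0, 2**64); a key is selected when its
        # hash lands below prob * 2**64.
        if int.from_bytes(digest, "little") < threshold:
            yield line
```

Because the decision depends only on the key and the seed, no per-key state needs to be stored while streaming.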
Write a floating point random value to an output stream.
Generate weighted random values for all input lines, preserving input order.
identifyInputLines is used by algorithms that read all files into memory prior to processing. It does the initial processing of the file data.
Random sampling command handler. Invokes the appropriate sampling routine based on the command line arguments.
Shuffle (randomize) all input lines using a shuffling algorithm.
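The entry above does not name the shuffling algorithm. As one possibility, a Fisher-Yates shuffle over the in-memory lines can be sketched in Python; `shuffle_lines` is a hypothetical name and the choice of algorithm is an assumption.

```python
import random


def shuffle_lines(lines, rng=None):
    """Return a uniformly shuffled copy of `lines` (Fisher-Yates sketch)."""
    if rng is None:
        rng = random.Random()
    result = list(lines)
    # Walk from the last position down, swapping each position with a
    # uniformly chosen position at or before it.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randrange(i + 1)
        result[i], result[j] = result[j], result[i]
    return result
```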
Shuffle all input lines by assigning random weights and sorting.
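Shuffling by random weights can be sketched as follows. For the weighted case, this Python example assumes the Efraimidis-Spirakis key `u ** (1 / w)`, with `u` uniform on [0, 1) and `w` a positive line weight; whether the tool uses this exact formula is an assumption, and `weighted_shuffle` is a hypothetical name.

```python
import random


def weighted_shuffle(lines, weights, rng=None):
    """Order lines by descending random key; larger weights tend to come first."""
    if rng is None:
        rng = random.Random()
    # Assumed key formula: u ** (1 / w). Weights must be positive.
    keyed = [(rng.random() ** (1.0 / w), line) for line, w in zip(lines, weights)]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [line for _, line in keyed]
```

With all weights equal to 1 the key reduces to a plain uniform random value, which gives an unweighted shuffle.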
Read data from one or more files. This routine is used by algorithms needing to read all data into memory.
Reservoir sampling via Algorithm R.
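Algorithm R keeps the first k lines, then replaces a random reservoir slot with line i (zero-based) with probability k / (i + 1). A short Python sketch follows; `reservoir_sample_r` is a hypothetical name, and the code illustrates the classic algorithm rather than the tool's implementation.

```python
import random


def reservoir_sample_r(lines, k, rng=None):
    """Return a uniform random sample of up to k lines from a stream."""
    if rng is None:
        rng = random.Random()
    reservoir = []
    for i, line in enumerate(lines):
        if i < k:
            reservoir.append(line)
        else:
            # j is uniform on [0, i]; the line replaces slot j when j < k,
            # which happens with probability k / (i + 1).
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = line
    return reservoir
```

The reservoir is a uniform sample, but its order is not a uniform shuffle; early lines tend to stay in their original slots.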
Reservoir sampling using a heap. Both weighted and unweighted random sampling are supported.
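A heap-based reservoir can be sketched by assigning each line a random key and keeping the k largest keys in a min-heap. In the Python example below, the weighted key `u ** (1 / w)` is an assumption (the Efraimidis-Spirakis scheme), and `heap_reservoir_sample` is a hypothetical name.

```python
import heapq
import random


def heap_reservoir_sample(lines, k, weights=None, rng=None):
    """Sample up to k lines; `weights` is an optional sequence of positive weights."""
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return []
    heap = []  # min-heap of (key, index, line); heap[0] holds the smallest key
    for i, line in enumerate(lines):
        u = rng.random()
        key = u if weights is None else u ** (1.0 / weights[i])
        entry = (key, i, line)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            # The new line beats the weakest kept line, so it takes its place.
            heapq.heapreplace(heap, entry)
    # Returning lines in descending key order yields a random output order.
    return [line for _, _, line in sorted(heap, reverse=True)]
```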
Shuffling command handler. Invokes the appropriate shuffle (line order randomization) routine based on the command line arguments.
Simple random sampling with replacement.
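Sampling with replacement can be sketched by loading all lines into memory and drawing a uniformly random index for every output line, so the same line may appear more than once. The Python function `sample_with_replacement` and its `num` parameter are hypothetical names used for illustration.

```python
import random


def sample_with_replacement(lines, num, rng=None):
    """Return `num` lines drawn uniformly at random, with replacement."""
    if rng is None:
        rng = random.Random()
    pool = list(lines)
    if not pool:
        return []
    # Each draw is independent, so repeats are expected.
    return [pool[rng.randrange(len(pool))] for _ in range(num)]
```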
Invokes the appropriate sampling routine based on the command line arguments.
A container holding data read from a file or standard input.
An InputLine array is returned by identifyInputLines to represent each non-header line found in a FileData array. The 'data' element contains the line. A 'randomValue' element is included if random values are being generated.
Container for command line options and derived data.