tsv_sample

Command line tool for randomizing or sampling lines from input streams. Several sampling methods are available, including simple random sampling, weighted random sampling, Bernoulli sampling, and distinct sampling.

Copyright (c) 2017-2018, eBay Software Foundation. Initially written by Jon Degenhardt.

Members

Functions

bernoulliSampling
void bernoulliSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

Bernoulli sampling on the input stream. Each input line is assigned a random value and is output if that value is less than the inclusion probability. The order of the lines is not changed.
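
The per-line decision described above can be sketched as follows. This is a minimal illustration, not the module's implementation: the helper name `bernoulliSample`, the in-memory `string[]` input, and the fixed seed are assumptions made for the example.

```d
import std.random : Random, uniform01;
import std.stdio : writeln;

// Hypothetical helper, not part of tsv_sample's API.
// Keeps each line independently with probability `prob`, preserving order.
string[] bernoulliSample(string[] lines, double prob, ref Random rng)
{
    string[] kept;
    foreach (line; lines)
    {
        // uniform01 draws from [0, 1); keep the line if the draw is below prob.
        if (uniform01(rng) < prob)
            kept ~= line;
    }
    return kept;
}

void main()
{
    auto rng = Random(42); // fixed seed so repeated runs agree
    auto lines = ["a", "b", "c", "d", "e"];
    writeln(bernoulliSample(lines, 0.5, rng));
}
```

Because every line is decided independently, the size of the output varies from run to run unless the seed is fixed.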

distinctSampling
void distinctSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

Sample a subset of the unique values from the key fields.
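
One way to sample by key is to hash the key and keep the line when the hash lands in a chosen bucket, so that all lines sharing a key are kept or dropped together. The sketch below assumes this approach; the helper name `distinctSample`, the use of D's built-in `hashOf`, and the bucket-based rule are illustrative assumptions and may differ from what tsv_sample actually does.

```d
import std.array : split;
import std.stdio : writeln;

// Hypothetical helper, not part of tsv_sample's API.
// Keeps a line when its key field hashes into bucket 0 out of `buckets`.
string[] distinctSample(string[] lines, size_t keyField, size_t buckets)
{
    assert(buckets > 0, "buckets must be positive");
    string[] kept;
    foreach (line; lines)
    {
        auto fields = line.split("\t");
        if (keyField >= fields.length)
            continue; // this sketch skips lines that lack the key field
        // The decision depends only on the key, so equal keys always agree.
        if (hashOf(fields[keyField]) % buckets == 0)
            kept ~= line;
    }
    return kept;
}

void main()
{
    auto lines = ["k1\ta", "k2\tb", "k1\tc", "k3\td", "k2\te"];
    writeln(distinctSample(lines, 0, 2));
}
```

With `buckets` set to 2, roughly half of the distinct keys are expected to survive, and every line for a surviving key is output.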

generateWeightedRandomValuesInorder
void generateWeightedRandomValuesInorder(TsvSampleOptions cmdopt, OutputRange outputStream)

Generates weighted random values for all input lines, preserving input order.
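
A common way to produce a weighted random value is the Efraimidis-Spirakis key `u^(1/w)`, where `u` is uniform on [0, 1) and `w` is the line's weight. The sketch below assumes that formula; whether tsv_sample uses exactly this computation is an assumption, and the helper name `weightedRandomValue` and the in-memory weights are made up for the example.

```d
import std.math : pow;
import std.random : Random, uniform01;
import std.stdio : writefln;

// Hypothetical helper, not part of tsv_sample's API.
// Larger weights push the value toward 1. The weight must be positive.
double weightedRandomValue(double weight, ref Random rng)
{
    return pow(uniform01(rng), 1.0 / weight);
}

void main()
{
    auto rng = Random(42);
    auto lines = ["a", "b", "c"];
    auto weights = [1.0, 5.0, 0.5];
    // Emit value and line in input order, without sorting or filtering.
    foreach (i, line; lines)
        writefln("%.6f\t%s", weightedRandomValue(weights[i], rng), line);
}
```

Sorting such output by value in descending order and taking the first k lines would give a weighted sample of size k under this scheme.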

getFieldValue
T getFieldValue(C[] line, size_t fieldIndex, C delim, string filename, size_t lineNum)

Undocumented in source. Be warned that the author may not have intended to support it.

main
int main(string[] cmdArgs)

Undocumented in source. Be warned that the author may not have intended to support it.

randomizeLines
void randomizeLines(TsvSampleOptions cmdopt, OutputRange outputStream)

Randomize all the lines in files or standard input.
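
Randomizing all lines amounts to a uniform shuffle of the full input. A minimal sketch using Phobos's `randomShuffle` is shown here; the helper name `shuffledCopy` and the in-memory input are assumptions, and tsv_sample itself may organize the work differently.

```d
import std.random : Random, randomShuffle;
import std.stdio : writeln;

// Hypothetical helper, not part of tsv_sample's API.
// Returns a shuffled copy and leaves the caller's array untouched.
string[] shuffledCopy(string[] lines, ref Random rng)
{
    auto copy = lines.dup;
    randomShuffle(copy, rng);
    return copy;
}

void main()
{
    auto rng = Random(42);
    writeln(shuffledCopy(["a", "b", "c", "d"], rng));
}
```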

reservoirSampling
void reservoirSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

An implementation of reservoir sampling. Both weighted and uniform random sampling are supported.
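
For the uniform case, the classic "Algorithm R" form of reservoir sampling can be sketched as below. This illustrates the general technique only; it is not necessarily the variant tsv_sample implements, it does not cover the weighted case, and the helper name `reservoirSample` is an assumption.

```d
import std.random : Random, uniform;
import std.stdio : writeln;

// Hypothetical helper, not part of tsv_sample's API.
// Selects up to k lines uniformly at random in one pass.
string[] reservoirSample(string[] lines, size_t k, ref Random rng)
{
    string[] reservoir;
    foreach (i, line; lines)
    {
        if (i < k)
        {
            // Fill the reservoir with the first k lines.
            reservoir ~= line;
        }
        else
        {
            // Pick a slot in [0, i]; replace only if it falls inside the reservoir.
            immutable j = uniform(size_t(0), i + 1, rng);
            if (j < k)
                reservoir[j] = line;
        }
    }
    return reservoir;
}

void main()
{
    auto rng = Random(42);
    writeln(reservoirSample(["a", "b", "c", "d", "e", "f"], 3, rng));
}
```

Only k lines are held in memory at any time, which is why the technique suits input streams of unknown length.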

simpleRandomSamplingWithReplacement
void simpleRandomSamplingWithReplacement(TsvSampleOptions cmdopt, OutputRange outputStream)

Simple random sampling with replacement.
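
Sampling with replacement means each pick is drawn from the full set of lines, so the same line can appear more than once and the output can be longer than the input. A minimal sketch follows; the helper name `sampleWithReplacement` and the in-memory input are assumptions for illustration.

```d
import std.random : Random, uniform;
import std.stdio : writeln;

// Hypothetical helper, not part of tsv_sample's API.
// Draws n lines, each chosen uniformly from the whole input.
string[] sampleWithReplacement(string[] lines, size_t n, ref Random rng)
{
    string[] picks;
    if (lines.length == 0)
        return picks; // nothing to draw from
    foreach (_; 0 .. n)
        picks ~= lines[uniform(size_t(0), lines.length, rng)];
    return picks;
}

void main()
{
    auto rng = Random(42);
    writeln(sampleWithReplacement(["a", "b", "c"], 5, rng));
}
```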

testTsvSample
void testTsvSample(string[] cmdArgs, string[][] expected)

Undocumented in source. Be warned that the author may not have intended to support it.

tsvSample
void tsvSample(TsvSampleOptions cmdopt, OutputRange outputStream)

Invokes the appropriate sampling routine based on the command line arguments.

Structs

TsvSampleOptions
struct TsvSampleOptions

Container for command line options.

Variables

helpText
auto helpText;

Undocumented in source.

helpTextVerbose
auto helpTextVerbose;

Undocumented in source.

Meta