tsv_sample

Command line tool for randomizing or sampling lines from input streams. Several sampling methods are available, including simple random sampling, weighted random sampling, Bernoulli sampling, and distinct sampling.

Copyright (c) 2017-2018, eBay Software Foundation. Initially written by Jon Degenhardt.

Members

Aliases

HasRandomValue
alias HasRandomValue = Flag!"hasRandomValue"

HasRandomValue is a boolean flag used at compile time by identifyFileLines to distinguish use cases needing random value assignments from those that don't.

Functions

bernoulliSampling
void bernoulliSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

Bernoulli sampling on the input stream.
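As a rough illustration of the technique, Bernoulli sampling keeps each line independently with a fixed probability. The sketch below is not the tool's implementation; the function name `bernoulliSample` and its parameters are hypothetical and work on an in-memory array rather than on TsvSampleOptions and an output range.

```d
import std.random : Random, uniform01;

// Hypothetical helper: keep each line independently with probability `prob`.
string[] bernoulliSample(string[] lines, double prob, ref Random rng)
{
    string[] kept;
    foreach (line; lines)
    {
        // uniform01 returns a value in [0, 1), so prob = 1.0 keeps every line
        // and prob = 0.0 keeps none.
        if (uniform01(rng) < prob) kept ~= line;
    }
    return kept;
}

void main()
{
    import std.stdio : writeln;

    auto rng = Random(42);   // fixed seed so repeated runs select the same lines
    auto lines = ["a", "b", "c", "d", "e"];
    foreach (line; bernoulliSample(lines, 0.5, rng)) writeln(line);
}
```

The number of lines selected varies from run to run unless the seed is fixed, which is the main difference from reservoir sampling, where the sample size is fixed.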

bernoulliSamplingCommand
void bernoulliSamplingCommand(TsvSampleOptions cmdopt, OutputRange outputStream)

Bernoulli sampling on the input stream.

bernoulliSkipSampling
void bernoulliSkipSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

Undocumented in source. Be warned that the author may not have intended to support it.

distinctSampling
void distinctSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

Sample a subset of the unique values from the key fields.
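One common way to sample by key, shown here only as an assumption about how such a scheme can work, is to hash the key and keep a line when the hash lands in a chosen bucket. Every line sharing a key is then kept or dropped together. The function `distinctSample` and its parameters are hypothetical and do not reflect the tool's options or hash function.

```d
import std.array : split;

// Hypothetical helper: keep every line whose key field hashes into bucket 0.
// With `buckets` = 4, roughly one key in four is expected to be kept.
string[] distinctSample(string[] lines, size_t keyIndex, size_t buckets, size_t seed)
{
    string[] kept;
    foreach (line; lines)
    {
        auto key = line.split('\t')[keyIndex];
        // hashOf is the druntime hash; the same key and seed always give the
        // same result, so a key is selected as a whole.
        if (hashOf(key, seed) % buckets == 0) kept ~= line;
    }
    return kept;
}

void main()
{
    import std.stdio : writeln;

    auto lines = ["x\t1", "y\t2", "x\t3", "z\t4"];
    foreach (line; distinctSample(lines, 0, 2, 7)) writeln(line);
}
```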

generateWeightedRandomValuesInorder
void generateWeightedRandomValuesInorder(TsvSampleOptions cmdopt, OutputRange outputStream)

Generates weighted random values for all input lines, preserving input order.

getFieldValue
T getFieldValue(C[] line, size_t fieldIndex, C delim, string filename, size_t lineNum)

Undocumented in source. Be warned that the author may not have intended to support it.

identifyFileLines
InputLine!hasRandomValue[] identifyFileLines(FileData[] fileData, TsvSampleOptions cmdopt, OutputRange outputStream)

identifyFileLines is used by algorithms that read all files into memory prior to processing. It does the initial processing of the file data.

main
int main(string[] cmdArgs)

Undocumented in source. Be warned that the author may not have intended to support it.

randomizeLinesCommand
void randomizeLinesCommand(TsvSampleOptions cmdopt, OutputRange outputStream)

Randomize all the lines in files or standard input.

randomizeLinesViaShuffle
void randomizeLinesViaShuffle(TsvSampleOptions cmdopt, OutputRange outputStream)

Randomize all the lines in files or standard input.

randomizeLinesViaSort
void randomizeLinesViaSort(TsvSampleOptions cmdopt, OutputRange outputStream)

Randomize all the lines in files or standard input.
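The names of the two randomization routines suggest two standard techniques: shuffling the lines directly, or assigning each line a random value and sorting by it. The sketch below illustrates both on an in-memory array. It is an assumption-based illustration, not the tool's code; `randomizeViaShuffle`, `randomizeViaSort`, and `Scored` are hypothetical names.

```d
import std.algorithm : map, sort;
import std.array : array;
import std.random : Random, randomShuffle, uniform01;
import std.typecons : Tuple;

// Hypothetical pairing of a line with its random sort key.
alias Scored = Tuple!(double, "randomValue", string, "data");

// Shuffle a copy of the lines in place.
string[] randomizeViaShuffle(string[] lines, ref Random rng)
{
    auto copy = lines.dup;
    randomShuffle(copy, rng);
    return copy;
}

// Attach a random value to each line, then order the lines by that value.
string[] randomizeViaSort(string[] lines, ref Random rng)
{
    Scored[] scored;
    foreach (line; lines) scored ~= Scored(uniform01(rng), line);
    scored.sort!((a, b) => a.randomValue > b.randomValue);
    return scored.map!(s => s.data).array;
}

void main()
{
    import std.stdio : writeln;

    auto rng = Random(42);
    auto lines = ["a", "b", "c", "d"];
    writeln(randomizeViaShuffle(lines, rng));
    writeln(randomizeViaSort(lines, rng));
}
```

The sort-based form carries the random values along with the lines, which is useful when those values need to be reported or reused; the shuffle form avoids storing them.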

reservoirSamplingAlgorithmR
void reservoirSamplingAlgorithmR(TsvSampleOptions cmdopt, OutputRange outputStream)

Reservoir sampling using Algorithm R.
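For readers unfamiliar with it, Algorithm R fills a reservoir with the first k items and then replaces a random reservoir slot with item i (0-based) with probability k / (i + 1). A minimal sketch follows, using a hypothetical `reservoirSampleR` function over an in-memory array rather than the signature documented here.

```d
import std.random : Random, uniform;

// Hypothetical helper: choose k lines uniformly at random in a single pass.
string[] reservoirSampleR(string[] lines, size_t k, ref Random rng)
{
    string[] reservoir;
    foreach (i, line; lines)
    {
        if (i < k)
        {
            // The first k lines fill the reservoir.
            reservoir ~= line;
        }
        else
        {
            // Pick a slot in [0, i]. The new line replaces a reservoir entry
            // only when the slot falls inside the reservoir.
            immutable slot = uniform(size_t(0), i + 1, rng);
            if (slot < k) reservoir[slot] = line;
        }
    }
    return reservoir;
}

void main()
{
    import std.stdio : writeln;

    auto rng = Random(42);
    auto lines = ["a", "b", "c", "d", "e", "f"];
    writeln(reservoirSampleR(lines, 3, rng));
}
```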

reservoirSamplingCommand
void reservoirSamplingCommand(TsvSampleOptions cmdopt, OutputRange outputStream)

Reservoir sampling on the input stream.

reservoirSamplingViaHeap
void reservoirSamplingViaHeap(TsvSampleOptions cmdopt, OutputRange outputStream)

Reservoir sampling using a heap. Both weighted and unweighted random sampling are supported.
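A heap-based reservoir can be sketched as follows. The scoring rule used here, u^(1/w) for a uniform random u and weight w, is the well-known Efraimidis and Spirakis scheme; it is shown as an assumption and is not confirmed by this page to be what the function uses. The names `weightedReservoir` and `Entry` are hypothetical, and weights are assumed to be positive.

```d
import std.container.array : Array;
import std.container.binaryheap : heapify;
import std.math : pow;
import std.random : Random, uniform01;
import std.typecons : Tuple;

// Hypothetical pairing of a line with its random score.
alias Entry = Tuple!(double, "score", string, "line");

// Keep the k lines with the highest scores, using a min-heap of size k.
string[] weightedReservoir(string[] lines, double[] weights, size_t k, ref Random rng)
{
    // "a.score > b.score" makes front the smallest score, so the heap acts as
    // a min-heap. The initial size of 0 starts it empty.
    auto heap = heapify!"a.score > b.score"(Array!Entry(), 0);

    foreach (i, line; lines)
    {
        // Larger weights push the score toward 1, so heavy lines tend to stay.
        // Using weight 1.0 for every line gives unweighted sampling.
        immutable score = pow(uniform01(rng), 1.0 / weights[i]);
        if (heap.length < k)
            heap.insert(Entry(score, line));
        else if (k > 0 && score > heap.front.score)
            heap.replaceFront(Entry(score, line));
    }

    string[] result;
    while (!heap.empty)
    {
        result ~= heap.front.line;
        heap.removeFront();
    }
    return result;
}

void main()
{
    import std.stdio : writeln;

    auto rng = Random(42);
    auto lines = ["a", "b", "c", "d"];
    writeln(weightedReservoir(lines, [1.0, 2.0, 3.0, 4.0], 2, rng));
}
```

Results come out in ascending score order because the heap is drained from its smallest element.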

simpleRandomSamplingWithReplacement
void simpleRandomSamplingWithReplacement(TsvSampleOptions cmdopt, OutputRange outputStream)

Simple random sampling with replacement.
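Sampling with replacement draws each output line independently from the full input, so the same line can appear more than once and the sample can be larger than the input. A minimal sketch, with a hypothetical `sampleWithReplacement` function operating on an in-memory array:

```d
import std.random : Random, uniform;

// Hypothetical helper: draw n lines, each chosen uniformly from all lines.
string[] sampleWithReplacement(string[] lines, size_t n, ref Random rng)
{
    string[] picks;
    if (lines.length == 0) return picks;   // nothing to draw from
    foreach (_; 0 .. n)
    {
        // Every draw considers all lines again, so repeats are possible.
        picks ~= lines[uniform(size_t(0), lines.length, rng)];
    }
    return picks;
}

void main()
{
    import std.stdio : writeln;

    auto rng = Random(42);
    writeln(sampleWithReplacement(["a", "b", "c"], 5, rng));
}
```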

testTsvSample
void testTsvSample(string[] cmdArgs, string[][] expected)

Undocumented in source. Be warned that the author may not have intended to support it.

tsvSample
void tsvSample(TsvSampleOptions cmdopt, OutputRange outputStream)

Invokes the appropriate sampling routine based on the command line arguments.

Structs

FileData
struct FileData

A container and reader for data from a file or standard input.

InputLine
struct InputLine(HasRandomValue hasRandomValue)

An InputLine array is returned by identifyFileLines to represent each non-header line found in a FileData array. The 'data' element contains the line. A 'randomValue' element is included if random values are being generated.
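The idea of a struct whose members depend on a compile-time flag can be sketched with std.typecons.Flag and static if. The `Line` struct below is a hypothetical stand-in for illustration, not the actual InputLine definition; only the HasRandomValue alias is taken from this page.

```d
import std.typecons : Flag, No, Yes;

alias HasRandomValue = Flag!"hasRandomValue";

// Hypothetical record type: the randomValue member exists only when the
// flag is Yes, so callers that do not need it pay no storage cost.
struct Line(HasRandomValue hasRandomValue)
{
    string data;
    static if (hasRandomValue) double randomValue;
}

void main()
{
    auto plain = Line!(No.hasRandomValue)("a\tb");
    auto scored = Line!(Yes.hasRandomValue)("a\tb", 0.25);

    // The member's presence is decided at compile time.
    static assert(!__traits(hasMember, typeof(plain), "randomValue"));
    static assert(__traits(hasMember, typeof(scored), "randomValue"));
    assert(scored.randomValue == 0.25);
}
```

Using a named flag instead of a bare bool keeps call sites readable: `Yes.hasRandomValue` states its intent where `true` would not.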

TsvSampleOptions
struct TsvSampleOptions

Container for command line options.

Variables

helpText
auto helpText;

Undocumented in source.

helpTextVerbose
auto helpTextVerbose;

Undocumented in source.

Meta

License

Boost License 1.0 (http://boost.org/LICENSE_1_0.txt)