tsv_utils.tsv_sample

Command line tool for randomizing or sampling lines from input streams. Several sampling methods are available, including simple random sampling, weighted random sampling, Bernoulli sampling, and distinct sampling.

Copyright (c) 2017-2019, eBay Software Foundation. Initially written by Jon Degenhardt.

Members

Aliases

HasRandomValue
alias HasRandomValue = Flag!"hasRandomValue"

HasRandomValue is a boolean flag used at compile time by identifyFileLines to distinguish use cases needing random value assignments from those that don't.
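As a rough illustration of how such a compile-time flag can steer a template, the sketch below conditionally adds a field to a struct. `SketchLine` is a hypothetical stand-in, not the module's `InputLine`; only the `HasRandomValue` alias is taken from this listing.

```d
import std.typecons : Flag, Yes, No;

alias HasRandomValue = Flag!"hasRandomValue";

/// Hypothetical struct for illustration only.
struct SketchLine(HasRandomValue hasRandomValue)
{
    const(char)[] data;

    // The field exists only in the Yes.hasRandomValue instantiation.
    static if (hasRandomValue) double randomValue;
}

// Two distinct types, selected at compile time.
alias PlainLine = SketchLine!(No.hasRandomValue);
alias ScoredSketchLine = SketchLine!(Yes.hasRandomValue);
```

Compared with a bare `bool` template parameter, the named flag makes each instantiation self-describing at the call site.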

Functions

bernoulliSampling
void bernoulliSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

Bernoulli sampling of lines from the input stream.
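A minimal sketch of the general technique, not the module's implementation: each line is kept independently with a fixed probability. The function name `bernoulliSketch` and its parameters are assumptions made for illustration, and the sketch works on an in-memory array rather than an input stream.

```d
import std.random : Random, uniform01;

/// Keep each line independently with probability `prob` (hypothetical helper).
string[] bernoulliSketch(string[] lines, double prob, ref Random rng)
{
    string[] kept;
    foreach (line; lines)
    {
        // uniform01 returns a value in [0, 1), so prob == 1.0 keeps every line.
        if (uniform01(rng) < prob) kept ~= line;
    }
    return kept;
}
```

The output preserves input order, and its size is random rather than fixed.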

bernoulliSamplingCommand
void bernoulliSamplingCommand(TsvSampleOptions cmdopt, OutputRange outputStream)

Invokes the appropriate Bernoulli sampling routine based on the command line arguments.

bernoulliSkipSampling
void bernoulliSkipSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

bernoulliSkipSampling is an implementation of Bernoulli sampling using skips.
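The idea behind skip-based Bernoulli sampling can be sketched as follows: instead of drawing one random number per line, draw the number of lines to skip before the next selected line from a geometric distribution. This is a sketch of the general technique under that assumption; the name `bernoulliSkipSketch`, its parameters, and the exact skip formula are illustrative and may differ from what the module does.

```d
import std.math : floor, log;
import std.random : Random, uniform01;

/// Bernoulli sampling via geometric skips (hypothetical helper).
string[] bernoulliSkipSketch(string[] lines, double prob, ref Random rng)
{
    string[] kept;
    if (prob <= 0.0) return kept;
    if (prob >= 1.0) return lines.dup;

    immutable double logComplement = log(1.0 - prob);
    size_t i = 0;
    while (i < lines.length)
    {
        // u lies in (0, 1], so log(u) is finite and <= 0.
        immutable double u = 1.0 - uniform01(rng);
        // Number of rejected lines before the next accepted one.
        immutable double skip = floor(log(u) / logComplement);
        if (skip >= cast(double)(lines.length - i)) break;
        i += cast(size_t) skip;
        kept ~= lines[i];
        ++i;
    }
    return kept;
}
```

When the probability is small, this draws roughly one random number per selected line instead of one per input line.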

distinctSampling
void distinctSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

Sample a subset of lines by choosing a random set of values from key fields.
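One common way to get this behavior is to hash the key field and keep the line when the hash falls below a threshold, so that all lines sharing a key are kept or dropped together. The sketch below assumes tab-delimited lines and uses the runtime's `hashOf`; the function name, the bucket count, and the choice of hash are assumptions for illustration, not the module's implementation.

```d
import std.array : split;

/// Keep all lines whose key hashes into the selected range (hypothetical helper).
string[] distinctSketch(string[] lines, size_t keyIndex, double prob, size_t seed)
{
    enum size_t buckets = 1_000_000;
    immutable size_t threshold = cast(size_t)(prob * buckets);

    string[] kept;
    foreach (line; lines)
    {
        auto fields = line.split("\t");
        if (keyIndex >= fields.length) continue; // sketch: ignore short lines

        // The decision depends only on the key and the seed, never on the line.
        if (hashOf(fields[keyIndex], seed) % buckets < threshold) kept ~= line;
    }
    return kept;
}
```

Changing the seed selects a different random set of keys, while a fixed seed makes the selection repeatable.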

formatRandomValue
void formatRandomValue(OutputRange outputStream, double value)

Write a floating point random value to an output stream.

generateWeightedRandomValuesInorder
void generateWeightedRandomValuesInorder(TsvSampleOptions cmdopt, OutputRange outputStream)

Generates weighted random values for all input lines, preserving input order.

getFieldValue
T getFieldValue(C[] line, size_t fieldIndex, C delim, string filename, size_t lineNum)

Undocumented in source. Be warned that the author may not have intended to support it.

identifyFileLines
InputLine!hasRandomValue[] identifyFileLines(FileData[] fileData, TsvSampleOptions cmdopt, OutputRange outputStream)

identifyFileLines is used by algorithms that read all files into memory prior to processing. It does the initial processing of the file data.

main
int main(string[] cmdArgs)

Main program.

randomizeLinesCommand
void randomizeLinesCommand(TsvSampleOptions cmdopt, OutputRange outputStream)

This routine is invoked when all input lines are being randomized. It selects the appropriate function and template instantiation based on the command line arguments.

randomizeLinesViaShuffle
void randomizeLinesViaShuffle(TsvSampleOptions cmdopt, OutputRange outputStream)

Randomize all the lines in files or standard input using a shuffling algorithm.
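For an in-memory array, shuffling can be sketched with Phobos' `randomShuffle`. The wrapper name `shuffleSketch` is hypothetical, and the module's routine also handles file reading and headers, which this sketch omits.

```d
import std.random : Random, randomShuffle;

/// Return a shuffled copy of the lines (hypothetical helper).
string[] shuffleSketch(string[] lines, ref Random rng)
{
    auto copy = lines.dup;     // leave the caller's array untouched
    randomShuffle(copy, rng);  // in-place shuffle of the copy
    return copy;
}
```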

randomizeLinesViaSort
void randomizeLinesViaSort(TsvSampleOptions cmdopt, OutputRange outputStream)

Randomize all the lines in files or standard input using assigned random weights and sorting.
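The sort-based approach can be sketched as: attach a random value to each line, then sort by that value. The names `randomizeViaSortSketch` and `Entry`, and the choice of descending order, are assumptions made for illustration; weighted variants would compute the value from a per-line weight instead of a plain uniform draw.

```d
import std.algorithm : sort;
import std.random : Random, uniform01;
import std.typecons : Tuple;

/// Randomize line order by sorting on assigned random values (hypothetical helper).
string[] randomizeViaSortSketch(string[] lines, ref Random rng)
{
    alias Entry = Tuple!(double, "randomValue", string, "data");

    Entry[] entries;
    foreach (line; lines) entries ~= Entry(uniform01(rng), line);

    // Order lines by their random value, largest first.
    entries.sort!((a, b) => a.randomValue > b.randomValue);

    string[] result;
    foreach (e; entries) result ~= e.data;
    return result;
}
```

Unlike a plain shuffle, this keeps the assigned value next to each line, which is what makes it possible to print or reuse the random values.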

reservoirSamplingAlgorithmR
void reservoirSamplingAlgorithmR(TsvSampleOptions cmdopt, OutputRange outputStream)

Reservoir sampling via Algorithm R.
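Algorithm R in its textbook form can be sketched as below: fill the reservoir with the first k lines, then let line i replace a random slot with probability k / (i + 1). The function name `algorithmRSketch` and its array-based interface are assumptions for illustration; the module's routine reads from input streams.

```d
import std.random : Random, uniform;

/// Textbook Algorithm R over an in-memory array (hypothetical helper).
string[] algorithmRSketch(string[] lines, size_t sampleSize, ref Random rng)
{
    string[] reservoir;
    foreach (i, line; lines)
    {
        if (i < sampleSize)
        {
            reservoir ~= line; // fill phase
        }
        else
        {
            // j is uniform over [0, i]; the line survives only if j hits a slot.
            immutable size_t j = uniform(size_t(0), i + 1, rng);
            if (j < sampleSize) reservoir[j] = line;
        }
    }
    return reservoir;
}
```

Note that the reservoir is not in random order; if fewer lines than `sampleSize` are read, the input is returned unchanged.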

reservoirSamplingCommand
void reservoirSamplingCommand(TsvSampleOptions cmdopt, OutputRange outputStream)

Invokes the appropriate reservoir sampling routine based on the command line arguments.

reservoirSamplingViaHeap
void reservoirSamplingViaHeap(TsvSampleOptions cmdopt, OutputRange outputStream)

Reservoir sampling using a heap. Both weighted and unweighted random sampling are supported.
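A sketch of heap-based weighted reservoir sampling follows. It assumes the score `u^(1/w)` from the Efraimidis-Spirakis A-Res scheme and keeps the k highest-scoring lines in a min-heap; whether the module uses exactly this score is not stated in this listing. `ScoredLine` and `weightedReservoirSketch` are hypothetical names, and weights are assumed to be positive.

```d
import std.container.binaryheap : heapify;
import std.math : pow;
import std.random : Random, uniform01;

/// Hypothetical reservoir entry.
struct ScoredLine { double score; string data; }

/// Keep the `sampleSize` lines with the largest scores (hypothetical helper).
string[] weightedReservoirSketch(string[] lines, double[] weights,
                                 size_t sampleSize, ref Random rng)
{
    assert(lines.length == weights.length);
    string[] result;
    if (sampleSize == 0) return result;

    ScoredLine[] store;
    // "a.score > b.score" makes this a min-heap: front is the lowest score.
    auto reservoir = store.heapify!("a.score > b.score")(0);

    foreach (i, line; lines)
    {
        immutable double score = pow(uniform01(rng), 1.0 / weights[i]);
        if (reservoir.length < sampleSize)
            reservoir.insert(ScoredLine(score, line));
        else if (score > reservoir.front.score)
            reservoir.replaceFront(ScoredLine(score, line));
    }

    // Drain the heap; lines come out in ascending score order.
    while (!reservoir.empty)
    {
        result ~= reservoir.front.data;
        reservoir.removeFront();
    }
    return result;
}
```

Passing equal weights reduces this to unweighted sampling, since the ordering of `u^(1/w)` then matches the ordering of `u`.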

simpleRandomSamplingWithReplacement
void simpleRandomSamplingWithReplacement(TsvSampleOptions cmdopt, OutputRange outputStream)

Simple random sampling with replacement.
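Sampling with replacement over an in-memory array can be sketched as repeated uniform index draws, so the same line may appear more than once. The name `withReplacementSketch` and the `numDraws` parameter are assumptions for illustration.

```d
import std.random : Random, uniform;

/// Draw `numDraws` lines uniformly at random, allowing repeats (hypothetical helper).
string[] withReplacementSketch(string[] lines, size_t numDraws, ref Random rng)
{
    string[] drawn;
    if (lines.length == 0) return drawn; // nothing to draw from

    foreach (_; 0 .. numDraws)
    {
        // Each draw is independent; the upper bound is exclusive.
        drawn ~= lines[uniform(size_t(0), lines.length, rng)];
    }
    return drawn;
}
```

Because draws are independent, `numDraws` may exceed the number of input lines.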

testTsvSample
void testTsvSample(string[] cmdArgs, string[][] expected)

Undocumented in source. Be warned that the author may not have intended to support it.

tsvSample
void tsvSample(TsvSampleOptions cmdopt, OutputRange outputStream)

Invokes the appropriate sampling routine based on the command line arguments.

Static variables

rt_options
string[] rt_options;

Undocumented in source but is binding to C. You might be able to learn more by searching the web for its name.

Structs

FileData
struct FileData

A container and reader of data from a file or standard input.

InputLine
struct InputLine(HasRandomValue hasRandomValue)

An InputLine array is returned by identifyFileLines to represent each non-header line found in a FileData array. The 'data' element contains the line. A 'randomValue' element is included if random values are being generated.

TsvSampleOptions
struct TsvSampleOptions

Container for command line options and derived data.

Variables

helpText
auto helpText;

Undocumented in source.

helpTextVerbose
auto helpTextVerbose;

Undocumented in source.

Meta

License

Boost License 1.0 (http://boost.org/LICENSE_1_0.txt)