tsv_utils.tsv_sample

Command line tool for shuffling or sampling lines from input streams. Several methods are available, including weighted and unweighted shuffling, simple and weighted random sampling, sampling with replacement, Bernoulli sampling, and distinct sampling.

Copyright (c) 2017-2021, eBay Inc. Initially written by Jon Degenhardt

Members

Aliases

HasRandomValue
alias HasRandomValue = Flag!"hasRandomValue"

HasRandomValue is a boolean flag used at compile time by identifyInputLines to distinguish use cases needing random value assignments from those that don't.

Functions

bernoulliSampling
void bernoulliSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

Bernoulli sampling of lines from the input stream.
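As a minimal sketch of the idea (not the module's implementation), Bernoulli sampling makes an independent keep-or-drop decision for each line. The helper name `bernoulliSample` and its parameters are hypothetical and chosen for illustration; the real function takes its settings from `TsvSampleOptions`.

```d
import std.random : Random, uniform01;

/// Hypothetical helper: keep each line independently with probability `prob`.
string[] bernoulliSample(string[] lines, double prob, ref Random rng)
{
    string[] kept;
    foreach (line; lines)
    {
        // uniform01 returns a value in [0, 1), so prob = 1.0 keeps every line
        // and prob = 0.0 keeps none.
        if (uniform01(rng) < prob) kept ~= line;
    }
    return kept;
}

unittest
{
    auto rng = Random(42);
    auto lines = ["a", "b", "c", "d"];
    assert(bernoulliSample(lines, 1.0, rng) == lines);
    assert(bernoulliSample(lines, 0.0, rng).length == 0);
}
```

Because each decision is independent, the output size varies from run to run; only its expected value is `prob` times the number of input lines.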

bernoulliSamplingCommand
void bernoulliSamplingCommand(TsvSampleOptions cmdopt, OutputRange outputStream)

Bernoulli sampling command handler. Invokes the appropriate Bernoulli sampling routine based on the command line arguments.

bernoulliSkipSampling
void bernoulliSkipSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

bernoulliSkipSampling is an implementation of Bernoulli sampling using skips.
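One common way to implement Bernoulli sampling with skips is to draw the number of lines to pass over from a geometric distribution, so a random number is generated once per selected line rather than once per input line. The sketch below assumes that approach; `bernoulliSkipSample` and `nextSkipCount` are hypothetical names, and the module's own routine may differ in detail.

```d
import std.conv : to;
import std.math : log, log1p, trunc;
import std.random : Random, uniform01;

/// Hypothetical helper: Bernoulli sampling driven by geometric skip counts.
string[] bernoulliSkipSample(string[] lines, double prob, ref Random rng)
{
    assert(prob > 0.0 && prob < 1.0, "prob must be strictly between 0 and 1");
    immutable double logDiscardRate = log1p(-prob);  // log(1 - prob), negative

    // Number of lines to skip before the next selected line.
    size_t nextSkipCount()
    {
        // 1 - uniform01 lies in (0, 1], so log() is finite and <= 0.
        return trunc(log(1.0 - uniform01(rng)) / logDiscardRate).to!size_t;
    }

    string[] kept;
    size_t i = nextSkipCount();
    while (i < lines.length)
    {
        kept ~= lines[i];
        i += 1 + nextSkipCount();
    }
    return kept;
}

unittest
{
    import std.algorithm : all, canFind;
    auto rng = Random(42);
    auto lines = ["a", "b", "c", "d", "e", "f"];
    auto kept = bernoulliSkipSample(lines, 0.5, rng);
    assert(kept.length <= lines.length);
    assert(kept.all!(l => lines.canFind(l)));
}
```

The skip count `floor(log(u) / log(1 - prob))` with `u` uniform on (0, 1] follows a geometric distribution, which yields the same selection probabilities as testing every line individually.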

distinctSampling
void distinctSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

Sample lines by choosing a random set of distinct keys formed from one or more fields on each line.
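A hedged illustration of distinct sampling: hash the key field and keep a line only when the hash falls in a chosen residue class, so every line sharing a key is kept or dropped together. The function `distinctSample`, its parameters, and the use of the built-in `hashOf` are assumptions made for this sketch; the source does not say which hash function the module uses.

```d
import std.algorithm : splitter;
import std.range : drop;

/// Hypothetical helper: keep roughly one in `oneInN` distinct keys.
/// `keyField` is a zero-based index into the tab-separated fields.
string[] distinctSample(string[] lines, size_t keyField, size_t oneInN, size_t seed)
{
    assert(oneInN > 0, "oneInN must be positive");
    string[] kept;
    foreach (line; lines)
    {
        auto fields = line.splitter('\t').drop(keyField);
        if (fields.empty) continue;  // line has too few fields; skipped in this sketch
        // The decision depends only on the key, so equal keys share one outcome.
        if (hashOf(fields.front, seed) % oneInN == 0) kept ~= line;
    }
    return kept;
}

unittest
{
    import std.algorithm : count, startsWith;
    auto lines = ["k1\ta", "k2\tb", "k1\tc"];
    assert(distinctSample(lines, 0, 1, 7) == lines);
    auto some = distinctSample(lines, 0, 2, 7);
    immutable k1 = some.count!(l => l.startsWith("k1\t"));
    assert(k1 == 0 || k1 == 2);
}
```

Changing `seed` changes which keys are selected while preserving the all-or-nothing behavior per key.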

formatRandomValue
void formatRandomValue(OutputRange outputStream, double value)

Write a floating point random value to an output stream.

generateWeightedRandomValuesInorder
void generateWeightedRandomValuesInorder(TsvSampleOptions cmdopt, OutputRange outputStream)

Generate weighted random values for all input lines, preserving input order.

getFieldValue
T getFieldValue(C[] line, size_t fieldIndex, C delim, string filename, ulong lineNum)

Undocumented in source. Be warned that the author may not have intended to support it.

identifyInputLines
InputLine!hasRandomValue[] identifyInputLines(InputBlock[] inputBlocks, TsvSampleOptions cmdopt)

identifyInputLines is used by algorithms that read all files into memory prior to processing. It does the initial processing of the file data.

main
int main(string[] cmdArgs)

Main program.

randomSamplingCommand
void randomSamplingCommand(TsvSampleOptions cmdopt, OutputRange outputStream)

Random sampling command handler. Invokes the appropriate sampling routine based on the command line arguments.

randomizeLinesViaShuffle
void randomizeLinesViaShuffle(TsvSampleOptions cmdopt, OutputRange outputStream)

Shuffle (randomize) all input lines using a shuffling algorithm.

randomizeLinesViaSort
void randomizeLinesViaSort(TsvSampleOptions cmdopt, OutputRange outputStream)

Shuffle all input lines by assigning random weights and sorting.
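The approach can be sketched as follows: pair each line with a random value, then sort by that value. This is an illustrative, unweighted version only; `shuffleViaSort` is a hypothetical name, and the descending sort order is an arbitrary choice for the sketch.

```d
import std.algorithm : map, sort;
import std.array : array;
import std.random : Random, uniform01;
import std.typecons : Tuple, tuple;

/// Hypothetical helper: shuffle by sorting on per-line random values.
string[] shuffleViaSort(string[] lines, ref Random rng)
{
    Tuple!(double, string)[] keyed;
    foreach (line; lines) keyed ~= tuple(uniform01(rng), line);

    // Order by the random value; the lines come along with their values.
    keyed.sort!((a, b) => a[0] > b[0]);
    return keyed.map!(e => e[1]).array;
}

unittest
{
    import std.algorithm : equal;
    auto rng = Random(42);
    auto lines = ["a", "b", "c", "d"];
    auto shuffled = shuffleViaSort(lines, rng);
    // Same multiset of lines, possibly in a different order.
    assert(sort(shuffled.dup).equal(lines));
}
```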

readFileData
InputBlock[] readFileData(TsvSampleOptions cmdopt, OutputRange outputStream)

Read data from one or more files. This routine is used by algorithms needing to read all data into memory.

reservoirSamplingAlgorithmR
void reservoirSamplingAlgorithmR(TsvSampleOptions cmdopt, OutputRange outputStream)

Reservoir sampling via Algorithm R.
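For reference, a minimal sketch of the textbook Algorithm R: fill the reservoir with the first `sampleSize` lines, then let line `i` (zero-based) replace a random slot with probability `sampleSize / (i + 1)`. The name `reservoirSampleR` and its signature are hypothetical and not taken from the module.

```d
import std.random : Random, uniform;

/// Hypothetical helper: unweighted reservoir sampling, Algorithm R.
string[] reservoirSampleR(string[] lines, size_t sampleSize, ref Random rng)
{
    string[] reservoir;
    if (sampleSize == 0) return reservoir;

    foreach (i, line; lines)
    {
        if (i < sampleSize)
        {
            reservoir ~= line;  // fill phase
        }
        else
        {
            // j is uniform over [0, i]; it hits the reservoir with
            // probability sampleSize / (i + 1).
            immutable j = uniform(size_t(0), i + 1, rng);
            if (j < sampleSize) reservoir[j] = line;
        }
    }
    return reservoir;
}

unittest
{
    auto rng = Random(7);
    auto lines = ["a", "b", "c", "d", "e"];
    assert(reservoirSampleR(lines, 3, rng).length == 3);
    assert(reservoirSampleR(lines, 10, rng) == lines);
    assert(reservoirSampleR(lines, 0, rng).length == 0);
}
```

The reservoir holds a uniform random sample, but the order of its entries is not itself uniformly random.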

reservoirSamplingViaHeap
void reservoirSamplingViaHeap(TsvSampleOptions cmdopt, OutputRange outputStream)

Reservoir sampling using a heap. Both weighted and unweighted random sampling are supported.
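A rough sketch of the heap-based idea, shown for the unweighted case only: give every line a random score and use a min-heap to retain the `sampleSize` highest-scoring lines. `Entry` and `reservoirSampleViaHeap` are hypothetical names. A weighted variant would typically derive the score from the line's weight, for example `u ^^ (1.0 / w)` as in the Efraimidis-Spirakis scheme; that is not shown here, and the source does not state which formula the module uses.

```d
import std.container.binaryheap : heapify;
import std.random : Random, uniform01;
import std.typecons : Tuple;

alias Entry = Tuple!(double, "score", string, "line");

/// Hypothetical helper: keep the `sampleSize` lines with the highest scores.
string[] reservoirSampleViaHeap(string[] lines, size_t sampleSize, ref Random rng)
{
    if (sampleSize == 0) return null;

    auto store = new Entry[](sampleSize);
    // Min-heap on score: front is the lowest-scoring entry currently kept.
    auto heap = heapify!"a.score > b.score"(store, 0);

    foreach (line; lines)
    {
        immutable score = uniform01(rng);
        if (heap.length < sampleSize) heap.insert(Entry(score, line));
        else if (score > heap.front.score) heap.replaceFront(Entry(score, line));
    }

    // Drain the heap; entries come out in ascending score order.
    string[] result;
    while (!heap.empty)
    {
        result ~= heap.front.line;
        heap.removeFront();
    }
    return result;
}

unittest
{
    import std.algorithm : all, canFind;
    auto rng = Random(11);
    auto lines = ["a", "b", "c", "d", "e"];
    auto sample = reservoirSampleViaHeap(lines, 3, rng);
    assert(sample.length == 3);
    assert(sample.all!(l => lines.canFind(l)));
    assert(reservoirSampleViaHeap(lines, 10, rng).length == lines.length);
}
```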

shuffleCommand
void shuffleCommand(TsvSampleOptions cmdopt, OutputRange outputStream)

Shuffling command handler. Invokes the appropriate shuffle (line order randomization) routine based on the command line arguments.

simpleRandomSamplingWithReplacement
void simpleRandomSamplingWithReplacement(TsvSampleOptions cmdopt, OutputRange outputStream)

Simple random sampling with replacement.
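Sampling with replacement amounts to drawing a uniformly random line index once per output line, so the same line can appear more than once. A minimal sketch, with the hypothetical helper `sampleWithReplacement` standing in for the module's routine:

```d
import std.random : Random, uniform;

/// Hypothetical helper: draw `count` lines uniformly, allowing repeats.
string[] sampleWithReplacement(string[] lines, size_t count, ref Random rng)
{
    string[] picks;
    if (lines.length == 0) return picks;  // nothing to draw from

    foreach (_; 0 .. count)
    {
        // Each draw is independent of the previous ones.
        picks ~= lines[uniform(size_t(0), lines.length, rng)];
    }
    return picks;
}

unittest
{
    import std.algorithm : all, canFind;
    auto rng = Random(3);
    auto lines = ["a", "b", "c"];
    auto picks = sampleWithReplacement(lines, 10, rng);
    assert(picks.length == 10);  // may exceed the number of input lines
    assert(picks.all!(p => lines.canFind(p)));
}
```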

testTsvSample
void testTsvSample(string[] cmdArgs, string[][] expected)

Undocumented in source. Be warned that the author may not have intended to support it.

tsvSample
void tsvSample(TsvSampleOptions cmdopt, OutputRange outputStream)

Invokes the appropriate sampling routine based on the command line arguments.

Structs

InputBlock
struct InputBlock

A container holding data read from a file or standard input.

InputLine
struct InputLine(HasRandomValue hasRandomValue)

An InputLine array is returned by identifyInputLines to represent each non-header line found in an InputBlock array. The 'data' element contains the line. A 'randomValue' element is included if random values are being generated.
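The following sketch shows how a compile-time flag such as HasRandomValue can add or omit a struct member. The alias and the struct signature are taken from this page; the member types and the `static if` body are assumptions for illustration and may not match the actual definition.

```d
import std.typecons : Flag, Yes, No;

alias HasRandomValue = Flag!"hasRandomValue";

struct InputLine(HasRandomValue hasRandomValue)
{
    const(char)[] data;  // assumed type for the line contents

    // The member exists only in the Yes.hasRandomValue instantiation.
    static if (hasRandomValue) double randomValue;
}

// Compile-time checks that the two instantiations differ as described.
static assert(!__traits(hasMember, InputLine!(No.hasRandomValue), "randomValue"));
static assert(__traits(hasMember, InputLine!(Yes.hasRandomValue), "randomValue"));
```

With this pattern, use cases that do not need random values pay no per-line storage cost for them.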

TsvSampleOptions
struct TsvSampleOptions

Container for command line options and derived data.

Variables

helpText
auto helpText;

Undocumented in source.

helpTextVerbose
auto helpTextVerbose;

Undocumented in source.

rt_options
string[] rt_options;

Undocumented in source.

Meta