tsv_utils.tsv_sample

Command line tool for randomizing or sampling lines from input streams. Several sampling methods are available, including simple random sampling, weighted random sampling, Bernoulli sampling, and distinct sampling.

Copyright (c) 2017-2019, eBay Software Foundation. Initially written by Jon Degenhardt.

Members

Aliases

HasRandomValue
alias HasRandomValue = Flag!"hasRandomValue"

HasRandomValue is a boolean flag used at compile time by identifyFileLines to distinguish use cases needing random value assignments from those that don't.
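As a rough illustration of how a compile-time flag of this kind can shape a type, the sketch below declares the same alias and uses it as a template parameter. `TaggedLine` is a hypothetical struct invented for this example. It is not the module's `InputLine`, whose actual layout is not shown on this page.

```d
import std.typecons : Flag, Yes, No;

alias HasRandomValue = Flag!"hasRandomValue";

// Hypothetical struct for illustration only: the randomValue member
// exists only in the instantiation that asks for it.
struct TaggedLine(HasRandomValue hasRandomValue)
{
    string data;
    static if (hasRandomValue) double randomValue;
}

unittest
{
    alias Plain = TaggedLine!(No.hasRandomValue);
    alias Scored = TaggedLine!(Yes.hasRandomValue);

    // The choice is resolved at compile time, so the plain variant
    // carries no storage for a random value.
    static assert(!__traits(hasMember, Plain, "randomValue"));
    static assert(__traits(hasMember, Scored, "randomValue"));

    auto line = Scored("a\tb", 0.25);
    assert(line.randomValue == 0.25);
}
```

The block can be exercised with `dmd -unittest -main -run example.d`, where `example.d` is whatever file name the sketch is saved under.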

Functions

bernoulliSampling
void bernoulliSampling(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Bernoulli sampling of lines from the input stream.

bernoulliSamplingCommand
void bernoulliSamplingCommand(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Invokes the appropriate Bernoulli sampling routine based on the command line arguments.

bernoulliSkipSampling
void bernoulliSkipSampling(TsvSampleOptions cmdopt, OutputRange outputStream)

bernoulliSkipSampling is an implementation of Bernoulli sampling using skips.
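One common way to do Bernoulli sampling with skips is to draw the number of items to pass over from a geometric distribution instead of testing every item. The sketch below shows that general technique. The function name `bernoulliSkipIndices` and the inverse-transform formula are assumptions made for illustration, and the module's own routine may compute skips differently.

```d
import std.math : floor, log;
import std.random : Random, uniform01;

// Hypothetical helper: returns the indices chosen from n items when each
// item is kept independently with probability p (0 < p < 1).
size_t[] bernoulliSkipIndices(size_t n, double p, ref Random rng)
{
    assert(p > 0.0 && p < 1.0);
    size_t[] picked;
    immutable double logMiss = log(1.0 - p);
    size_t i = 0;
    while (i < n)
    {
        // uniform01 yields [0, 1), so u lies in (0, 1] and log(u) is finite.
        immutable double u = 1.0 - uniform01(rng);
        // Geometric skip count: P(skip == k) = (1 - p)^k * p.
        immutable double skip = floor(log(u) / logMiss);
        if (skip >= cast(double)(n - i)) break;  // skipped past the end
        i += cast(size_t) skip;
        picked ~= i;
        ++i;
    }
    return picked;
}

unittest
{
    auto rng = Random(42);
    auto picked = bernoulliSkipIndices(1000, 0.1, rng);
    foreach (k, idx; picked)
    {
        assert(idx < 1000);
        if (k > 0) assert(picked[k - 1] < idx);  // strictly increasing
    }
}
```

With this approach one random draw is needed per selected item, not per input item, which is the usual motivation for skip-based variants when the sampling probability is small.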

distinctSampling
void distinctSampling(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Sample a subset of lines by choosing a random set of values from key fields.
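The idea of sampling by key can be sketched as follows: a key is either in or out of the sample, so every line sharing that key gets the same decision. The example hashes the key with a seed and keeps it when the hash falls below a cutoff. The helper names, the use of the built-in `hashOf`, and the bucket arithmetic are all assumptions for illustration. The hash function and selection rule used by the module itself are not described on this page.

```d
import std.array : split;

// Hypothetical helper: decides deterministically whether a key is sampled.
bool keepKey(const(char)[] key, double prob, size_t seed)
{
    enum size_t buckets = 1_000_000;
    immutable size_t cutoff = cast(size_t)(prob * buckets);
    return hashOf(key, seed) % buckets < cutoff;
}

// Hypothetical helper: keeps every line whose key field was sampled.
string[] distinctSample(string[] lines, size_t keyIndex, double prob, size_t seed)
{
    string[] kept;
    foreach (line; lines)
    {
        auto fields = line.split('\t');
        if (keyIndex < fields.length && keepKey(fields[keyIndex], prob, seed))
            kept ~= line;
    }
    return kept;
}

unittest
{
    auto lines = ["u1\ta", "u2\tb", "u1\tc", "u3\td", "u2\te"];
    // Lines with the same key share one decision.
    auto kept = distinctSample(lines, 0, 0.5, 99);
    foreach (line; lines)
    {
        import std.algorithm : canFind;
        assert(kept.canFind(line) == keepKey(line.split('\t')[0], 0.5, 99));
    }
}
```

Changing the seed changes which keys are selected, while a fixed seed makes the selection repeatable across runs and across files.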

formatRandomValue
void formatRandomValue(auto ref OutputRange outputStream, double value)

Write a floating point random value to an output stream.

generateWeightedRandomValuesInorder
void generateWeightedRandomValuesInorder(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Generates weighted random values for all input lines, preserving input order.

identifyFileLines
InputLine!hasRandomValue[] identifyFileLines(const ref FileData[] fileData, TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

identifyFileLines is used by algorithms that read all files into memory prior to processing. It does the initial processing of the file data.

main
int main(string[] cmdArgs)

Main program.

randomizeLinesCommand
void randomizeLinesCommand(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

This routine is invoked when all input lines are being randomized. It selects the appropriate function and template instantiation based on the command line arguments.

randomizeLinesViaShuffle
void randomizeLinesViaShuffle(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Randomize all the lines in files or standard input using a shuffling algorithm.

randomizeLinesViaSort
void randomizeLinesViaSort(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Randomize all the lines in files or standard input using assigned random weights and sorting.
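A minimal sketch of the assign-and-sort idea: give each line a random value, then order the lines by that value. The function name `shuffleViaSort` is hypothetical, and the uniform random values used here stand in for whatever weights the module actually assigns.

```d
import std.algorithm : map, sort;
import std.array : array;
import std.random : Random, uniform01;
import std.typecons : Tuple, tuple;

// Hypothetical helper: randomizes line order by sorting on random values.
string[] shuffleViaSort(string[] lines, ref Random rng)
{
    Tuple!(double, string)[] tagged;
    foreach (line; lines)
        tagged ~= tuple(uniform01(rng), line);

    // Order by the random value, largest first.
    tagged.sort!((a, b) => a[0] > b[0]);
    return tagged.map!(t => t[1]).array;
}

unittest
{
    auto rng = Random(1);
    auto lines = ["a", "b", "c", "d"];
    auto shuffled = shuffleViaSort(lines, rng);

    // The result is a permutation of the input.
    assert(shuffled.dup.sort.array == lines);
}
```

Because each line carries its own value, the same scheme extends naturally to cases where the value depends on a per-line weight.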

reservoirSamplingAlgorithmR
void reservoirSamplingAlgorithmR(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Reservoir sampling via Algorithm R.
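For reference, the textbook form of Algorithm R can be sketched as below: the first k items fill the reservoir, and item i (zero-based) then replaces a random slot with probability k / (i + 1). The function name `reservoirR` and the array-based interface are assumptions for illustration. The module's routine works from `TsvSampleOptions` and an output range and may differ in detail.

```d
import std.random : Random, uniform;

// Hypothetical helper: picks up to k lines uniformly at random in one pass.
string[] reservoirR(string[] lines, size_t k, ref Random rng)
{
    string[] reservoir;
    reservoir.reserve(k);
    foreach (i, line; lines)
    {
        if (i < k)
        {
            reservoir ~= line;  // fill phase
        }
        else
        {
            // j is uniform over [0, i]; it lands in the reservoir
            // with probability k / (i + 1).
            immutable size_t j = uniform(size_t(0), i + 1, rng);
            if (j < k) reservoir[j] = line;
        }
    }
    return reservoir;
}

unittest
{
    import std.algorithm : canFind;

    auto rng = Random(5);
    auto lines = ["a", "b", "c", "d", "e", "f"];
    auto picked = reservoirR(lines, 3, rng);
    assert(picked.length == 3);
    foreach (p; picked) assert(lines.canFind(p));
}
```

The input length does not need to be known in advance, which is why this style of algorithm suits streamed input.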

reservoirSamplingCommand
void reservoirSamplingCommand(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Invokes the appropriate reservoir sampling routine based on the command line arguments.

reservoirSamplingViaHeap
void reservoirSamplingViaHeap(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Reservoir sampling using a heap. Both weighted and unweighted random sampling are supported.

simpleRandomSamplingWithReplacement
void simpleRandomSamplingWithReplacement(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Simple random sampling with replacement.
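Sampling with replacement means each draw is made from the full set of lines, so the same line can appear more than once. A minimal sketch, assuming the lines are already in memory. The name `sampleWithReplacement` and its parameters are hypothetical.

```d
import std.random : Random, uniform;

// Hypothetical helper: draws `count` lines, each uniformly from all lines.
string[] sampleWithReplacement(string[] lines, size_t count, ref Random rng)
{
    string[] picks;
    if (lines.length == 0) return picks;  // nothing to draw from
    foreach (_; 0 .. count)
        picks ~= lines[uniform(size_t(0), lines.length, rng)];
    return picks;
}

unittest
{
    import std.algorithm : canFind;

    auto rng = Random(8);
    auto lines = ["a", "b", "c"];
    // More draws than lines is fine: repeats are expected.
    auto picks = sampleWithReplacement(lines, 10, rng);
    assert(picks.length == 10);
    foreach (p; picks) assert(lines.canFind(p));
}
```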

tsvSample
void tsvSample(TsvSampleOptions cmdopt, auto ref OutputRange outputStream)

Invokes the appropriate sampling routine based on the command line arguments.

Structs

FileData
struct FileData

A container and reader of data from a file or standard input.

InputLine
struct InputLine(HasRandomValue hasRandomValue)

An InputLine array is returned by identifyFileLines to represent each non-header line found in a FileData array. The 'data' element contains the line. A 'randomValue' element is included if random values are being generated.

TsvSampleOptions
struct TsvSampleOptions

Container for command line options and derived data.

Meta