tsv_utils.tsv_split

Command line tool for splitting a file (or files) into multiple output files. Several methods for splitting are available, including splitting by line count, splitting by random assignment, and splitting by random assignment based on key fields.

Copyright (c) 2020-2021, eBay Inc. Initially written by Jon Degenhardt

Members

Functions

main
int main(string[] cmdArgs)

Main program.

rlimitCurrOpenFilesLimit
uint rlimitCurrOpenFilesLimit()

Get the rlimit current number of open files the process is allowed.
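As an illustration of what "rlimit current" refers to, the following is a minimal sketch that reads the soft limit on open file descriptors on a POSIX system. It is not the module's implementation: the helper name currentOpenFilesLimit and its ulong return type are assumptions made for this example (the documented function returns uint). The getrlimit call, the rlimit struct, and RLIMIT_NOFILE come from druntime's POSIX bindings.

```d
import core.sys.posix.sys.resource : getrlimit, rlimit, RLIMIT_NOFILE;
import std.exception : errnoEnforce;

/// Hypothetical helper (not part of tsv_split): returns the soft limit
/// on the number of file descriptors this process may have open.
ulong currentOpenFilesLimit()
{
    rlimit limits;
    // getrlimit returns 0 on success and sets errno on failure.
    errnoEnforce(getrlimit(RLIMIT_NOFILE, &limits) == 0, "getrlimit failed");
    return limits.rlim_cur;   // rlim_cur is the soft (current) limit
}

void main()
{
    import std.stdio : writeln;
    writeln("soft open-files limit: ", currentOpenFilesLimit());
}
```

A splitter that keeps many output files open at once would plausibly consult this value to decide how many files it can hold open simultaneously.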

splitByLineCount
void splitByLineCount(TsvSplitOptions cmdopt, size_t readBufferSize)

Write input lines to multiple files, splitting based on line count.
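The idea of splitting by line count can be sketched as follows. This is an illustrative sketch only, not the module's code: the function name splitByLineCountSketch, its parameters, and the output file naming scheme are all assumptions. The documented function takes a readBufferSize argument, which suggests it reads input in buffered blocks; the sketch uses byLine for simplicity.

```d
import std.format : format;
import std.stdio : File, KeepTerminator;

/// Hypothetical sketch: copy lines from `input` into numbered files
/// holding at most `linesPerFile` lines each (assumed to be > 0).
/// Returns the number of output files written.
size_t splitByLineCountSketch(File input, size_t linesPerFile, string prefix)
{
    size_t fileCount = 0;
    size_t linesInCurrent = 0;
    File output;

    foreach (line; input.byLine(KeepTerminator.yes))
    {
        // Start a new output file for the first line and whenever the
        // current file is full. Reassigning `output` closes the old file.
        if (!output.isOpen || linesInCurrent == linesPerFile)
        {
            output = File(format("%s%d.tsv", prefix, fileCount), "w");
            fileCount++;
            linesInCurrent = 0;
        }
        output.write(line);
        linesInCurrent++;
    }
    return fileCount;
}

void main(string[] args)
{
    import std.stdio : stdin, writeln;
    // Assumed usage: read standard input, 1000 lines per output file.
    writeln(splitByLineCountSketch(stdin, 1000, "part_"), " files written");
}
```

Only one output file is open at a time in this approach, so the open-files limit is not a concern for line-count splitting.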

splitLinesByKey
void splitLinesByKey(TsvSplitOptions cmdopt, SplitOutputFiles outputFiles)

Write input lines to multiple output files using fields as a random selection key.
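One common way to use fields as a random selection key is to hash the key and take the result modulo the number of output files, so that all lines sharing a key land in the same file. The sketch below shows that technique under stated assumptions: the function name fileIndexForKey, the single zero-based key field, and the use of druntime's hashOf are choices made for this example and are not taken from the module's source.

```d
import std.algorithm : splitter;
import std.range : drop;

/// Hypothetical sketch: choose an output file index for a TSV line by
/// hashing one key field (zero-based `keyFieldIndex`).
/// Lines with the same key and seed always map to the same index.
size_t fileIndexForKey(const(char)[] line, size_t keyFieldIndex,
                       size_t numFiles, size_t seed = 0)
{
    auto fields = line.splitter('\t').drop(keyFieldIndex);
    // Treat a missing field as an empty key rather than failing.
    const(char)[] key = fields.empty ? "" : fields.front;
    return hashOf(key, seed) % numFiles;
}

void main()
{
    import std.stdio : writeln;
    // Both lines have key "k1" in field 1, so the indices match.
    writeln(fileIndexForKey("x\tk1\t1", 1, 4));
    writeln(fileIndexForKey("y\tk1\t2", 1, 4));
}
```

Changing the seed changes the assignment of keys to files while keeping the grouping property intact.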

splitLinesRandomly
void splitLinesRandomly(TsvSplitOptions cmdopt, SplitOutputFiles outputFiles)

Write input lines to multiple files, randomly selecting an output file for each line.
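Random assignment can be sketched as drawing a uniformly distributed file index for each line. The example below is a simplified in-memory illustration, not the module's implementation: the function name splitRandomlySketch, the use of string arrays in place of output files, and the explicit seed parameter are assumptions for this example.

```d
import std.random : Random, uniform;

/// Hypothetical sketch: distribute `lines` across `numFiles` buckets,
/// picking a bucket uniformly at random for each line.
/// A fixed seed makes the assignment reproducible.
string[][] splitRandomlySketch(string[] lines, size_t numFiles, uint seed)
{
    auto rng = Random(seed);
    auto buckets = new string[][](numFiles);
    foreach (line; lines)
    {
        // uniform's default interval is half-open: [0, numFiles).
        buckets[uniform(size_t(0), numFiles, rng)] ~= line;
    }
    return buckets;
}

void main()
{
    import std.stdio : writeln;
    writeln(splitRandomlySketch(["l1", "l2", "l3", "l4", "l5"], 3, 42));
}
```

Unlike key-based splitting, two identical lines may end up in different buckets here, and bucket sizes are only approximately equal.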

tsvSplit
void tsvSplit(TsvSplitOptions cmdopt)

Invokes the proper split routine based on the command line arguments.

Static variables

rt_options
string[] rt_options;

Undocumented in source. Declared with C linkage. rt_options is the symbol the D runtime reads to pick up runtime configuration options embedded in the executable.
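For context, the D runtime lets a program embed runtime configuration options by defining a C-linkage rt_options array. The following sketch shows the general form of such a declaration. The option string used here is an assumption chosen for illustration and is not the value this module sets.

```d
// Embedding D runtime options in the executable.
// The value below is illustrative only; "gcopt=initReserve:16" asks the
// garbage collector to reserve 16 MB of memory at startup.
extern(C) __gshared string[] rt_options = ["gcopt=initReserve:16"];

void main()
{
    import std.stdio : writeln;
    writeln("embedded runtime options: ", rt_options);
}
```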

Structs

SplitOutputFiles
struct SplitOutputFiles

A SplitOutputFiles struct holds a collection of output files.

TsvSplitOptions
struct TsvSplitOptions

Container for command line options and derived data.

Variables

helpText
auto helpText;

Undocumented in source.

helpTextVerbose
auto helpTextVerbose;

Undocumented in source.

Meta

License

Boost License 1.0 (http://boost.org/LICENSE_1_0.txt)