tsv_utils.tsv_summarize

Command line tool that reads TSV files and summarizes field values associated with equivalent keys.

Copyright (c) 2016-2021, eBay Inc. Initially written by Jon Degenhardt

Members

Classes

CountOperator
class CountOperator

CountOperator counts the number of occurrences of each unique key, or the number of input lines if there is no unique key.

FirstOperator
class FirstOperator

FirstOperator outputs the first value found for the field.

KeySummarizerBase
class KeySummarizerBase(OutputRange)

KeySummarizerBase does work shared by the single key and multi-key summarizers.

LastOperator
class LastOperator

LastOperator outputs the last value found for the field.

MadOperator
class MadOperator

MadOperator produces the median absolute deviation from the median. This is a numeric operation.
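
The calculation can be sketched as follows. This is a minimal illustration, not the module's implementation: the names `medianOf` and `medianAbsoluteDeviation` are hypothetical, and it assumes the even-length median is the average of the two middle values.

```d
import std.algorithm : map, sort;
import std.array : array;
import std.math : abs;

/// Median of a non-empty array. Sorts a copy; averages the two middle
/// values when the length is even (an assumption of this sketch).
double medianOf(const(double)[] values)
{
    auto sorted = values.dup;
    sorted.sort();
    immutable n = sorted.length;
    return (n % 2 == 1) ? sorted[n / 2]
                        : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
}

/// Median of the absolute deviations from the median (hypothetical helper).
double medianAbsoluteDeviation(const(double)[] values)
{
    immutable m = medianOf(values);
    auto deviations = values.map!(v => abs(v - m)).array;
    return medianOf(deviations);
}

unittest
{
    // Median is 2; absolute deviations are 1,1,0,0,2,4,7; their median is 1.
    assert(medianAbsoluteDeviation([1.0, 1, 2, 2, 4, 6, 9]) == 1.0);
}
```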

MaxOperator
class MaxOperator

MaxOperator outputs the maximum value for the field. This is a numeric operator.

MeanOperator
class MeanOperator

MeanOperator produces the mean (average) of all the values. This is a numeric operator.

MedianOperator
class MedianOperator

MedianOperator produces the median of all the values. This is a numeric operator.

MinOperator
class MinOperator

MinOperator outputs the minimum value for the field. This is a numeric operator.

MissingCountOperator
class MissingCountOperator

MissingCountOperator generates the number of missing values. This overrides the global missingFieldsPolicy.

MissingFieldPolicy
class MissingFieldPolicy

This class describes processing behavior when a missing value is encountered.

ModeCountOperator
class ModeCountOperator

ModeCountOperator outputs the count of the most frequent value seen.

ModeOperator
class ModeOperator

ModeOperator outputs the most frequent value seen. In the event of a tie, the first value seen is produced.
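
One reading of the tie-break rule is sketched below: count every value, then scan the values in input order and keep the first one that reaches the highest count. The names `ModeResult` and `modeOf` are hypothetical and are not part of the module. The count field corresponds to what ModeCountOperator reports.

```d
/// Hypothetical result type: the mode and how often it occurred.
struct ModeResult
{
    string value;
    size_t count;
}

/// Returns the most frequent value; on a tie, the value seen first wins.
ModeResult modeOf(string[] values)
{
    // First pass: count occurrences of each distinct text value.
    size_t[string] counts;
    foreach (v; values)
    {
        if (auto p = v in counts)
            ++*p;
        else
            counts[v] = 1;
    }

    // Second pass, in input order: a strictly greater count is required
    // to replace the current best, so earlier values win ties.
    ModeResult result;
    foreach (v; values)
    {
        if (counts[v] > result.count)
            result = ModeResult(v, counts[v]);
    }
    return result;
}

unittest
{
    // "a" and "b" both occur twice; "a" was seen first.
    assert(modeOf(["a", "b", "b", "a", "c"]) == ModeResult("a", 2));
}
```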

MultiKeySummarizer
class MultiKeySummarizer(OutputRange)

This Summarizer is for the case where the unique key is based on multiple fields.

NoKeySummarizer
class NoKeySummarizer(OutputRange)

The NoKeySummarizer is used when summarizing values across the entire input.

NotMissingCountOperator
class NotMissingCountOperator

NotMissingCountOperator generates the number of not-missing values. This overrides the global missingFieldsPolicy.

OneKeySummarizer
class OneKeySummarizer(OutputRange)

This Summarizer is for the case where the unique key is based on exactly one field.

QuantileOperator
class QuantileOperator

QuantileOperator produces the value representing the data at a cumulative probability. This is a numeric operation.

RangeOperator
class RangeOperator

RangeOperator outputs the difference between the minimum and maximum values.

RetainOperator
class RetainOperator

RetainOperator retains the first occurrence of a field, without changing the header.

SharedFieldValues
class SharedFieldValues

Undocumented in source.

SingleFieldCalculator
class SingleFieldCalculator

SingleFieldCalculator is a base class for the common case of calculators using a single field. Derived classes implement processNextField() rather than processNextLine().

SingleFieldOperator
class SingleFieldOperator

SingleFieldOperator is a base class for single field operators, the most common Operator. Derived classes implement makeCalculator and the Calculator class it returns.

StDevOperator
class StDevOperator

StDevOperator generates the standard deviation of the field's values. This is a numeric operator.

SumOperator
class SumOperator

SumOperator produces the sum of all the values. This is a numeric operator.

SummarizerBase
class SummarizerBase(OutputRange)

SummarizerBase performs work shared by all summarizers, nearly everything except the handling of unique keys.

UniqueCountOperator
class UniqueCountOperator

UniqueCountOperator generates the number of unique values. Unique values are based on exact text match calculation, not a numeric comparison.
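
The effect of text matching can be shown with a small sketch. The helper name `uniqueCount` is hypothetical and this is not the module's code. Values that are numerically equal but spelled differently count as distinct.

```d
/// Counts distinct values by exact text comparison (hypothetical helper).
size_t uniqueCount(string[] values)
{
    bool[string] seen;
    foreach (v; values)
        seen[v] = true;
    return seen.length;
}

unittest
{
    // "1", "1.0" and "01" are numerically equal but textually distinct.
    assert(uniqueCount(["1", "1.0", "01", "1"]) == 3);
}
```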

UniqueKeyValuesLists
class UniqueKeyValuesLists

Undocumented in source.

UniqueValuesOperator
class UniqueValuesOperator

UniqueValuesOperator outputs each unique value delimited by an alternate delimiter character. Values are output in the order seen.
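
A minimal sketch of order-preserving de-duplication follows. The function name `uniqueValues` and its `delimiter` parameter are hypothetical; the delimiter the tool actually uses is not stated here, so "|" in the example is only an illustration.

```d
import std.array : join;

/// Joins the distinct values in first-seen order (hypothetical helper).
string uniqueValues(string[] values, string delimiter)
{
    bool[string] seen;
    string[] ordered;
    foreach (v; values)
    {
        if (v !in seen)
        {
            seen[v] = true;
            ordered ~= v;   // keep the order of first appearance
        }
    }
    return ordered.join(delimiter);
}

unittest
{
    assert(uniqueValues(["b", "a", "b", "c", "a"], "|") == "b|a|c");
}
```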

ValuesOperator
class ValuesOperator

ValuesOperator outputs each value delimited by an alternate delimiter character.

VarianceOperator
class VarianceOperator

VarianceOperator generates the variance of the field's values. This is a numeric operator.

ZeroFieldCalculator
class ZeroFieldCalculator

ZeroFieldCalculator is a base class for operators that don't use fields as input, in particular the Count operator. It is a companion to the ZeroFieldOperator class.

ZeroFieldOperator
class ZeroFieldOperator

ZeroFieldOperator is a base class for operators that take no input. The main use case is the CountOperator, which counts the occurrences of each unique key. Other uses are possible, for example, weighted random number assignment.

Functions

fieldHeaderFromIndex
string fieldHeaderFromIndex(size_t fieldIndex)

The default field header. This is used when the input doesn't have field headers, but field headers are used in the output. The default is "fieldN", where N is the 1-based field number.
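
The naming rule can be illustrated with a short sketch. The function below is a hypothetical stand-in named `defaultFieldHeader`, not the module's implementation, and it assumes `fieldIndex` is a zero-based index.

```d
import std.conv : to;

/// Hypothetical stand-in: builds "fieldN" from a zero-based index (assumed).
string defaultFieldHeader(size_t fieldIndex)
{
    return "field" ~ (fieldIndex + 1).to!string;
}

unittest
{
    assert(defaultFieldHeader(0) == "field1");
    assert(defaultFieldHeader(4) == "field5");
}
```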

main
int main(string[] cmdArgs)

Undocumented in source. Be warned that the author may not have intended to support it.

summaryHeaderFromFieldHeader
string summaryHeaderFromFieldHeader(string fieldHeader, string operationName)

Produce a summary header from a field header.

testSingleFieldOperator
void testSingleFieldOperator(char[][][] splitFile, size_t fieldIndex, string headerSuffix, char[][] expectedValues, MissingFieldPolicy missingPolicy)

A helper for SingleFieldOperator unit tests.

testSingleFieldOperatorBase
void testSingleFieldOperatorBase(char[][][] splitFile, size_t fieldIndex, string headerSuffix, char[][] expectedValues, MissingFieldPolicy missingPolicy, T extraOpInitArgs)

Undocumented in source. Be warned that the author may not have intended to support it.

testSummarizer
void testSummarizer(string[] cmdArgs, string[][] file, string[][] expected)

Undocumented in source. Be warned that the author may not have intended to support it.

testZeroFieldOperator
void testZeroFieldOperator(char[][][] splitFile, string defaultHeader, char[][] expectedValues)

Undocumented in source. Be warned that the author may not have intended to support it.

tsvSummarize
void tsvSummarize(TsvSummarizeOptions cmdopt)

tsvSummarize does the primary work of the tsv-summarize program.

writeDataFile
void writeDataFile(string filepath, string[][] fileData, string delimiter)

Undocumented in source. Be warned that the author may not have intended to support it.

Interfaces

Calculator
interface Calculator

Calculators are responsible for the calculation of a single computation. They process each line and produce the final value when all processing is finished.
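
The per-line accumulate-then-finish pattern described above might look like the following sketch. Everything here is hypothetical: `SimpleCalculator`, `SimpleSumCalculator`, and the method signatures are simplified illustrations and do not reproduce the module's Calculator interface.

```d
import std.conv : to;

/// Hypothetical, simplified calculator contract for illustration only.
interface SimpleCalculator
{
    /// Called once per input line with the line's split fields.
    void processNextLine(const char[][] fields);

    /// Called after all lines are processed to obtain the final value.
    double calculate();
}

/// Sums one numeric field across all lines (hypothetical example).
final class SimpleSumCalculator : SimpleCalculator
{
    private size_t fieldIndex;
    private double sum = 0.0;

    this(size_t fieldIndex)
    {
        this.fieldIndex = fieldIndex;
    }

    void processNextLine(const char[][] fields)
    {
        sum += fields[fieldIndex].to!double;
    }

    double calculate()
    {
        return sum;
    }
}

unittest
{
    SimpleCalculator calc = new SimpleSumCalculator(1);
    calc.processNextLine(["a", "1.5"]);
    calc.processNextLine(["b", "2.5"]);
    assert(calc.calculate() == 4.0);
}
```

Keeping the state inside the calculator lets the caller own file handling and line splitting, which matches the division of work described for Summarizer.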

Operator
interface Operator

An Operator represents a summary calculation specified on the command line, e.g. '--mean 5'.

Summarizer
interface Summarizer(OutputRange)

A Summarizer object maintains the state of the summarization and performs basic processing. Handling of files and input lines is left to the caller.

Static variables

rt_options
string[] rt_options;

Undocumented in source. The symbol is declared with C linkage; the D runtime looks for a variable of this name to read embedded runtime configuration options.

Structs

SummarizerPrintOptions
struct SummarizerPrintOptions

SummarizerPrintOptions holds printing options for Summarizers and Calculators. Typically specified with command line options, it is separated out for modularity.

TsvSummarizeOptions
struct TsvSummarizeOptions

Container for command line options and their processing. The processArgs method is used to process the command line.

Variables

helpText
auto helpText;

Undocumented in source.

helpTextVerbose
auto helpTextVerbose;

Undocumented in source.

Meta

License

Boost License 1.0 (http://boost.org/LICENSE_1_0.txt)