Command line tool that reads TSV files and summarizes field values associated with equivalent keys.
Copyright (c) 2016-2021, eBay Inc. Initially written by Jon Degenhardt.
CountOperator counts the number of occurrences of each unique key, or the number of input lines if there is no unique key.
FirstOperator outputs the first value found for the field.
KeySummarizerBase does work shared by the single key and multi-key summarizers.
LastOperator outputs the last value found for the field.
MadOperator produces the median absolute deviation from the median. This is a numeric operation.
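As an illustration of the statistic named above, here is a minimal Python sketch of the median absolute deviation from the median. The function name is hypothetical, and no scaling constant is applied; this text does not say whether the tool scales the result.

```python
from statistics import median


def median_absolute_deviation(values):
    """Return the median of absolute deviations from the median.

    Illustrative only: the name and the unscaled result are assumptions,
    not taken from the tsv-summarize source.
    """
    center = median(values)
    deviations = [abs(v - center) for v in values]
    return median(deviations)


# The median of these values is 3, so the deviations are 2, 1, 0, 1, 97.
print(median_absolute_deviation([1, 2, 3, 4, 100]))
```

Because it is built from medians, the result is barely affected by the single outlier in the sample input.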
MaxOperator outputs the maximum value for the field. This is a numeric operator.
MeanOperator produces the mean (average) of all the values. This is a numeric operator.
MedianOperator produces the median of all the values. This is a numeric operator.
MinOperator outputs the minimum value for the field. This is a numeric operator.
MissingCountOperator generates the number of missing values. This overrides the global missingFieldsPolicy.
This class describes processing behavior when a missing value is encountered.
ModeCountOperator outputs the count of the most frequent value seen.
ModeOperator outputs the most frequent value seen. In the event of a tie, the first value seen is produced.
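The tie-breaking rule for the mode can be sketched in Python as follows. This is an assumption-laden illustration, not the tool's code: the function name is hypothetical, and it relies on Python dictionaries preserving insertion order.

```python
def mode_and_count(values):
    """Return (most frequent value, its count); ties go to the first value seen.

    Hypothetical helper for illustration; not part of tsv-summarize.
    """
    counts = {}
    for v in values:
        counts[v] = counts.get(v, 0) + 1

    best_value, best_count = None, 0
    # Dict iteration follows first-seen order, and the strict '>' keeps
    # the earliest value when two values share the highest count.
    for v, c in counts.items():
        if c > best_count:
            best_value, best_count = v, c
    return best_value, best_count


# "b" and "a" both occur twice; "b" was seen first.
print(mode_and_count(["b", "a", "a", "b", "c"]))
```

The same pass yields both the value (as for ModeOperator) and its count (as for ModeCountOperator).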
This Summarizer is for the case where the unique key is based on multiple fields.
The NoKeySummarizer is used when summarizing values across the entire input.
NotMissingCountOperator generates the number of not-missing values. This overrides the global missingFieldsPolicy.
This Summarizer is for the case where the unique key is based on exactly one field.
QuantileOperator produces the value representing the data at a cumulative probability. This is a numeric operation.
RangeOperator outputs the difference between the minimum and maximum values.
RetainOperator retains the first occurrence of a field, without changing the header.
SingleFieldCalculator is a base class for the common case of calculators using a single field. Derived classes implement processNextField() rather than processNextLine().
SingleFieldOperator is a base class for single field operators, the most common Operator. Derived classes implement makeCalculator and the Calculator class it returns.
Generates the standard deviation of the field's values. This is a numeric operator.
SumOperator produces the sum of all the values. This is a numeric operator.
SummarizerBase performs work shared by all summarizers: nearly everything except the handling of unique keys.
UniqueCountOperator generates the number of unique values. Uniqueness is determined by exact text match, not by numeric comparison.
UniqueValuesOperator outputs each unique value delimited by an alternate delimiter character. Values are output in the order seen.
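A short Python sketch of the two unique-value behaviors described above: counting by exact text match, and emitting unique values in first-seen order. The function names and the "|" default delimiter are assumptions made for the example.

```python
def unique_count(values):
    """Count distinct values by exact text match (no numeric parsing)."""
    return len(set(values))


def unique_values(values, delimiter="|"):
    """Join distinct values in first-seen order.

    The '|' default is an assumption for this sketch.
    """
    # dict.fromkeys keeps the first occurrence of each key, in order.
    return delimiter.join(dict.fromkeys(values))


# "1", "1.0" and "01" are numerically equal but are distinct as text.
print(unique_count(["1", "1.0", "01"]))
print(unique_values(["b", "a", "b", "1", "1.0"]))
```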
ValuesOperator outputs each value delimited by an alternate delimiter character.
Generates the variance of the field's values. This is a numeric operator.
ZeroFieldCalculator is a base class for operators that don't use fields as input, in particular the Count operator. It is a companion to the ZeroFieldOperator class.
ZeroFieldOperator is a base class for operators that take no input. The main use case is the CountOperator, which counts the occurrences of each unique key. Other uses are possible, for example, weighted random number assignment.
The default field header. This is used when the input doesn't have field headers, but field headers are used in the output. The default is "fieldN", where N is the 1-upped field number.
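The naming rule can be sketched in Python as below. The function name and the zero-based index parameter are hypothetical; the only detail taken from the description is the "fieldN" pattern with field numbers starting at 1.

```python
def default_field_header(zero_based_index):
    """Return 'fieldN', where N is the field number counted from 1.

    Hypothetical helper; the zero-based parameter is an assumption.
    """
    return "field" + str(zero_based_index + 1)


# Headers for the first three fields of a headerless input.
print([default_field_header(i) for i in range(3)])
```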
Produce a summary header from a field header.
A helper for SingleFieldOperator unit tests.
tsvSummarize does the primary work of the tsv-summarize program.
Calculators are responsible for the calculation of a single computation. They process each line and produce the final value when all processing is finished.
An Operator represents a summary calculation specified on the command line, for example '--mean 5'.
A Summarizer object maintains the state of the summarization and performs basic processing. Handling of files and input lines is left to the caller.
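The division of labor among these pieces can be sketched in Python. This is a simplified illustration under stated assumptions, not the tool's design in detail: the class and function names are hypothetical, a factory function stands in for the Operator, and a plain dictionary stands in for the Summarizer's per-key state.

```python
class MeanCalculator:
    """Hypothetical calculator: consumes lines, then reports one value."""

    def __init__(self, field_index):
        self.field_index = field_index
        self.total = 0.0
        self.count = 0

    def process_next_line(self, fields):
        # Accumulate state from the one field this calculator watches.
        self.total += float(fields[self.field_index])
        self.count += 1

    def calculate(self):
        # Produce the final value once all lines have been processed.
        return self.total / self.count if self.count else float("nan")


def summarize_by_key(lines, key_index, make_calculator):
    """Keep one calculator per unique key; the caller supplies the lines."""
    calculators = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        key = fields[key_index]
        if key not in calculators:
            calculators[key] = make_calculator()
        calculators[key].process_next_line(fields)
    return {key: calc.calculate() for key, calc in calculators.items()}


rows = ["a\t1", "b\t4", "a\t3"]
print(summarize_by_key(rows, 0, lambda: MeanCalculator(1)))
```

Keeping file handling with the caller, as the description above notes, means the same summarizing loop works on any source of lines.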
SummarizerPrintOptions holds printing options for Summarizers and Calculators. Typically specified with command line options, it is separated out for modularity.
Command line options - Container and processing. The processArgs method is used to process the command line.