helpTextVerbose
auto helpTextVerbose =
q"EOS
Synopsis: tsv-summarize [options] file [file...]
tsv-summarize reads tabular data files (tab-separated by default), tracks
field values for each unique key, and runs summarization algorithms. Consider
the file data.tsv:
Make    Color   Time
ford    blue    131
chevy   green   124
ford    red     128
bmw     black   118
bmw     black   126
ford    blue    122
The min and average times for each make are generated by the command:
$ tsv-summarize --header --group-by Make --min Time --mean Time data.tsv
This produces:
Make    Time_min   Time_mean
ford    122        127
chevy   124        124
bmw     118        122
Using '--group-by Make,Color' will group by both 'Make' and 'Color'.
Omitting '--group-by' entirely summarizes fields for the full file.
The previous example uses field names to identify fields. Field numbers
can be used as well. The next two commands are equivalent:
$ tsv-summarize -H --group-by Make,Color --min Time --mean Time data.tsv
$ tsv-summarize -H --group-by 1,2 --min 3 --mean 3 data.tsv
The program tries to generate useful headers, but custom headers can be
specified. Example (using -H and -g shortcuts for --header and --group-by):
$ tsv-summarize -H -g 1 --min 3:Fastest --mean 3:Average data.tsv
Most operators take custom headers in a similar way, generally following:
--<operator-name> FIELD[:header]
Operators can be specified multiple times. They can also take multiple
fields (though not when a custom header is specified). Examples:
--median 2,3,4
--median 2-5,7-11
--median elapsed_time,system_time,user_time
--median '*_time' # Wildcard. All fields ending in '_time'.
The quantile operator requires one or more probabilities after the fields:
--quantile run_time:0.25       # First quartile of the 'run_time' field
--quantile 2:0.25              # First quartile of field 2
--quantile 2-4:0.25,0.5,0.75   # Q1, Median, Q3 of fields 2, 3, 4
Summarization operators available are:
count    range      mad          values
retain   sum        var          unique-values
first    mean       stddev       unique-count
last     median     mode         missing-count
min      quantile   mode-count   not-missing-count
max
Calculated numeric values are printed to 12 significant digits by default.
This can be changed using the '--p|float-precision' option. If the value is
six or less, it sets the number of significant digits after the decimal
point. If greater than six, it sets the total number of significant digits.
Calculations hold onto the minimum data needed while reading data. A few
operations like median keep all data values in memory. These operations will
start to encounter performance issues as available memory becomes scarce. The
size that can be handled effectively is machine dependent, but often quite
large files can be handled.
Operations requiring numeric entries will signal an error and terminate
processing if a non-numeric entry is found.
Missing values are not treated specially by default. This can be changed
using the '--x|exclude-missing' or '--r|replace-missing' option. The former
turns off processing for missing values; the latter uses a replacement value.
Options:
EOS";
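
The grouping example in the help text can be mirrored with a short Python sketch. This is not how tsv-summarize is implemented; it only illustrates the idea of tracking a small amount of running state per unique key. The function name `summarize_min_mean` and the inline `DATA_TSV` constant are hypothetical names introduced for illustration, and the data is the `data.tsv` sample from the help text.

```python
import csv
import io

# The sample file from the help text, inlined so the sketch is self-contained.
DATA_TSV = (
    "Make\tColor\tTime\n"
    "ford\tblue\t131\n"
    "chevy\tgreen\t124\n"
    "ford\tred\t128\n"
    "bmw\tblack\t118\n"
    "bmw\tblack\t126\n"
    "ford\tblue\t122\n"
)


def summarize_min_mean(tsv_text, key_field, value_field):
    """Group rows by key_field and return (key, min, mean) tuples.

    Keys are reported in the order they first appear in the input.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    # Per-key running state: [minimum so far, sum so far, row count].
    # min and mean need only this state, not the full list of values.
    state = {}
    for row in reader:
        value = float(row[value_field])  # ValueError on a non-numeric entry
        key = row[key_field]
        if key not in state:
            state[key] = [value, value, 1]
        else:
            entry = state[key]
            entry[0] = min(entry[0], value)
            entry[1] += value
            entry[2] += 1
    return [(key, lo, total / n) for key, (lo, total, n) in state.items()]


rows = summarize_min_mean(DATA_TSV, "Make", "Time")
print("Make\tTime_min\tTime_mean")
for make, lo, mean in rows:
    print(f"{make}\t{lo:g}\t{mean:g}")
```

The rows printed match the `Time_min` and `Time_mean` columns shown in the help text. An operator such as median could not be computed from this running state alone, since it needs every value for a key, which is why the help text notes that such operations keep all data values in memory.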