Utilities for parsing "field-lists" entered on the command line.
A "field-list" is entered on the command line to specify a set of fields for a command option. A field-list is a comma-separated list of individual fields and "field-ranges". Fields are identified either by field number or by field names found in the header line of the input data. A field-range is a pair of fields separated by a hyphen and includes both the listed fields and all the fields in between.
Field-lists are parsed into an ordered set of one-based field numbers. Repeating fields are allowed. Some examples of numeric fields with the tsv-select tool:
    $ tsv-select -f 3          # Field 3
    $ tsv-select -f 3-5        # Fields 3,4,5
    $ tsv-select -f 7,3-5      # Fields 7,3,4,5
    $ tsv-select -f 3,5-3,5    # Fields 3,5,4,3,5
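The expansion rules shown in these examples can be sketched in a few lines. The following Python is an illustration only, not the tsv-utils implementation (which is written in D); the function name `parse_numeric_field_list` is hypothetical, and rejecting field number zero is an assumption based on the field numbers being one-based.

```python
import re

# One field number, optionally followed by a hyphen and a second field number.
_NUMERIC_GROUP = re.compile(r"(\d+)(?:-(\d+))?", re.ASCII)


def parse_numeric_field_list(field_list: str) -> list[int]:
    """Expand a numeric field-list such as '7,3-5' into one-based field numbers.

    Hypothetical helper for illustration; order and repeats are preserved.
    """
    fields: list[int] = []
    for group in field_list.split(","):
        match = _NUMERIC_GROUP.fullmatch(group)
        if match is None:
            raise ValueError(f"invalid numeric field group: {group!r}")
        start = int(match.group(1))
        end = int(match.group(2)) if match.group(2) is not None else start
        if start == 0 or end == 0:
            # Assumption: zero is rejected because field numbers are one-based.
            raise ValueError("field numbers are one-based; zero is not allowed")
        # A range may run forwards (3-5) or backwards (5-3).
        step = 1 if start <= end else -1
        fields.extend(range(start, end + step, step))
    return fields


print(parse_numeric_field_list("3,5-3,5"))  # [3, 5, 4, 3, 5]
```

The result is a list rather than a set because both the order of the entries and repeated fields are significant.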
Fields specified by name must match a name in the header line of the input data. Glob-style wildcards are supported using the asterisk (*) character. When wildcards are used with a single field, all matching fields in the header are used. When used in a field range, each of the two field names must match exactly one header field.
Consider a file data.tsv containing timing information:
    $ tsv-pretty data.tsv
    run  elapsed_time  user_time  system_time  max_memory
      1          57.5       52.0          5.5        1420
      2          52.0       49.0          3.0        1270
      3          55.5       51.0          4.5        1410
The header fields are:
    1    run
    2    elapsed_time
    3    user_time
    4    system_time
    5    max_memory
Some examples using named fields for this file (note that -H turns on header processing):
    $ tsv-select data.tsv -H -f user_time           # Field 3
    $ tsv-select data.tsv -H -f run,user_time       # Fields 1,3
    $ tsv-select data.tsv -H -f run-user_time       # Fields 1,2,3
    $ tsv-select data.tsv -H -f '*_memory'          # Field 5
    $ tsv-select data.tsv -H -f '*_time'            # Fields 2,3,4
    $ tsv-select data.tsv -H -f '*_time,*_memory'   # Fields 2,3,4,5
    $ tsv-select data.tsv -H -f '*_memory,*_time'   # Fields 5,2,3,4
    $ tsv-select data.tsv -H -f 'run-*_time'        # Invalid range. '*_time' matches 3 fields
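The matching rules behind these examples can be modeled as follows. This is a minimal Python sketch, not the D implementation: the names `glob_to_regex`, `match_fields`, and `expand_named_field_list` are hypothetical, escape handling and numeric fields are deliberately left out, and support for reversed named ranges is an assumption carried over from the numeric case.

```python
import re

HEADER = ["run", "elapsed_time", "user_time", "system_time", "max_memory"]


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a glob in which '*' is the only wildcard into a regex."""
    return re.compile(".*".join(re.escape(part) for part in pattern.split("*")))


def match_fields(pattern: str, header: list[str]) -> list[int]:
    """Return the one-based numbers of all header fields matching the glob."""
    regex = glob_to_regex(pattern)
    return [i for i, name in enumerate(header, start=1) if regex.fullmatch(name)]


def expand_named_field_list(field_list: str, header: list[str]) -> list[int]:
    """Expand a named field-list (no escapes, no numbers) into field numbers."""
    fields: list[int] = []
    for group in field_list.split(","):
        if "-" in group:
            # Field range: each endpoint must match exactly one header field.
            endpoints = []
            for name in group.split("-", 1):
                matches = match_fields(name, header)
                if len(matches) != 1:
                    raise ValueError(
                        f"invalid range: {name!r} matches {len(matches)} fields"
                    )
                endpoints.append(matches[0])
            start, end = endpoints
            step = 1 if start <= end else -1
            fields.extend(range(start, end + step, step))
        else:
            # Single field: every matching header field is used, in header order.
            matches = match_fields(group, header)
            if not matches:
                raise ValueError(f"field not found in header: {group!r}")
            fields.extend(matches)
    return fields


print(expand_named_field_list("*_memory,*_time", HEADER))  # [5, 2, 3, 4]
```

With this sketch, `expand_named_field_list("run-*_time", HEADER)` raises `ValueError`, mirroring the invalid range in the last example above.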
Field numbers and field names can be used in the same field-list, except when specifying a field range:
    $ tsv-select data.tsv -H -f 1,user_time     # Fields 1,3
    $ tsv-select data.tsv -H -f 1-user_time     # Invalid range
A backslash is used to escape special characters occurring in field names. Characters that must be escaped when specifying field names are: asterisk (*), comma (,), colon (:), space ( ), hyphen (-), and backslash (\). A backslash is also used to escape numbers that should be treated as field names rather than field numbers. Consider a file with the following header fields:
    1   test id
    2   run:id
    3   time-stamp
    4   001
    5   100
These fields can be used in named field commands as follows:
    $ tsv-select file.tsv -H -f 'test\ id'       # Field 1
    $ tsv-select file.tsv -H -f 'run\:id'        # Field 2
    $ tsv-select file.tsv -H -f 'time\-stamp'    # Field 3
    $ tsv-select file.tsv -H -f '\001'           # Field 4
    $ tsv-select file.tsv -H -f '\100'           # Field 5
    $ tsv-select file.tsv -H -f '\001,\100'      # Fields 4,5
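One way to picture the escape rules is a parser that splits only on unescaped commas and treats an entry as a field number only when it consists entirely of unescaped digits. The Python below is a sketch under those assumptions, not the tsv-utils code; `split_unescaped`, `unescape`, and `resolve_entry` are hypothetical names, and wildcards and ranges are omitted for brevity.

```python
HEADER = ["test id", "run:id", "time-stamp", "001", "100"]


def split_unescaped(text: str, sep: str) -> list[str]:
    """Split text on sep, ignoring separators preceded by a backslash."""
    parts, current, i = [], [], 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text):
            current.append(text[i:i + 2])  # keep the escape pair intact
            i += 2
        elif ch == sep:
            parts.append("".join(current))
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    parts.append("".join(current))
    return parts


def unescape(entry: str) -> str:
    """Remove one level of backslash escaping from a field-list entry."""
    out, i = [], 0
    while i < len(entry):
        if entry[i] == "\\" and i + 1 < len(entry):
            out.append(entry[i + 1])
            i += 2
        else:
            out.append(entry[i])
            i += 1
    return "".join(out)


def resolve_entry(entry: str, header: list[str]) -> int:
    """Map one entry (no wildcard, no range) to a one-based field number."""
    if entry.isascii() and entry.isdigit():
        return int(entry)  # unescaped digits are a field number
    # Anything else is a field name; list.index raises ValueError if absent.
    return header.index(unescape(entry)) + 1


entries = split_unescaped(r"\001,\100", ",")
print([resolve_entry(e, HEADER) for e in entries])  # [4, 5]
```

Note that `100` without the backslash would be read as field number 100, which is why the escape is needed to select the field named `100`.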
Field-lists are combined with other content in some command line options. The colon and space characters are both terminator characters for field-lists. Some examples:
    $ tsv-filter -H --lt 3:100                            # Field 3 < 100
    $ tsv-filter -H --lt elapsed_time:100                 # 'elapsed_time' field < 100
    $ tsv-summarize -H --quantile '*_time:0.25,0.75'      # 1st and 3rd quartiles for time fields
Field-list support routines identify the termination of the field-list. They do not do any processing of content occurring after the field-list.
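Finding the end of a field-list amounts to locating the first colon or space that is not escaped by a backslash. The Python sketch below illustrates the idea; `split_at_terminator` is a hypothetical helper, and dropping the terminator character from the returned remainder is a convenience of this sketch rather than documented behavior of the support routines.

```python
def split_at_terminator(option_value: str) -> tuple[str, str]:
    """Split an option value into (field_list, remainder).

    The field-list ends at the first unescaped colon or space. The remainder
    is returned untouched; interpreting it is left to the caller.
    """
    i = 0
    while i < len(option_value):
        ch = option_value[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch in ": ":
            return option_value[:i], option_value[i + 1:]
        i += 1
    return option_value, ""  # no terminator: the whole value is a field-list


print(split_at_terminator("*_time:0.25,0.75"))  # ('*_time', '0.25,0.75')
print(split_at_terminator(r"run\:id:5"))        # ('run\\:id', '5')
```

The commas in `0.25,0.75` are not treated as field-list separators because scanning stops at the colon.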
The original field-lists used in tsv-utils were numeric only. This is still the format used when a header line is not available. Numeric field-lists are a strict subset of the field-list syntax described above. Because of this history, some support routines handle only numeric field-lists. They are used by tools that support only numeric field-lists, and by the more general field-list processing routines in this file when a named field or field range can be reduced to a numeric field-group.
The following functions provide the APIs for field-list processing:
The following private functions handle key parts of the implementation:
The consumeEntireFieldListString flag is used as a template parameter indicating whether the entire field-list string should be consumed. It is used by parseNumericFieldList.
OptionHandlerDelegate is the signature of the delegate returned by makeFieldListOptionHandler.
findFieldGroups creates a range that iterates over the 'field-groups' in a 'field-list'. (Private function.)
isMixedNumericNamedFieldGroup determines if a field group is a range where one element is a field number and the other element is a named field (not a number).
isNumericFieldGroup determines if a field-group is a valid numeric field-group. (Private function.)
isNumericFieldGroupWithHyphenFirstOrLast determines if a field-group is a field number with a leading or trailing hyphen. (Private function.)
makeFieldListOptionHandler creates a std.getopt option handler for processing field-lists entered on the command line. A field-list is as defined by parseNumericFieldList.
namedFieldGroupToRegex generates regular expressions for matching fields in a named field-group to field names in a header line. (Private function.)
namedFieldRegexMatches returns an input range iterating over all the fields (strings) in an input range that match a regular expression. (Private function.)
parseFieldList returns a range iterating over the field numbers in a field-list.
parseNumericFieldGroup parses a single number or number range. E.g. '5' or '5-8'. (Private function.)
parseNumericFieldList lazily generates a range of field numbers from a 'numeric field-list' string.
fieldListHelpText is text intended for display to end users to describe the field-list syntax.