tsv_utils.common.fieldlist

Utilities for parsing "field-lists" entered on the command line.

Field-lists

A "field-list" is entered on the command line to specify a set of fields for a command option. A field-list is a comma separated list of individual fields and "field-ranges". Fields are identified either by field number or by field names found in the header line of the input data. A field-range is a pair of fields separated by a hyphen and includes both the listed fields and all the fields in between.

Note: Internally, each comma-separated entry in a field-list is called a field-group.

Field-lists are parsed into an ordered sequence of one-based field numbers. Repeating fields are allowed. Some examples of numeric fields with the tsv-select tool:

$ tsv-select -f 3         # Field  3
$ tsv-select -f 3-5       # Fields 3,4,5
$ tsv-select -f 7,3-5     # Fields 7,3,4,5
$ tsv-select -f 3,5-3,5   # Fields 3,5,4,3,5
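
The expansion rules shown above can be sketched in a few lines. This is an illustrative Python sketch, not the D implementation in this module; the function names are hypothetical, and rejecting field number zero is an assumption based on the fields being one-based.

```python
import re

# A numeric field-group is a single number or a hyphen-separated pair.
_NUMERIC_GROUP = re.compile(r"([0-9]+)(?:-([0-9]+))?")


def parse_numeric_field_group(group: str) -> list[int]:
    """Expand one numeric field-group ('5' or '5-8') into field numbers."""
    match = _NUMERIC_GROUP.fullmatch(group)
    if match is None:
        raise ValueError(f"invalid numeric field-group: {group!r}")
    start = int(match.group(1))
    end = int(match.group(2)) if match.group(2) is not None else start
    if start == 0 or end == 0:
        # Assumption: zero is rejected because field numbers are one-based.
        raise ValueError("field numbers are one-based")
    # A range may run in either direction: '5-3' yields 5, 4, 3.
    step = 1 if start <= end else -1
    return list(range(start, end + step, step))


def parse_numeric_field_list(field_list: str) -> list[int]:
    """Expand a comma-separated numeric field-list, keeping order and repeats."""
    fields: list[int] = []
    for group in field_list.split(","):
        fields.extend(parse_numeric_field_group(group))
    return fields


print(parse_numeric_field_list("3,5-3,5"))  # [3, 5, 4, 3, 5]
```

The result is a list rather than a set because both the order and the repeats in the field-list are preserved.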

Fields specified by name must match a name in the header line of the input data. Glob-style wildcards are supported using the asterisk (*) character. When wildcards are used with a single field, all matching fields in the header are used. When used in a field range, each of the two field names must match exactly one header field.

Consider a file data.tsv containing timing information:

$ tsv-pretty data.tsv
run  elapsed_time  user_time  system_time  max_memory
  1          57.5       52.0          5.5        1420
  2          52.0       49.0          3.0        1270
  3          55.5       51.0          4.5        1410

The header fields are:

1    run
2    elapsed_time
3    user_time
4    system_time
5    max_memory

Some examples using named fields for this file. (Note: -H turns on header processing):

$ tsv-select data.tsv -H -f user_time           # Field  3
$ tsv-select data.tsv -H -f run,user_time       # Fields 1,3
$ tsv-select data.tsv -H -f run-user_time       # Fields 1,2,3
$ tsv-select data.tsv -H -f '*_memory'          # Field  5
$ tsv-select data.tsv -H -f '*_time'            # Fields 2,3,4
$ tsv-select data.tsv -H -f '*_time,*_memory'   # Fields 2,3,4,5
$ tsv-select data.tsv -H -f '*_memory,*_time'   # Fields 5,2,3,4
$ tsv-select data.tsv -H -f 'run-*_time'        # Invalid range. '*_time' matches 3 fields
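
The name-matching rules in these examples can be sketched as follows. This is an illustrative Python sketch with hypothetical function names, not the module's D code. It assumes field names contain no escaped characters and no field numbers, so every hyphen is treated as a range separator, and it assumes named ranges may run in either direction, as numeric ranges do.

```python
import re

HEADER = ["run", "elapsed_time", "user_time", "system_time", "max_memory"]


def wildcard_to_regex(name: str) -> re.Pattern:
    """Translate a field name with '*' wildcards into a compiled regex."""
    # Escape each literal piece, then join the pieces with '.*'.
    return re.compile(".*".join(re.escape(part) for part in name.split("*")))


def match_fields(name: str, header: list[str]) -> list[int]:
    """Return the one-based numbers of all header fields matching name."""
    pattern = wildcard_to_regex(name)
    return [i for i, field in enumerate(header, start=1) if pattern.fullmatch(field)]


def resolve_named_group(group: str, header: list[str]) -> list[int]:
    """Resolve one named field-group (single name or range) to field numbers."""
    if "-" in group:
        ends = []
        for name in group.split("-", 1):
            matches = match_fields(name, header)
            if len(matches) != 1:
                # Each end of a range must match exactly one header field.
                raise ValueError(f"{name!r} matches {len(matches)} fields")
            ends.append(matches[0])
        start, end = ends
        step = 1 if start <= end else -1
        return list(range(start, end + step, step))
    matches = match_fields(group, header)
    if not matches:
        raise ValueError(f"{group!r} does not match any header field")
    return matches


def resolve_named_field_list(field_list: str, header: list[str]) -> list[int]:
    """Resolve a comma-separated list of named field-groups."""
    fields: list[int] = []
    for group in field_list.split(","):
        fields.extend(resolve_named_group(group, header))
    return fields


print(resolve_named_field_list("*_memory,*_time", HEADER))  # [5, 2, 3, 4]
```

Matches for a wildcard are returned in header order, while the groups themselves are kept in the order they were written, which is why '*_memory,*_time' yields 5 before 2,3,4.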

Field numbers and field names can both be used in the same field-list, except when specifying a field range:

$ tsv-select data.tsv -H -f 1,user_time         # Fields 1,3
$ tsv-select data.tsv -H -f 1-user_time         # Invalid range

A backslash is used to escape special characters occurring in field names. Characters that must be escaped when they appear in field names are: asterisk (*), comma (,), colon (:), space ( ), hyphen (-), and backslash (\). A backslash is also used to escape numbers that should be treated as field names rather than field numbers. Consider a file with the following header fields:

1    test id
2    run:id
3    time-stamp
4    001
5    100

These fields can be used in named field commands as follows:

$ tsv-select file.tsv -H -f 'test\ id'          # Field 1
$ tsv-select file.tsv -H -f 'run\:id'           # Field 2
$ tsv-select file.tsv -H -f 'time\-stamp'       # Field 3
$ tsv-select file.tsv -H -f '\001'              # Field 4
$ tsv-select file.tsv -H -f '\100'              # Field 5
$ tsv-select file.tsv -H -f '\001,\100'         # Fields 4,5

Note: The use of single quotes on the command line is necessary to avoid shell interpretation of the backslash character.
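
One way to picture the escape rules is as a translation from an escaped field name to a regular expression. The following is an illustrative Python sketch with hypothetical function names, not the D code in this module: a backslash makes the next character literal, an unescaped asterisk becomes a wildcard, and everything else matches itself.

```python
import re

HEADER = ["test id", "run:id", "time-stamp", "001", "100"]


def field_name_to_regex(name: str) -> re.Pattern:
    """Translate an escaped field name into a regex matching header fields."""
    parts = []
    i = 0
    while i < len(name):
        ch = name[i]
        if ch == "\\":
            if i + 1 == len(name):
                raise ValueError("backslash at end of field name")
            # The escaped character is always matched literally.
            parts.append(re.escape(name[i + 1]))
            i += 2
        elif ch == "*":
            # An unescaped asterisk is a glob-style wildcard.
            parts.append(".*")
            i += 1
        else:
            parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def find_fields(name: str, header: list[str]) -> list[int]:
    """Return the one-based numbers of header fields matching an escaped name."""
    pattern = field_name_to_regex(name)
    return [i for i, field in enumerate(header, start=1) if pattern.fullmatch(field)]


print(find_fields(r"run\:id", HEADER))  # [2]
print(find_fields(r"\001", HEADER))     # [4]
```

Deciding whether an entry such as '\001' is a name rather than a field number happens before this translation; the sketch only shows how the escaped name is matched once that decision has been made.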

Field-lists are combined with other content in some command line options. The colon and space characters are both terminator characters for field-lists. Some examples:

$ tsv-filter -H --lt 3:100                        # Field 3 < 100
$ tsv-filter -H --lt elapsed_time:100             # 'elapsed_time' field < 100
$ tsv-summarize -H --quantile '*_time:0.25,0.75'  # 1st and 3rd quartiles for time fields

Field-list support routines identify the termination of the field-list. They do not do any processing of content occurring after the field-list.
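
Finding the end of a field-list can be sketched as a scan for the first unescaped colon or space. This is an illustrative Python sketch with a hypothetical function name, not the module's D routine. Returning the remainder without the terminator is a choice made for this sketch; the remainder is handed back untouched for the caller to interpret.

```python
def split_field_list(option_value: str) -> tuple[str, str]:
    """Split an option value into (field-list, remaining content)."""
    i = 0
    while i < len(option_value):
        ch = option_value[i]
        if ch == "\\":
            # Skip the backslash and the character it escapes.
            i += 2
            continue
        if ch in ": ":
            # First unescaped terminator: the field-list ends here.
            return option_value[:i], option_value[i + 1:]
        i += 1
    # No terminator: the whole value is the field-list.
    return option_value, ""


print(split_field_list("elapsed_time:100"))      # ('elapsed_time', '100')
print(split_field_list("*_time:0.25,0.75"))      # ('*_time', '0.25,0.75')
```

Because escaped characters are skipped, a field name containing a colon, such as 'run\:id', does not end the field-list early.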

Numeric field-lists

The original field-lists used in tsv-utils were numeric only. This is still the format used when a header line is not available. Numeric field-lists are a strict subset of the field-list syntax described above. Because of this history, some support routines handle only numeric field-lists. They are used by tools that support only numeric field-lists, and by the more general field-list processing routines in this file when a named field or field range can be reduced to a numeric field-group.

Field-list utilities

The following functions provide the APIs for field-list processing:

  • parseFieldList - The main routine for parsing a field-list entered on the command line. It returns a range iterating over the field numbers represented by the field-list. It handles both numeric and named field-lists and works with or without header lines. The range has a special member function that tracks how much of the original input range has been consumed.
  • parseNumericFieldList - This is a top-level routine for processing numeric field-lists entered on the command line. It was the original routine used by tsv-utils tools when only numeric field-lists were supported. It is still used in cases where only numeric field-lists are supported.
  • makeFieldListOptionHandler - Returns a delegate that can be passed to std.getopt for parsing numeric field-lists. It was part of the original code supporting numeric field-lists. Note that delegates passed to std.getopt do not have access to the header line of the input file, so the technique can only be used for numeric field-lists.
  • fieldListHelpText - A global variable containing help text describing the field list syntax that can be shown to end users.

The following private functions handle key parts of the implementation:

  • findFieldGroups - Range that iterates over the "field-groups" in a "field-list".
  • isNumericFieldGroup - Determines if a field-group is a valid numeric field-group.
  • isNumericFieldGroupWithHyphenFirstOrLast - Determines if a field-group is a valid numeric field-group, except for having a leading or trailing hyphen. This test is used to provide better error messages. A field-group that does not pass either isNumericFieldGroup or isNumericFieldGroupWithHyphenFirstOrLast is processed as a named field-group.
  • isMixedNumericNamedFieldGroup - Determines if a field group is a range where one element is a field number and the other element is a named field (not a number). This is used for error handling.
  • namedFieldGroupToRegex - Generates regexes for matching field names in a field group to field names in the header line. One regex is generated for a single field, two are generated for a range. Wildcards and escape characters are translated into the correct regex format.
  • namedFieldRegexMatches - Returns an input range iterating over all the fields (strings) in a range matching a regular expression. It is used in conjunction with namedFieldGroupToRegex to find the fields in a header line matching a regular expression and map them to field numbers.
  • parseNumericFieldGroup - A helper function that parses a numeric field group (a string) and returns a range that iterates over all the field numbers in the field group. A numeric field-group is either a single number or a range. E.g. 5 or 5-8. This routine was part of the original code supporting only numeric field-lists.
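
The way these tests sort a field-group into categories can be sketched as below. This is an illustrative Python sketch, not the D implementation; the function names, the category labels, and the exact patterns are assumptions chosen to match the descriptions above. A group that fails both numeric tests is treated as named, with the mixed check picking out the one invalid kind of range.

```python
import re

_NUMBER = re.compile(r"[0-9]+")
_NUMERIC_GROUP = re.compile(r"[0-9]+(-[0-9]+)?")
_HYPHEN_FIRST_OR_LAST = re.compile(r"-[0-9]+|[0-9]+-")


def _split_unescaped_hyphen(group: str):
    """Split at the first unescaped hyphen; the second item is None if absent."""
    i = 0
    while i < len(group):
        if group[i] == "\\":
            i += 2  # skip the escaped character
            continue
        if group[i] == "-":
            return group[:i], group[i + 1:]
        i += 1
    return group, None


def classify_field_group(group: str) -> str:
    """Label a field-group; the labels are hypothetical names for this sketch."""
    if _NUMERIC_GROUP.fullmatch(group):
        return "numeric"
    if _HYPHEN_FIRST_OR_LAST.fullmatch(group):
        return "error: hyphen first or last"
    first, last = _split_unescaped_hyphen(group)
    if last is not None:
        # A range with exactly one numeric end mixes numbers and names.
        if bool(_NUMBER.fullmatch(first)) != bool(_NUMBER.fullmatch(last)):
            return "error: mixed numeric and named range"
    return "named"


for group in ["3-5", "5-", "1-user_time", r"time\-stamp", r"\001-\100"]:
    print(group, "->", classify_field_group(group))
```

An escaped number such as '\001' fails the numeric tests because of the leading backslash, so it falls through to the named category.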

Members

Aliases

AllowFieldNumZero
alias AllowFieldNumZero = Flag!"allowFieldNumZero"

The allowFieldNumZero flag is used as a template parameter controlling whether zero is a valid field. It is used by parseFieldList, parseNumericFieldList, and makeFieldListOptionHandler.

ConsumeEntireFieldListString
alias ConsumeEntireFieldListString = Flag!"consumeEntireFieldListString"

The consumeEntireFieldListString flag is used as a template parameter indicating whether the entire field-list string should be consumed. It is used by parseNumericFieldList.

ConvertToZeroBasedIndex
alias ConvertToZeroBasedIndex = Flag!"convertToZeroBasedIndex"

The convertToZeroBasedIndex flag is used as a template parameter controlling whether field numbers are converted to zero-based indices. It is used by parseFieldList, parseNumericFieldList, and makeFieldListOptionHandler.

OptionHandlerDelegate
alias OptionHandlerDelegate = void delegate(string option, string value)

OptionHandlerDelegate is the signature of the delegate returned by makeFieldListOptionHandler.

Functions

findFieldGroups
auto findFieldGroups(Range r)

findFieldGroups creates a range that iterates over the 'field-groups' in a 'field-list'. (Private function.)

isMixedNumericNamedFieldGroup
bool isMixedNumericNamedFieldGroup(char[] fieldGroup)

isMixedNumericNamedFieldGroup determines if a field group is a range where one element is a field number and the other element is a named field (not a number). (Private function.)

isNumericFieldGroup
bool isNumericFieldGroup(char[] fieldGroup)

isNumericFieldGroup determines if a field-group is a valid numeric field-group. (Private function.)

isNumericFieldGroupWithHyphenFirstOrLast
bool isNumericFieldGroupWithHyphenFirstOrLast(char[] fieldGroup)

isNumericFieldGroupWithHyphenFirstOrLast determines if a field-group is a field number with a leading or trailing hyphen. (Private function.)

makeFieldListOptionHandler
OptionHandlerDelegate makeFieldListOptionHandler(T[] fieldsArray)

makeFieldListOptionHandler creates a std.getopt option handler for processing field-lists entered on the command line. A field-list is as defined by parseNumericFieldList.

namedFieldGroupToRegex
auto namedFieldGroupToRegex(char[] fieldGroup)

namedFieldGroupToRegex generates regular expressions for matching fields in a named field-group to field names in a header line. (Private function.)

namedFieldRegexMatches
auto namedFieldRegexMatches(Range headerFields, Regex!char fieldRegex)

namedFieldRegexMatches returns an input range iterating over all the fields (strings) in an input range that match a regular expression. (Private function.)

parseFieldList
auto parseFieldList(string fieldList, bool hasHeader, string[] headerFields, string cmdOptionString, string headerCmdArg)

parseFieldList returns a range iterating over the field numbers in a field-list.

parseNumericFieldGroup
auto parseNumericFieldGroup(string fieldRange)

parseNumericFieldGroup parses a single number or number range. E.g. '5' or '5-8'. (Private function.)

parseNumericFieldList
auto parseNumericFieldList(string fieldList, char delim)

parseNumericFieldList lazily generates a range of field numbers from a 'numeric field-list' string.

Variables

fieldListHelpText
auto fieldListHelpText;

fieldListHelpText is text intended for display to end users to describe the field-list syntax.

Meta