tsv_utils.tsv_uniq

Command line tool that identifies equivalent lines in an input stream. Equivalent lines are identified using either the full line or a set of fields as the key. By default, input is written to standard output, retaining only the first occurrence of each set of equivalent lines. There are also options for marking and numbering equivalent lines rather than filtering out duplicates.

This tool is similar in spirit to the Unix 'uniq' tool, with some key differences. First, the key can be composed of individual fields, not just the full line. Second, input does not need to be sorted. (Unix 'uniq' only detects equivalent lines when they are adjacent, hence the usual need for sorting.)
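The two differences from Unix 'uniq' can be illustrated with a minimal sketch. It is not the tool's actual implementation: the helper names `makeKey` and `firstOccurrences`, the zero-based field indices, and the in-memory array of lines are all assumptions made for the example. The key is built from selected fields, and a set of already-seen keys removes the need for sorted input.

```d
import std.algorithm : map, splitter;
import std.array : array, join;
import std.stdio : writeln;

/// Builds a key from the selected zero-based fields of a tab-separated line.
/// Hypothetical helper for illustration; not part of the tsv_uniq API.
string makeKey(string line, size_t[] fieldIndices)
{
    auto fields = line.splitter('\t').array;
    return fieldIndices.map!(i => fields[i]).join("\t");
}

/// Returns the lines whose key has not been seen before, in input order.
/// Hypothetical helper; the input does not need to be sorted.
string[] firstOccurrences(string[] lines, size_t[] fieldIndices)
{
    bool[string] seen;   // associative array used as a set of keys
    string[] result;
    foreach (line; lines)
    {
        auto key = makeKey(line, fieldIndices);
        if (key !in seen)
        {
            seen[key] = true;
            result ~= line;
        }
    }
    return result;
}

void main()
{
    string[] lines = ["a\t1", "b\t2", "a\t3", "b\t2"];
    size_t[] keyFields = [0];   // use the first field as the key
    foreach (line; firstOccurrences(lines, keyFields)) writeln(line);
}
```

With the first field as the key, the third and fourth lines are dropped even though their duplicates are not adjacent to them.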

There are a couple of alternatives to uniq'ing the input lines. One is to mark lines with an equivalence ID, which is a one-upped counter. The other is to number lines, with each unique key having its own set of numbers.
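A small sketch may help distinguish the two alternatives. It is illustrative only and not taken from the tool's source: the `Mark` struct and `markKeys` function are hypothetical names, and starting both the equivalence IDs and the per-key numbers at 1 is an assumption. Each distinct key receives the next counter value as its equivalence ID, while the per-key number counts occurrences of that key.

```d
import std.stdio : writefln;

/// Equivalence ID and per-key occurrence number for one input line.
/// Hypothetical type for illustration; not part of the tsv_uniq API.
struct Mark
{
    size_t equivId;     // shared by all lines with the same key
    size_t lineNumber;  // 1 for a key's first occurrence, 2 for its second, ...
}

/// Computes a Mark for each key, in input order.
/// Assumes IDs and numbers both start at 1.
Mark[] markKeys(string[] keys)
{
    size_t[string] idOf;      // key -> equivalence ID
    size_t[string] countOf;   // key -> occurrences seen so far
    size_t nextId = 1;
    Mark[] marks;
    foreach (key; keys)
    {
        if (key !in idOf) idOf[key] = nextId++;   // first time this key is seen
        ++countOf[key];                           // missing entries start at 0
        marks ~= Mark(idOf[key], countOf[key]);
    }
    return marks;
}

void main()
{
    string[] keys = ["a", "b", "a", "c", "b", "a"];
    foreach (i, m; markKeys(keys))
        writefln("%s\t%s\t%s", keys[i], m.equivId, m.lineNumber);
}
```

For the keys above, the equivalence IDs come out as 1, 2, 1, 3, 2, 1 and the per-key numbers as 1, 1, 2, 1, 2, 3. No line is dropped in either mode.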

Copyright (c) 2015-2019, eBay Software Foundation. Initially written by Jon Degenhardt.

Members

Functions

main
int main(string[] cmdArgs)

Main program. Processes command line arguments and calls tsvUniq, which implements the main processing logic.

tsvUniq
void tsvUniq(in TsvUniqOptions cmdopt, in string[] inputFiles)

Outputs the unique lines from all the input files.

Structs

TsvUniqOptions
struct TsvUniqOptions

Container for command line options.

Meta