helpTextVerbose

Undocumented in source.
auto helpTextVerbose = q"EOS
Synopsis: tsv-sample [options] [file...]

Samples or randomizes input lines. There are several modes of operation:
* Randomization (Default): Input lines are output in random order.
* Stream sampling (--r|rate): Input lines are sampled based on a sampling
  rate. The order of the input is unchanged.
* Distinct sampling (--k|key-fields, --r|rate): Sampling is based on the
  values in the key field. A portion of the keys are chosen based on the
  sampling rate (a distinct set). All lines with one of the selected keys
  are output. Input order is unchanged.
* Weighted sampling (--w|weight-field): Input lines are selected using
  weighted random sampling, with the weight taken from a field. Input
  lines are output in the order selected, reordering the lines. See
  'Weighted sampling' below for info on field weights.

Sample size: The '--n|num' option limits the sample size produced. This
speeds up randomization and weighted sampling significantly (details
below).

Controlling randomization: Each run produces a different randomization.
Using '--s|static-seed' changes this so multiple runs produce the same
randomization. This works by using the same random seed each run. The
random seed can be specified using '--v|seed-value'. This takes a
non-zero, 32-bit positive integer. (A zero value is a no-op and ignored.)

Generating random weights: The random weight assigned to each line can be
output using the '--p|print-random' option. This can be used with
'--rate 1' to assign a random weight to each line. The random weight is
prepended to each line as field one (separated by TAB or --d|delimiter
char). Weights are in the interval [0,1]. The open/closed aspects of the
interval (including/excluding 0.0 and 1.0) are subject to change and
should not be relied on.

Reservoir sampling: The randomization and weighted sampling cases are
implemented using reservoir sampling. This means all lines output must be
held in memory. Memory needed for large input streams can be reduced
significantly using a sample size. Both 'tsv-sample -n 1000' and
'tsv-sample | head -n 1000' produce the same results, but the former is
quite a bit faster.

Weighted sampling: Weighted random sampling is done using an algorithm
described by Efraimidis and Spirakis. Weights should be positive values
representing the relative weight of the entry in the collection. Counts
and similar can be used as weights; it is *not* necessary to normalize to
a [0,1] interval. Negative values are not meaningful and are given the
value zero. Input order is not retained; instead, lines are output ordered
by the randomized weight that was assigned. This means that a smaller
valid sample can be produced by taking the first N lines of output. For
more info on the sampling approach see:
* Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling
* "Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis
  (https://arxiv.org/abs/1012.0256)

Options:
EOS";
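
The weighted sampling described in the help text can be illustrated with a small sketch. This is not the tsv-sample implementation. It is a simplified, non-streaming version of the Efraimidis and Spirakis idea, in which each entry with weight w receives the random key u^(1/w), with u drawn uniformly from [0,1), and entries are output in descending key order. The names weightedSample and Keyed are hypothetical, and giving non-positive weights a key of zero is an assumption made for this example.

```d
import std.algorithm : map, sort;
import std.array : array;
import std.math : pow;
import std.random : Random, uniform01;

/// Hypothetical pairing of a line with its randomized key.
struct Keyed
{
    double key;
    string line;
}

/// Hypothetical helper (not part of tsv-sample): weighted sampling without
/// replacement. Returns at most n lines, ordered by descending random key.
string[] weightedSample(string[] lines, double[] weights, size_t n, uint seed)
{
    assert(lines.length == weights.length);

    auto rng = Random(seed);   // Fixed seed, so repeated runs give the same order.
    Keyed[] keyed;

    foreach (i, line; lines)
    {
        double w = weights[i];
        // Assumption for this sketch: non-positive weights get key 0.0.
        double key = (w > 0.0) ? pow(uniform01(rng), 1.0 / w) : 0.0;
        keyed ~= Keyed(key, line);
    }

    // Larger keys come first; larger weights make larger keys more likely.
    keyed.sort!((a, b) => a.key > b.key);

    if (n < keyed.length) keyed = keyed[0 .. n];
    return keyed.map!(k => k.line).array;
}

void main()
{
    import std.stdio : writeln;

    auto lines = ["a\t10", "b\t1", "c\t5"];
    auto weights = [10.0, 1.0, 5.0];
    writeln(weightedSample(lines, weights, 2, 42));
}
```

Because the output is ordered by key, the first N entries of a larger result form a valid smaller sample, which matches the property stated in the help text. A streaming variant would keep only the n largest keys in a heap instead of sorting every line.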

Meta