auto helpTextVerbose =
q"EOS
Synopsis: tsv-sample [options] [file...]

Samples or randomizes input lines. There are several modes of operation:
* Randomization (Default): Input lines are output in random order.
* Stream sampling (--r|rate): Input lines are sampled based on a sampling
  rate. The order of the input is unchanged.
* Distinct sampling (--k|key-fields, --r|rate): Sampling is based on the
  values in the key field. A portion of the keys are chosen based on the
  sampling rate (a distinct set). All lines with one of the selected keys
  are output. Input order is unchanged.
* Weighted sampling (--w|weight-field): Input lines are selected using
  weighted random sampling, with the weight taken from a field. Input
  lines are output in the order selected, reordering the lines. See
  'Weighted sampling' below for info on field weights.

Sample size: The '--n|num' option limits the sample size produced. This
speeds up randomization and weighted sampling significantly (details below).

Controlling randomization: Each run produces a different randomization.
Using '--s|static-seed' changes this so multiple runs produce the same
randomization. This works by using the same random seed each run. The
random seed can be specified using '--v|seed-value'. This takes a
non-zero, 32-bit positive integer. (A zero value is a no-op and ignored.)

Generating random weights: The random weight assigned to each line can be
output using the '--p|print-random' option. This can be used with
'--rate 1' to assign a random weight to each line. The random weight is
prepended to the line as field one (separated by TAB or the --d|delimiter
char). Weights are in the interval [0,1]. The open/closed aspects of the
interval (including/excluding 0.0 and 1.0) are subject to change and
should not be relied on.

Reservoir sampling: The randomization and weighted sampling cases are
implemented using reservoir sampling. This means all lines output must be
held in memory. Memory needed for large input streams can be reduced
significantly by specifying a sample size ('--n|num'). Both
'tsv-sample -n 1000' and 'tsv-sample | head -n 1000' produce the same
results, but the former is quite a bit faster.

Weighted sampling: Weighted random sampling is done using an algorithm
described by Efraimidis and Spirakis. Weights should be positive values
representing the relative weight of the entry in the collection. Counts
and similar can be used as weights; it is *not* necessary to normalize to
a [0,1] interval. Negative values are not meaningful and are given the
value zero. Input order is not retained; instead, lines are output ordered
by the randomized weight that was assigned. This means that a smaller
valid sample can be produced by taking the first N lines of output. For
more info on the sampling approach see:
* Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling
* "Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis
  (https://arxiv.org/abs/1012.0256)

Options:
EOS";
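The Efraimidis–Spirakis weighted reservoir sampling the help text describes can be sketched as follows. This is a minimal Python illustration, not the D implementation used by tsv-sample; the function name, seed value, and heap-based bookkeeping are assumptions for the sketch.

```python
import heapq
import random

def weighted_sample(lines, weights, n, seed=31):
    """Efraimidis-Spirakis (A-Res) sketch: each line gets key u**(1/w)
    for uniform u in [0,1); the n lines with the largest keys form the
    sample, output in descending key order."""
    rng = random.Random(seed)  # fixed seed, akin to --s|static-seed
    heap = []  # min-heap of (key, line); root holds the smallest kept key
    for line, w in zip(lines, weights):
        w = max(float(w), 0.0)  # negative weights are treated as zero
        key = rng.random() ** (1.0 / w) if w > 0 else 0.0
        if len(heap) < n:
            heapq.heappush(heap, (key, line))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, line))
    # Largest keys first: output is ordered by the randomized weight,
    # so the first N output lines are themselves a valid smaller sample.
    return [line for key, line in sorted(heap, reverse=True)]
```

The size-n heap is what keeps memory bounded: only the current sample is held, rather than every input line, which is why '--n|num' speeds up the reservoir-based modes.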
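Distinct sampling, as described in the help text, selects a portion of the keys rather than individual lines, so all lines sharing a key stay together. A minimal sketch of one common hash-based approach: a key is selected when its hash falls in the first `rate` fraction of hash space. The MD5-based hash here is an assumption for illustration; tsv-sample's actual hashing scheme may differ.

```python
import hashlib

def distinct_sample(lines, key_index, rate, delimiter="\t"):
    """Keep a line iff the hash of its key field lands in the selected
    portion of hash space; lines sharing a key are kept or dropped
    together, and input order is unchanged."""
    kept = []
    for line in lines:
        key = line.split(delimiter)[key_index]
        # Map the key to a 64-bit integer deterministically.
        h = int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")
        if h < rate * 2**64:  # key falls in the selected fraction
            kept.append(line)
    return kept
```

Because selection depends only on the key's hash, the same keys are selected on every run over any input, which is what makes the chosen keys "a distinct set".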