helpTextVerbose
immutable auto helpTextVerbose =
q"EOS
Synopsis: tsv-sample [options] [file...]
Sample input lines or randomize their order. Several modes of operation
are available:
* Line order randomization (the default): All input lines are output in a
random order. All orderings are equally likely.
* Weighted line order randomization (--w|weight-field): Lines are selected
using weighted random sampling, with the weight taken from a field.
Lines are output in weighted selection order, reordering the lines.
* Sampling with replacement (--r|replace, --n|num): All input is read into
memory, then lines are repeatedly selected at random and written out. This
continues until --n|num samples are output. Lines can be selected multiple
times. Output continues forever if --n|num is zero or not specified.
* Bernoulli sampling (--p|prob): A random subset of lines is output based
on an inclusion probability. This is a streaming operation. A selection
decision is made on each line as it is read. Line order is not changed.
* Distinct sampling (--k|key-fields, --p|prob): Input lines are sampled
based on the values in the key field. A subset of the keys is chosen
based on the inclusion probability (a 'distinct' set of keys). All lines
with one of the selected keys are output. Line order is not changed.
Sample size: The '--n|num' option limits the sample size produced. This
speeds up line order randomization and weighted sampling significantly
(details below). It is also used to terminate sampling with replacement.
Controlling the random seed: By default, each run produces a different
randomization or sampling. Using '--s|static-seed' changes this so
multiple runs produce the same results. This works by using the same
random seed each run. The random seed can be specified using
'--v|seed-value'. This takes a non-zero, 32-bit positive integer. (A zero
value is a no-op and is ignored.)
Memory use: Bernoulli sampling and distinct sampling make decisions on
each line as it is read, so there is no memory accumulation. These
algorithms support arbitrary size inputs. Sampling with replacement reads
all lines into memory and is limited by available memory. The line order
randomization algorithms hold the full output set in memory prior to
generating results. This ultimately limits the size of the output set. For
these, memory needs can be reduced by using a sample size (--n|num). This
engages reservoir sampling. Output order is not affected. Both
'tsv-sample -n 1000' and 'tsv-sample | head -n 1000' produce the same
results, but the former is quite a bit faster.
Weighted sampling: Weighted random sampling is done using an algorithm
described by Pavlos Efraimidis and Paul Spirakis. Weights should be
positive values representing the relative weight of the entry in the
collection. Counts and similar values can be used as weights; it is *not*
necessary to normalize to a [0,1] interval. Negative values are not
meaningful and are given the value zero. Input order is not retained;
instead, lines are output ordered by the randomized weight that was
assigned. This means that a smaller valid sample can be produced by taking
the first N lines of output. For more info on the sampling approach see:
* Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling
* "Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis
(https://arxiv.org/abs/1012.0256)
Printing random values: Most of the sampling algorithms work by generating
a random value for each line. (See "Compatibility mode" below.) The nature
of these values depends on the sampling algorithm. They are used for both
line selection and output ordering. The '--p|print-random' option can be
used to print these values. The random value is prepended to the line
separated by the --d|delimiter char (TAB by default). The
'--q|gen-random-inorder' option takes this one step further, generating
random values for all input lines without changing the input order. The
types of values currently used by these sampling algorithms:
* Unweighted sampling: Uniform random value in the interval [0,1]. This
includes Bernoulli sampling and unweighted line order randomization.
* Weighted sampling: Value in the interval [0,1]. Distribution depends on
the values in the weight field. It is used as a partial ordering.
* Distinct sampling: An integer, zero and up, representing a selection
group. The inclusion probability determines the number of selection groups.
* Sampling with replacement: Random value printing is not supported.
The specifics behind these random values are subject to change in future
releases.
Compatibility mode: As described above, many of the sampling algorithms
assign a random value to each line. This is useful when printing random
values. It has another occasionally useful property: repeated runs with
the same static seed but different selection parameters are more
compatible with each other, as each line gets assigned the same random
value on every run. For example, if Bernoulli sampling is run with
'--prob 0.2 --static-seed', then run again with '--prob 0.3 --static-seed',
all the lines selected in the first run will be selected in the second.
This comes at a cost: in some cases there are faster algorithms that don't
preserve this property. By default, tsv-sample will use faster algorithms
when available. However, the '--compatibility-mode' option switches to
algorithms that assign a random value per line. Printing random values
also engages compatibility mode.
Options:
EOS";
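
The "Compatibility mode" paragraph of the help text says that Bernoulli runs with the same static seed but different probabilities select nested sets of lines. The following is a minimal sketch of why that holds, assuming each line receives one uniform random value in input order and is kept when that value is below the inclusion probability. `bernoulliSelect` is a hypothetical helper written for this illustration; it is not tsv-sample's actual implementation.

```d
import std.random : Mt19937, uniform01;
import std.stdio : writeln;

/// Hypothetical helper (not part of tsv-sample): returns the indices of the
/// lines kept by Bernoulli sampling when every line is assigned one uniform
/// random value, in input order, from a generator seeded with `seed`.
size_t[] bernoulliSelect(size_t numLines, double prob, uint seed)
{
    auto rng = Mt19937(seed);
    size_t[] kept;
    foreach (i; 0 .. numLines)
    {
        // One random value per line, drawn whether or not the line is kept.
        // Line i therefore gets the same value on every run with this seed.
        immutable double r = uniform01(rng);
        if (r < prob) kept ~= i;
    }
    return kept;
}

void main()
{
    import std.algorithm : all, canFind;

    auto small = bernoulliSelect(1000, 0.2, 42);
    auto large = bernoulliSelect(1000, 0.3, 42);

    // Same seed: every line kept at 0.2 is also kept at 0.3.
    assert(small.all!(i => large.canFind(i)));
    writeln(small.length, " lines at prob 0.2, ", large.length, " at prob 0.3");
}
```

Because the random value for a line does not depend on the probability, raising the probability can only add lines to the selection. A faster algorithm that skips ahead without drawing a value for every line would not give this guarantee, which is the trade-off the help text describes.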
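
The "Weighted sampling" paragraph of the help text refers to the approach of Efraimidis and Spirakis. As a rough sketch of that idea, the code below gives each line the key `u^(1/w)`, where `u` is a uniform random value and `w` is the line's weight, then orders lines by descending key. `weightedOrder` and `Scored` are hypothetical names introduced for this example, and treating non-positive weights as a key of zero is an assumption based on the help text's statement that negative values are given the value zero. It is not tsv-sample's actual code.

```d
import std.algorithm : sort;
import std.math : pow;
import std.random : Mt19937, uniform01;
import std.stdio : writeln;
import std.typecons : Tuple;

/// Hypothetical record type: a line paired with its randomized key.
alias Scored = Tuple!(double, "score", string, "line");

/// Hypothetical helper (not part of tsv-sample): assigns each line the key
/// u^(1/w) and returns the lines ordered by descending key.
Scored[] weightedOrder(string[] lines, double[] weights, uint seed)
{
    assert(lines.length == weights.length);
    auto rng = Mt19937(seed);
    Scored[] scored;
    foreach (i, line; lines)
    {
        immutable double u = uniform01(rng);
        immutable double w = weights[i];
        // Assumption: non-positive weights get a key of zero and sort last.
        immutable double score = (w > 0.0) ? pow(u, 1.0 / w) : 0.0;
        scored ~= Scored(score, line);
    }
    // Larger keys come first; heavier lines tend to get larger keys.
    sort!((a, b) => a.score > b.score)(scored);
    return scored;
}

void main()
{
    auto ordered = weightedOrder(["a", "b", "c", "d"], [1.0, 5.0, 20.0, -3.0], 42);
    foreach (entry; ordered)
        writeln(entry.score, "\t", entry.line);
}
```

Since the output is sorted by key, taking the first N entries yields a weighted sample of size N, which matches the help text's remark about taking the first N lines of output.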