immutable auto helpTextVerbose
Synopsis: tsv-sample [options] [file...]
Sample input lines or randomize their order. Several modes of operation
are available:
* Shuffling (the default): All input lines are output in random order. All
orderings are equally likely.
* Random sampling (--n|num N): A random sample of N lines is selected and
written to standard output. By default, selected lines are written in
random order. All sample sets and orderings are equally likely. Use
--i|inorder to write the selected lines in the original input order.
* Weighted random sampling (--n|num N, --w|weight-field F): A weighted
sample of N lines is produced. Weights are taken from field F. Lines are
output in weighted selection order. Use --i|inorder to write in original
input order. Omit --n|num to shuffle all lines (weighted shuffling).
* Sampling with replacement (--r|replace, --n|num N): All input lines are
read in, then lines are repeatedly selected at random and written out.
This continues until N lines are output. Individual lines can be written
multiple times. Output continues forever if N is zero or not provided.
* Bernoulli sampling (--p|prob P): A random subset of lines is selected
based on probability P, a 0.0-1.0 value. This is a streaming operation.
A decision is made on each line as it is read. Line order is not changed.
* Distinct sampling (--k|key-fields F, --p|prob P): Input lines are sampled
based on the values in the key fields. A subset of keys is chosen based
on the inclusion probability (a 'distinct' set of keys). All lines with
one of the selected keys are output. Line order is not changed.
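
The distinct sampling mode described above can be illustrated with a short
Python sketch. This is not tsv-sample's implementation. The function name
distinct_sample, the SHA-256 hash, the rounding of 1/P to a group count, and
the choice of group 0 are all assumptions made for illustration. The point
it shows is that the decision depends only on the key, so all lines sharing
a key are kept or dropped together, and input order is unchanged.

```python
import hashlib


def distinct_sample(rows, key_index, prob, seed=0):
    """Keep every row whose key falls into selection group 0.

    Hypothetical helper: the hash function and group assignment are
    illustrative assumptions, not tsv-sample's actual method.
    """
    if not 0.0 < prob <= 1.0:
        raise ValueError("prob must be in (0, 1]")
    # An inclusion probability P maps to roughly 1/P selection groups.
    num_groups = max(1, round(1.0 / prob))
    selected = []
    for row in rows:
        key = row[key_index]
        digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
        group = int.from_bytes(digest[:8], "big") % num_groups
        if group == 0:  # same key always lands in the same group
            selected.append(row)
    return selected
```

Because the group is a pure function of the seed and the key, rerunning with
the same seed selects the same keys.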
Fields: Fields are specified by field number or name. Field names require
the input file to have a header line. Use '--help-fields' for details.
Sample size: The '--n|num' option controls the sample size for all
sampling methods. In the case of simple and weighted random sampling it
also limits the amount of memory required.
Controlling the random seed: By default, each run produces a different
randomization or sampling. Using '--s|static-seed' changes this so
multiple runs produce the same results. This works by using the same
random seed each run. The random seed can be specified using
'--v|seed-value'. This takes a non-zero, 32-bit positive integer. (A zero
value is a no-op and ignored.)
Memory use: Bernoulli sampling and distinct sampling make decisions on
each line as it is read, so no memory accumulates. These algorithms
can run on arbitrary size inputs. Sampling with replacement reads all
lines into memory and is limited by available memory. Shuffling also reads
all lines into memory and is similarly limited. Random sampling uses
reservoir sampling, and only needs to hold the sample size (--n|num) in
memory. The input data can be of any length.
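
To illustrate why random sampling only needs to hold N lines, here is a
minimal Python sketch of classic reservoir sampling (often called
Algorithm R). It is a sketch under assumptions, not tsv-sample's code: the
function name reservoir_sample and its parameters are hypothetical, and
this basic variant does not randomize the order of the returned lines.

```python
import random


def reservoir_sample(lines, n, seed=None):
    """Return up to n items chosen uniformly at random from an iterable.

    Only the n-item reservoir is held in memory, so the input can be
    arbitrarily long. Hypothetical helper for illustration.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, line in enumerate(lines):
        if i < n:
            # Fill the reservoir with the first n lines.
            reservoir.append(line)
        else:
            # Line i replaces a random slot with probability n / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < n:
                reservoir[j] = line
    return reservoir
```

For example, reservoir_sample(open("data.tsv"), 100) would read the file
once while keeping at most 100 lines in memory (the file name is a
placeholder).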
Weighted sampling: Weighted random sampling is done using an algorithm
described by Pavlos Efraimidis and Paul Spirakis. Weights should be
positive values representing the relative weight of the entry in the
collection. Counts and similar values can be used as weights; it is *not*
necessary to normalize to a [0,1] interval. Negative values are not
meaningful and are treated as zero. Input order is not retained; instead,
lines are output ordered by the randomized weight that was assigned. This
means that a smaller valid sample can be produced by taking the first N
lines of output. For more info on the sampling approach see:
* Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling
* "Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis
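
The weighted approach can be sketched in Python as follows. This follows the
key-based scheme from the Efraimidis and Spirakis paper (each line gets the
key u^(1/w) for a uniform random u, and the N largest keys are kept), but it
is only an illustration: the function name weighted_sample is hypothetical,
and giving non-positive weights a key of zero is an assumption about how
"treated as zero" plays out.

```python
import heapq
import random


def weighted_sample(items, n, seed=None):
    """Return up to n lines from (line, weight) pairs, largest key first.

    Hypothetical helper for illustration, not tsv-sample's code.
    """
    if n <= 0:
        return []
    rng = random.Random(seed)
    heap = []  # min-heap of (key, index, line); heap[0] is the smallest key
    for idx, (line, weight) in enumerate(items):
        u = rng.random()
        # Assumption: zero and negative weights get key 0.0 and sort last.
        key = u ** (1.0 / weight) if weight > 0 else 0.0
        entry = (key, idx, line)
        if len(heap) < n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            heapq.heapreplace(heap, entry)
    # Output is ordered by the randomized key, not by input order.
    return [line for _, _, line in sorted(heap, reverse=True)]
```

Because output is ordered by key, the first K lines of a larger sample made
with the same seed are themselves a valid weighted sample of size K.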
Printing random values: Most of the sampling algorithms work by generating
a random value for each line. (See "Compatibility mode" below.) The nature
of these values depends on the sampling algorithm. They are used for both
line selection and output ordering. The '--print-random' option can be
used to print these values. The random value is prepended to the line
separated by the --d|delimiter char (TAB by default). The
'--gen-random-inorder' option takes this one step further, generating
random values for all input lines without changing the input order. The
types of values currently used by these sampling algorithms:
* Unweighted sampling: Uniform random value in the interval [0,1]. This
includes Bernoulli sampling and unweighted line order randomization.
* Weighted sampling: Value in the interval [0,1]. Distribution depends on
the values in the weight field. It is used as a partial ordering.
* Distinct sampling: An integer, zero and up, representing a selection
group. The inclusion probability determines the number of selection groups.
* Sampling with replacement: Random value printing is not supported.
The specifics behind these random values are subject to change in future
releases.
Compatibility mode: As described above, many of the sampling algorithms
assign a random value to each line. This is useful when printing random
values. It has another occasionally useful property: repeated runs with
the same static seed but different selection parameters are more
compatible with each other, as each line gets assigned the same random
value on every run. For example, if Bernoulli sampling is run with
'--prob 0.2 --static-seed', then run again with '--prob 0.3 --static-seed',
all the lines selected in the first run will be selected in the second.
This comes at a cost: in some cases there are faster algorithms that don't
preserve this property. By default, tsv-sample will use faster algorithms
when available. However, the '--compatibility-mode' option switches to
algorithms that assign a random value per line. Printing random values
also engages compatibility mode.
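
The subset property described above can be demonstrated with a small Python
sketch of Bernoulli sampling that draws one random value per line. The
function name bernoulli_sample is hypothetical and this is not tsv-sample's
implementation; it only shows why a fixed seed plus per-line random values
makes runs with different probabilities compatible.

```python
import random


def bernoulli_sample(lines, prob, seed):
    """Keep each line whose per-line random value is below prob.

    Hypothetical helper for illustration. One value is drawn per line, in
    input order, so a given seed assigns the same value to the same line
    regardless of prob.
    """
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < prob]


lines = list(range(1000))
run_20 = bernoulli_sample(lines, 0.2, seed=42)
run_30 = bernoulli_sample(lines, 0.3, seed=42)
# Any value below 0.2 is also below 0.3, so the first run's
# selection is contained in the second run's selection.
print(set(run_20) <= set(run_30))
```

An algorithm that skips ahead without drawing a value for every line would
not give each line a stable value, which is why it loses this property.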