helpTextVerbose

Undocumented in source.
immutable
auto helpTextVerbose = q"EOS
Synopsis: tsv-sample [options] [file...]

Sample input lines or randomize their order. Several modes of operation
are available:
* Line order randomization (the default): All input lines are output in a
  random order. All orderings are equally likely.
* Weighted line order randomization (--w|weight-field): Lines are selected
  using weighted random sampling, with the weight taken from a field.
  Lines are output in weighted selection order, reordering the lines.
* Sampling with replacement (--r|replace, --n|num): All input is read into
  memory, then lines are repeatedly selected at random and written out.
  This continues until --n|num samples are output. Lines can be selected
  multiple times. Output continues forever if --n|num is zero or not
  specified.
* Bernoulli sampling (--p|prob): A random subset of lines is output based
  on an inclusion probability. This is a streaming operation. A selection
  decision is made on each line as it is read. Line order is not changed.
* Distinct sampling (--k|key-fields, --p|prob): Input lines are sampled
  based on the values in the key field. A subset of the keys is chosen
  based on the inclusion probability (a 'distinct' set of keys). All lines
  with one of the selected keys are output. Line order is not changed.

Sample size: The '--n|num' option limits the sample size produced. This
speeds up line order randomization and weighted sampling significantly
(details below). It is also used to terminate sampling with replacement.

Controlling the random seed: By default, each run produces a different
randomization or sampling. Using '--s|static-seed' changes this so
multiple runs produce the same results. This works by using the same
random seed each run. The random seed can be specified using
'--v|seed-value'. This takes a non-zero, 32-bit positive integer. (A zero
value is a no-op and is ignored.)

Memory use: Bernoulli sampling and distinct sampling make decisions on
each line as it is read, so there is no memory accumulation. These
algorithms support arbitrary size inputs. Sampling with replacement reads
all lines into memory and is limited by available memory. The line order
randomization algorithms hold the full output set in memory prior to
generating results. This ultimately limits the size of the output set. For
these, memory needs can be reduced by using a sample size (--n|num). This
engages reservoir sampling. Output order is not affected. Both
'tsv-sample -n 1000' and 'tsv-sample | head -n 1000' produce the same
results, but the former is quite a bit faster.

Weighted sampling: Weighted random sampling is done using an algorithm
described by Pavlos Efraimidis and Paul Spirakis. Weights should be
positive values representing the relative weight of the entry in the
collection. Counts and similar can be used as weights; it is *not*
necessary to normalize to a [0,1] interval. Negative values are not
meaningful and are given the value zero. Input order is not retained;
instead, lines are output ordered by the randomized weight that was
assigned. This means that a smaller valid sample can be produced by taking
the first N lines of output. For more info on the sampling approach see:
* Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling
* "Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis
  (https://arxiv.org/abs/1012.0256)

Printing random values: Most of the sampling algorithms work by generating
a random value for each line. (See "Compatibility mode" below.) The nature
of these values depends on the sampling algorithm. They are used for both
line selection and output ordering. The '--p|print-random' option can be
used to print these values. The random value is prepended to the line
separated by the --d|delimiter char (TAB by default).

The '--q|gen-random-inorder' option takes this one step further,
generating random values for all input lines without changing the input
order. The types of values currently used by these sampling algorithms:
* Unweighted sampling: Uniform random value in the interval [0,1]. This
  includes Bernoulli sampling and unweighted line order randomization.
* Weighted sampling: Value in the interval [0,1]. Distribution depends on
  the values in the weight field. It is used as a partial ordering.
* Distinct sampling: An integer, zero and up, representing a selection
  group. The inclusion probability determines the number of selection
  groups.
* Sampling with replacement: Random value printing is not supported.

The specifics behind these random values are subject to change in future
releases.

Compatibility mode: As described above, many of the sampling algorithms
assign a random value to each line. This is useful when printing random
values. It has another occasionally useful property: repeated runs with
the same static seed but different selection parameters are more
compatible with each other, as each line gets assigned the same random
value on every run. For example, if Bernoulli sampling is run with
'--prob 0.2 --static-seed', then run again with
'--prob 0.3 --static-seed', all the lines selected in the first run will
be selected in the second. This comes at a cost: in some cases there are
faster algorithms that don't preserve this property. By default,
tsv-sample will use faster algorithms when available. However, the
'--compatibility-mode' option switches to algorithms that assign a random
value per line. Printing random values also engages compatibility mode.

Options:
EOS";
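The compatibility property described in the help text can be illustrated with a minimal sketch. It assumes each line is assigned one uniform random value from a seeded generator and is kept when that value falls below the inclusion probability. The function name `bernoulliSelect` and its parameters are hypothetical and are not taken from the tsv-sample source; the actual implementation may differ.

```d
import std.random : Random, uniform01;
import std.stdio : writeln;

/// Hypothetical helper (not tsv-sample code): returns the indices of the
/// lines kept by Bernoulli sampling when every line gets one random value.
size_t[] bernoulliSelect(size_t numLines, double prob, uint seed)
{
    auto rng = Random(seed);    // same seed => same value for each line
    size_t[] selected;
    foreach (i; 0 .. numLines)
    {
        immutable double r = uniform01(rng);    // one value per line, in [0, 1)
        if (r < prob) selected ~= i;            // keep the line if below prob
    }
    return selected;
}

void main()
{
    import std.algorithm : all, canFind;

    auto a = bernoulliSelect(1000, 0.2, 42);
    auto b = bernoulliSelect(1000, 0.3, 42);

    // Every line kept at probability 0.2 is also kept at probability 0.3.
    assert(a.all!(i => b.canFind(i)));
    writeln(a.length, " lines at 0.2; ", b.length, " lines at 0.3");
}
```

Because the value assigned to a line depends only on the seed and the line's position, raising the probability can only add lines to the selection. A faster algorithm that skips ahead by a random number of lines would not give this guarantee, which is the trade-off the help text describes.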
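The weighted ordering described in the help text can be sketched as follows. This assumes the key form u^(1/w) from the cited Efraimidis and Spirakis paper, with u drawn uniformly from [0, 1) and w the line's weight, and it treats non-positive weights as key zero. The `Entry` struct and `weightedOrder` function are hypothetical names for illustration only; tsv-sample's own code may be organized differently.

```d
import std.algorithm : sort;
import std.math : pow;
import std.random : Random, uniform01;
import std.stdio : writeln;

/// One input line with its weight and the random key assigned to it.
struct Entry { string line; double weight; double key; }

/// Hypothetical helper (not tsv-sample code): orders lines by a weighted
/// random key, largest key first.
Entry[] weightedOrder(string[] lines, double[] weights, uint seed)
{
    assert(lines.length == weights.length);
    auto rng = Random(seed);
    Entry[] entries;
    foreach (i, line; lines)
    {
        immutable double w = weights[i];
        immutable double u = uniform01(rng);    // uniform value in [0, 1)
        // Assumed key form: u^(1/w). Non-positive weights get key 0,
        // so those lines sort last.
        immutable double key = (w > 0.0) ? pow(u, 1.0 / w) : 0.0;
        entries ~= Entry(line, w, key);
    }
    entries.sort!((a, b) => a.key > b.key);     // descending by key
    return entries;
}

void main()
{
    auto ordered = weightedOrder(["a", "b", "c"], [1.0, 10.0, -3.0], 42);
    // Print the key, a TAB, then the line.
    foreach (e; ordered) writeln(e.key, "\t", e.line);
}
```

Since the output is already ordered by key, taking the first N entries gives a valid weighted sample of size N, which matches the statement in the help text about taking the first N lines of output.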
