helpTextVerbose

Undocumented in source.
immutable
auto helpTextVerbose = q"EOS
Synopsis: tsv-sample [options] [file...]

Sample input lines or randomize their order. Several modes of operation
are available:
* Shuffling (the default): All input lines are output in random order. All
  orderings are equally likely.
* Random sampling (--n|num N): A random sample of N lines are selected and
  written to standard output. By default, selected lines are written in
  random order. All sample sets and orderings are equally likely. Use
  --i|inorder to write the selected lines in the original input order.
* Weighted random sampling (--n|num N, --w|weight-field F): A weighted
  sample of N lines is produced. Weights are taken from field F. Lines are
  output in weighted selection order. Use --i|inorder to write in original
  input order. Omit --n|num to shuffle all lines (weighted shuffling).
* Sampling with replacement (--r|replace, --n|num N): All input lines are
  read in, then lines are repeatedly selected at random and written out.
  This continues until N lines are output. Individual lines can be written
  multiple times. Output continues forever if N is zero or not provided.
* Bernoulli sampling (--p|prob P): A random subset of lines is selected
  based on probability P, a 0.0-1.0 value. This is a streaming operation.
  A decision is made on each line as it is read. Line order is not changed.
* Distinct sampling (--k|key-fields F, --p|prob P): Input lines are sampled
  based on the values in the key fields. A subset of keys are chosen based
  on the inclusion probability (a 'distinct' set of keys). All lines with
  one of the selected keys are output. Line order is not changed.

Fields: Fields are specified by field number or name. Field names require
the input file to have a header line. Use '--help-fields' for details.

Sample size: The '--n|num' option controls the sample size for all
sampling methods. In the case of simple and weighted random sampling it
also limits the amount of memory required.

Controlling the random seed: By default, each run produces a different
randomization or sampling. Using '--s|static-seed' changes this so
multiple runs produce the same results. This works by using the same
random seed each run. The random seed can be specified using
'--v|seed-value'. This takes a non-zero, 32-bit positive integer. (A zero
value is a no-op and ignored.)

Memory use: Bernoulli sampling and distinct sampling make decisions on
each line as it is read, there is no memory accumulation. These algorithms
can run on arbitrary size inputs. Sampling with replacement reads all
lines into memory and is limited by available memory. Shuffling also reads
all lines into memory and is similarly limited. Random sampling uses
reservoir sampling, and only needs to hold the sample size (--n|num) in
memory. The input data can be of any length.

Weighted sampling: Weighted random sampling is done using an algorithm
described by Pavlos Efraimidis and Paul Spirakis. Weights should be
positive values representing the relative weight of the entry in the
collection. Counts and similar can be used as weights, it is *not*
necessary to normalize to a [0,1] interval. Negative values are not
meaningful and given the value zero. Input order is not retained, instead
lines are output ordered by the randomized weight that was assigned. This
means that a smaller valid sample can be produced by taking the first N
lines of output. For more info on the sampling approach see:
* Wikipedia: https://en.wikipedia.org/wiki/Reservoir_sampling
* "Weighted Random Sampling over Data Streams", Pavlos S. Efraimidis
  (https://arxiv.org/abs/1012.0256)

Printing random values: Most of the sampling algorithms work by generating
a random value for each line. (See "Compatibility mode" below.) The nature
of these values depends on the sampling algorithm. They are used for both
line selection and output ordering. The '--p|print-random' option can be
used to print these values. The random value is prepended to the line
separated by the --d|delimiter char (TAB by default). The
'--gen-random-inorder' option takes this one step further, generating
random values for all input lines without changing the input order. The
types of values currently used by these sampling algorithms:
* Unweighted sampling: Uniform random value in the interval [0,1]. This
  includes Bernoulli sampling and unweighted line order randomization.
* Weighted sampling: Value in the interval [0,1]. Distribution depends on
  the values in the weight field. It is used as a partial ordering.
* Distinct sampling: An integer, zero and up, representing a selection
  group. The inclusion probability determines the number of selection
  groups.
* Sampling with replacement: Random value printing is not supported.

The specifics behind these random values are subject to change in future
releases.

Compatibility mode: As described above, many of the sampling algorithms
assign a random value to each line. This is useful when printing random
values. It has another occasionally useful property: repeated runs with
the same static seed but different selection parameters are more
compatible with each other, as each line gets assigned the same random
value on every run. For example, if Bernoulli sampling is run with
'--prob 0.2 --static-seed', then run again with '--prob 0.3
--static-seed', all the lines selected in the first run will be selected
in the second. This comes at a cost: in some cases there are faster
algorithms that don't preserve this property. By default, tsv-sample will
use faster algorithms when available. However, the '--compatibility-mode'
option switches to algorithms that assign a random value per line.
Printing random values also engages compatibility mode.

Options:
EOS";
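Example

The "Compatibility mode" paragraph of the help text says that, with a static seed, every line selected by Bernoulli sampling at '--prob 0.2' is also selected at '--prob 0.3'. The following is a minimal sketch of why a per-line random value gives that property. It is not the tsv-sample implementation: `bernoulliSample` is a hypothetical helper, the seed value 42 is arbitrary, and the selection rule (keep a line when its random value is below the probability) is an assumption made for illustration.

```d
import std.algorithm : all, canFind;
import std.random : Random, uniform01;
import std.stdio : writeln;

/// Hypothetical helper, not part of tsv-sample. Draws one uniform random
/// value per line from a seeded generator and keeps the lines whose value
/// is below `prob`. Input order is preserved.
string[] bernoulliSample(string[] lines, double prob, uint seed)
{
    auto rng = Random(seed);           // same seed gives the same value sequence
    string[] selected;
    foreach (line; lines)
    {
        immutable r = uniform01(rng);  // one value per line, in [0, 1)
        if (r < prob) selected ~= line;
    }
    return selected;
}

void main()
{
    auto lines = ["a", "b", "c", "d", "e", "f", "g", "h"];
    auto s20 = bernoulliSample(lines, 0.2, 42);
    auto s30 = bernoulliSample(lines, 0.3, 42);

    // Each line receives the same random value in both runs, so any line
    // kept at 0.2 is also kept at 0.3.
    assert(s20.all!(l => s30.canFind(l)));

    writeln(s20);
    writeln(s30);
}
```

Because the random value depends only on the seed and the line's position, raising the probability can only add lines to the selection. An approach that skips ahead by a random number of lines instead of drawing a value for each line would not generally preserve this subset relationship, which matches the trade-off the help text describes.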

Meta