Randomize all the lines in files or standard input using assigned random weights
and sorting.
All lines in files and/or standard input are read in and written out in random
order. This algorithm assigns a random value to each line and sorts the lines by
that value. This approach supports both weighted sampling and simple random
sampling (unweighted).
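
The assign-and-sort idea can be sketched in a few lines of Python. This is a
minimal illustration, not the implementation documented here: the function name
and parameters are hypothetical, and the weighted key u^(1/w) is one common
choice (the Efraimidis-Spirakis key) that is assumed rather than taken from this
description. Weights are assumed to be strictly positive.

```python
import random


def randomize_lines_via_sort(lines, weights=None, seed=None):
    """Return the lines in random order by assigning each a random key and sorting.

    Hypothetical sketch. With weights=None every line gets a uniform key
    (simple random ordering). With weights, each key is u ** (1 / w), an
    assumed weighting scheme, so heavier lines tend to sort earlier.
    """
    rng = random.Random(seed)
    if weights is None:
        keyed = [(rng.random(), line) for line in lines]
    else:
        # Assumption: all weights are strictly positive.
        keyed = [(rng.random() ** (1.0 / w), line)
                 for line, w in zip(lines, weights)]
    # Largest key first; only the key takes part in the comparison.
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [line for _, line in keyed]
```

Because every line and its key are held in a list before sorting, memory use
grows with the input size, which is the limitation this description notes.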
This is significantly faster than heap-based reservoir sampling in the case where
the entire file is being read. See also randomizeLinesViaShuffle for the unweighted
case, which is a little faster, at the cost of not supporting random value printing
or compatibility mode.
Input data size is limited by available memory. Disk-oriented techniques are needed
when the data does not fit in memory. For example, generate random values
line-by-line (as with --gen-random-inorder) and sort the result with a disk-backed
sort program such as GNU sort.
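
The line-by-line step of that disk-oriented approach might look like the
following Python sketch. It is an illustration under assumptions, not the
behavior of --gen-random-inorder: the function name, the tab separator, and the
fixed-width key format are all hypothetical choices made for the example.

```python
import random


def write_lines_with_random_keys(source, sink, seed=None):
    """Stream lines from source to sink, prefixing each with a random key.

    Hypothetical sketch. Only one line is held in memory at a time, so the
    input can be larger than available memory. Returns the number of lines
    written.
    """
    rng = random.Random(seed)
    count = 0
    for line in source:
        text = line.rstrip("\n")
        # Fixed-width decimal key followed by a tab, then the original line.
        sink.write(f"{rng.random():.17f}\t{text}\n")
        count += 1
    return count
```

The keyed output can then be handed to an external, disk-backed sort on the
first field, after which the key column is dropped to leave the lines in random
order.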