This is an implementation of a solution in awk. The test data is 8 GB of pseudo-random hex digits (actually a hex conversion of about 12 man pages, duplicated 3300 times): about 11 million lines averaging 725 bytes per line.
This is a timed execution.
Paul--) ls -l tdHuge.txt
-rw-r--r-- 1 paul paul 8006529300 Dec 24 22:38 tdHuge.txt
Paul--) ./rndSelect
inFile ./tdHuge.txt; Size 8006529300; Count 10000; Lth 200; maxIter 50; Db 1;
Iteration 1 needs 10000
Iteration 2 needs 2712
Overlap 9561: 7663038508 to 7663038657
Iteration 3 needs 728
Iteration 4 needs 195
Iteration 5 needs 50
Iteration 6 needs 11
Iteration 7 needs 2
Required 7 iterations
Reporting 10000 samples
real 2m3.326s
user 0m3.496s
sys 0m10.340s
Paul--) wc Result.txt
20000 20000 2068894 Result.txt
Paul--) head -n 8 Result.txt | cut -c 1-40
>1
5706C69636174656420696E666F726D6174696F6
>2
20737472696E672028696E207768696368206361
>3
20646F6573206E6F742067657420612068617264
>4
647320616E642073746F7265732E204966207468
Paul--) tail -n 8 Result.txt | cut -c 1-40
>9997
6F7374207369676E69666963616E7420646F7562
>9998
7472696E676F702D73747261746567793D616C67
>9999
865726520736F6D652066696C6573206D7573742
>10000
5726E65642E205768656E20746865202D66206F7
Paul--)
It requires iterations because it makes random probes into the file. If a probe overlaps an adjacent one, or crosses a newline, it is discarded and a smaller batch of new probes is made. With an average line length of 725 bytes and a sample length of 200, almost 30% of probes will fall too close to the end of a line to be acceptable. We don't know the average line length of the real data -- longer lines would improve the success ratio.
We also do not know whether the header lines (as noted in a previous related question of 04-Dec-2020) are still present in the file. But provided every header line is less than the sample length of 200, the header lines will be discarded (serendipity at its best).
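As a rough check on that 30% figure: each round retries only the failed probes, so the batch sizes should decay geometrically. A minimal model, using an assumed per-probe failure rate of 199/725 (a 200-byte probe fails if it starts within 199 bytes of a newline; these numbers are modelled, not measured from the run):

```shell
# Rough model of the retry loop: with 725-byte lines and 200-byte
# samples, roughly 199/725 (about 27%) of probes cross a newline and
# must be retried. Assumed numbers only.
awk 'BEGIN {
    p = 199 / 725;                 # modelled probability a probe is discarded
    need = 10000;                  # initial batch size
    for (Iter = 1; need >= 1; ++Iter) {
        printf ("Iteration %d needs %d\n", Iter, int (need));
        need *= p;                 # expected failures to retry next round
    }
}'
```

This predicts batch sizes of the same order as the observed run (10000, 2712, 728, ...), converging in about eight rounds.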
The code is mainly GNU awk (with a minimal bash wrapper) and has some comments. There is a lot of residual debug code, which can be silenced by setting Db=0 in the options.
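The extraction primitive underneath is GNU dd with iflag=skip_bytes, which makes skip= count bytes rather than blocks, so the script can read Lth bytes from any byte offset without scanning the file. A small demonstration with a throwaway file (the path here is illustrative only):

```shell
# Write a known 26-byte file, then read 10 bytes starting at byte 5.
# iflag=skip_bytes makes skip= a byte offset (GNU dd extension), which
# is what lets the script probe arbitrary positions cheaply.
printf 'abcdefghijklmnopqrstuvwxyz' > /tmp/ddDemo.txt
dd bs=10 count=1 if=/tmp/ddDemo.txt iflag=skip_bytes skip=5 status=none
# prints fghijklmno
```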
#! /bin/bash
#.. Select random non-overlapping character groups from a file.
export LC_ALL="C"
#.. These are the optional values that will need to be edited.
#.. Command options could be used to set these from script arguments.
inFile="./tdHuge.txt"
outFile="./Result.txt"
Count=10000 #.. Number of samples.
Lth=200 #.. Length of each sample.
maxIter=50 #.. Prevents excessive attempts.
Size="$( stat -c "%s" "${inFile}" )"
Seed="$( date '+%N' )"
Db=1
#.. Extracts random non-overlapping character groups from a file.
Selector () {
    local Awk='
#.. Seed the random number generation, and show the args being used.
BEGIN {
    NIL = ""; NL = "\n"; SQ = "\047";
    srand (Seed % PROCINFO["pid"]);
    if (Db) printf ("inFile %s; Size %d; Count %d; Lth %d; maxIter %s; Db %s;\n",
        inFile, Size, Count, Lth, maxIter, Db);
    fmtCmd = "dd bs=%d count=1 if=%s iflag=skip_bytes skip=%d status=none";
}
#.. Constructs an array of random file offsets, replacing overlaps.
#.. Existing offsets are indexed from 1 to Count, deleting overlaps.
#.. Additions are indexed from -1 down to -N to avoid clashes.
#.. Db7 (unset by default) enables the extra-verbose debug.
function Offsets (N, Local, Iter, nSeek, Seek, Text, j) {
    while (N > 0 && Iter < maxIter) {
        ++Iter;
        if (Db) printf ("Iteration %3d needs %6d\n", Iter, N);
        for (j = 1; j <= N; ++j) {
            Seek[-j] = int ((Size - Lth) * rand());
            Text[Seek[-j]] = getSample( Seek[-j], Lth);
            if (Db7) printf ("Added %10d: %s\n", Seek[-j], Text[Seek[-j]]);
        }
        #.. Reindex in numeric order for overlap checking.
        nSeek = asort (Seek);
        if (Db7) for (j in Seek) printf ("%6d: %10d\n", j, Seek[j]);
        #.. Discard offsets that overlap the next selection.
        N = 0; for (j = 1; j < nSeek; ++j) {
            if (Seek[j] + Lth > Seek[j+1]) {
                if (Db) printf ("Overlap %6d: %10d to %10d\n",
                    j, Seek[j], Seek[j+1]);
                ++N; delete Text[Seek[j]]; delete Seek[j];
            } else if (length (Text[Seek[j]]) < Lth) {
                if (Db7) printf ("Short %6d: %10d\n",
                    j, Seek[j]);
                ++N; delete Text[Seek[j]]; delete Seek[j];
            }
        }
    }
    if (Iter >= maxIter) {
        printf ("Failed with overlaps after %d iterations\n", Iter);
    } else {
        printf ("Required %d iterations\n", Iter);
        Samples( nSeek, Seek, Text);
    }
}
#.. Returns n bytes from the input file from position p.
function getSample (p, n, Local, cmd, tx) {
    cmd = sprintf (fmtCmd, n, SQ inFile SQ, p);
    if (Db7) printf ("cmd :%s:\n", cmd);
    cmd | getline tx; close (cmd);
    return (tx);
}
#.. Send samples to the output file.
function Samples (nSeek, Seek, Text, Local, j) {
    printf ("Reporting %d samples\n", nSeek);
    for (j = 1; j <= nSeek; ++j) {
        printf (">%d\n%s\n", j, Text[Seek[j]]) > outFile;
    }
    close (outFile);
}
END { Offsets( Count); }
'
    echo | awk -v Size="${Size}" -v inFile="${inFile}" \
        -v outFile="${outFile}" -v Count="${Count}" -v Lth="${Lth}" \
        -v maxIter="${maxIter}" \
        -v Db="${Db}" -v Seed="${Seed}" -f <( printf '%s' "${Awk}" )
}
#.. Test.
time Selector
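Not part of the script above, but a quick way I would sanity-check the output afterwards: Result.txt alternates ">n" header lines with sample lines, so a one-liner can count the samples and flag any that are not exactly 200 bytes (this check is my own addition):

```shell
# Sanity check on Result.txt: even-numbered lines are the samples;
# count them and report any whose length is not the expected 200.
awk 'NR % 2 == 0 && length ($0) != 200 { ++bad }
     END { printf ("%d samples, %d wrong length\n", NR / 2, bad + 0) }' Result.txt
```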