
I have this script that extracts a random 200-character substring from a file:

#!/usr/bin/bash
n=$(stat -c "%s" newfile.txt)
r=$(shuf -i1-"$((n-200+1))" -n1)
< newfile.txt tail -c+"$r" | head -c200
which I run 10,000 times like this:

for N in {1..10000}; do bash extraction_200pb.sh; done > output.txt

I know shuf is very powerful, but I want sampling without replacement: each 200-character extraction should have only one chance of being selected.
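One hedged reading of "without replacement": shuf without --repeat already draws distinct values, so pulling all the offsets in a single shuf call guarantees no start position is picked twice. A minimal sketch (file names follow the question; the stand-in input and the sample count of 10 are illustrative, not from the thread):

```shell
# Stand-in input; the real newfile.txt is the ~8G sequence file.
tr -dc 'ACGT' < /dev/urandom | head -c 5000 > newfile.txt

n=$(stat -c "%s" newfile.txt)
# One shuf call: -n without --repeat draws DISTINCT offsets,
# i.e. sampling without replacement over start positions.
shuf -i 0-"$((n - 200))" -n 10 |
while read -r r; do
    tail -c +"$((r + 1))" newfile.txt | head -c 200   # tail -c is 1-based
    printf '\n'
done > output.txt
```

Each line of output.txt is one 200-character sample; distinct offsets mean no coordinates are reused, though samples may still overlap each other.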

Output should look like this:

>1     
GAACTCTACCAAAAGGTATGTTGCTTTCACAAAAAGCTGCATTCGATCATGTGTATAATCTAGCAAAACTAGTAGGAGGAGCAAAATACCCCGAAATTGTTGCTGCTCAGGCAATGCACGAATCAAACTACCTAGATCCTAGG
ACTAATAGTGTTTATAATGCCACAAATAGAACTAATGCTTTCGGTCAAACTGGTGAC
>2     
GCCTACCGCATAAAACAGCATCACCGCCACGGCTTCAGGGTATTCTCCAATGGCAAAGGCTCCCATGGTCGCGATGGACATTAAGAGAAATTCAGTAAAGAAATCTCCATTTAGAATACTTTTGAATCCTTCTTTTATCACCG
GAAAACCAACTGGGAGATAGGCCACAATGTACCAACCTACTCGCACCCAATCTGTAA
>3     
GCACGTGTCACCGTCAGCATCGCGGCAGCGGAACGGGTCACCCGGATTGCTGTCGGGACCATCGTTTACGCCGTCATTGTCGTTATCGGGATCGCCCGGATTACAAATGCCGTCGCCATCGACGTCGTTACCGTCGTTCGCGG
CATCGGGGAAGCCGGCACCGGCGGCACAGTCATCGCAACCGTCGCCATCGGCATCGA
>4     
GCGTTCGAAGCAATTGCACGAGACCCAAACAACGAATTGCTGGTTGTTGAACTGGAAAACTCTCTAGTCGGAATGCTTCAAATTACTTATATTCCCTACCTGACACATATTGGCAGTTGGCGTTGTCTTATAGAAGGTGTTCG
AATCCATAGTGACTATCGTGGACGAGGTTTTGGTGAGCAAATGTTCGCACATGCGAT
>5     
GTTTAAGACTAACAGCAATCTGTAAGGACATAGGTGCTGGAGTTGAGGTTAGTCTGGAAGATATGATCTGGGCAGAGAAATTGTCCAAAGCAAACACCGCAGCAAGAGGTATGCTAAACACAGCAAGAAGAATAAGTAATGAT
CCTACTGATTCTTTTCTGAATGAGTTGAATATAGGAGACCCCGACTCAACTCATCAT

The input file is a ~8G file that looks like this:

CCAAGATCGCTGGTTGGCGAATCAATTTCATAAACGCCTACGCTTTCAAGGAACGTGTTAAGAATGTTCT
GGCCGAGTTCCTTATGAGACGTTTCGCGTCCCTTAAATCGAATAACGACACGAACCTTGTCGCCGTCATT
AAGAAAACCCTTTGCCTTCTTGGCCTTAATCTGAATATCACGGGTGTCCGTTACAGGTCGCAACTGGATT
TCCTTGACTTCAGAAACAGACTTACGTGAATTCTTCTTGATTTCTTTCTGACGCTTTTCATTTTCATACT
GGAACTTGCCGTAATCAATGATCTTACAAACAGGAATATCACCCTTATCAGAGATCAATACCAAATCAAG
TTCGGCATCAAAAGCGCGATCAAGTGCGTCTTCAATGTCGAGGACCGTTGTTTCTTCACCGTCAACCAAA
CGAATTGTGGAGGACTTGATGTCGTCTCGGGTACTAATTTTATTCACGTATATGTTACTCCTTATGTTGT

Any help would be appreciated. Thanks in advance.

GSQ
  • Do you care for overlapping? For example: do you allow substrings: `(1,200)` and `(2,201)` to be included into the same selection? – thanasisp Dec 19 '20 at 20:52
  • Some details, please. [A] Do you want just 2 million chars (10000 samples x 200) in a single file, no separation? Or some separate files or other organisation? [B] Do you need options for the 10000 and the 200 values, or just for the input filename? I have the tested script (60 lines of bash/awk) but need to take out some debug and comment it a little. – Paul_Pedant Dec 19 '20 at 23:35
  • @Paul_Pedant thank you. A) Yes, just one file is OK. I have this script that cuts into 200-character chunks and adds a ">" and a number to each chunk: cat file_with_random_200chars.txt | tr -d '\n' | dd cbs=200 conv=unblock | nl -n ln | tr '\t' '\n' | paste -d '>\n' /dev/null - - > output_file_with_numbers.txt. But if you have a better/simpler way, that would be great. B) I don't understand what you mean by "options". I would like to provide an input file and, from there, do the random 200-character extraction without replacement. Could you clarify your second question a bit more? :) Thank you Paul – GSQ Dec 20 '20 at 04:16
  • Options in the sense that the script could accept arguments for alternative values for the specific 10000 and 200. I'm concerned you are dealing with \n after the selection of the 200-byte samples -- that is going to realign the boundaries of the samples and the last sample will be short. – Paul_Pedant Dec 20 '20 at 10:36
  • You also had a header line at some point in the past that had to be excluded. Is that still present? – Paul_Pedant Dec 20 '20 at 10:42
  • @Paul_Pedant Oh I see, I guess it would be good to have the option to compare different sizes (10,000 or another number), but the 200 will always remain the same. The header is now excluded. :) – GSQ Dec 20 '20 at 13:49
  • Dear @Paul_Pedant, I'm so sorry, I think overlapping is not that bad. It is okay if they overlap – GSQ Dec 20 '20 at 17:27
  • @thanasisp, I was thinking about it and it's not terrible if they overlap – GSQ Dec 20 '20 at 17:30
  • @GSQ No problem. It is an interesting design problem for me, but I'm wondering if it is specialist enough that we should take it offline, or into chat. Efficiently finding a random selection of whole lines in a huge file, and a random substring of those lines, no overlaps, without a pre-process of the whole file? Really cool! That 5-command pipeline can all be done in the same awk process too. – Paul_Pedant Dec 20 '20 at 19:03
  • Sure! let me know! @Paul_Pedant. I would really appreciate any help – GSQ Dec 20 '20 at 19:22
  • @Paul_Pedant, just in case I added how the input file would look like – GSQ Dec 20 '20 at 19:31
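As an aside on the comments above: the 5-command tr | dd | nl | tr | paste pipeline quoted there can indeed be collapsed into a single awk pass, as Paul_Pedant suggests. A hedged sketch (file names follow the comment; the stand-in input of 450 characters over two lines is illustrative):

```shell
# Stand-in input: 450 chars split over two lines (the real file comes
# from the sampling step in the question).
{ head -c 300 /dev/zero | tr '\0' A; echo
  head -c 150 /dev/zero | tr '\0' C; echo; } > file_with_random_200chars.txt

# One awk pass replacing tr|dd|nl|tr|paste: join all lines, then emit
# ">N" followed by each 200-char chunk (the last chunk may be short).
awk '{ s = s $0 }
     END { for (i = 1; i <= length(s); i += 200)
               printf (">%d\n%s\n", ++n, substr(s, i, 200)) }' \
    file_with_random_200chars.txt > output_file_with_numbers.txt
```

Unlike dd conv=unblock, this does not pad the final chunk, so a trailing short chunk comes through as-is.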

2 Answers


EDIT: I will add another answer with an implementation (which may need modification). This current answer is partly superseded, and I will probably want to delete it soon. Or perhaps update it, as a design note for the actual implementation.

Bare bones of an algorithm. It may not emit strictly random results statistically, but it does appear to satisfy the requirement. I would probably prototype this in awk (using the Cliff RNG, or possibly a read piped from shuf).

  1. Set N to be the number of offsets to be added to an array X (e.g. 10000).

  2. While N > 0 iterate 2a to 2d.
    2a. Append a block of N random offsets to X (using the range in your existing script).
    2b. Sort X, numerically ascending.
    2c. Traverse X comparing each offset versus the next, and marking for deletion any closer than 200 chars from its successor.
    2d. Delete the marked offsets, and set N to the number of deletions made.

  3. At this point we have 10000 random offsets (in ascending order) that do not overlap.
    3a. Emit a series of 10000 tail | head commands (one per offset).
    3b. Pipe those commands through bash.

This needs an initial check that the range is sufficiently sparse for stage 2 to exit eventually -- e.g. the file needs at least 3 * 10000 * 200 chars of headroom. Some research on this may be needed.

EDIT: Checking for overlap between samples only requires the offsets, but checking for overlap with end of line needs the data. So at (2a), read the sample itself, and store it in an array indexed by the offset of the first byte. Check and delete short samples in (2c).

That makes the whole of (3) redundant, as we have already stored every valid sample in memory (and we have a sorted list of valid offsets to index them from). So we can just list all samples, in order of offset.
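The offset logic of steps 1 and 2 above can be sketched with stock tools (a greedy variant of step 2c that keeps each offset only if it clears the previously kept one; Size, Count, and Lth are illustrative values, and the maxIter safety valve is omitted):

```shell
# Sketch of steps 1-2 only: draw offsets, sort, drop overlaps, redraw losses.
Size=1000000; Count=100; Lth=200
: > offsets.txt
N=$Count
while [ "$N" -gt 0 ]; do
    shuf -i 0-"$((Size - Lth))" -n "$N" >> offsets.txt      # 2a: N new offsets
    sort -n -o offsets.txt offsets.txt                      # 2b: ascending
    # 2c/2d: keep an offset only if it starts >= Lth past the last kept one
    awk -v L="$Lth" 'NR == 1 || $1 >= prev + L { print; prev = $1 }' \
        offsets.txt > keep.txt
    mv keep.txt offsets.txt
    N=$(( Count - $(wc -l < offsets.txt) ))                 # redraw the losses
done
```

offsets.txt ends up holding Count ascending, non-overlapping offsets -- the state that step 3 starts from.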

Paul_Pedant
  • Hi @Paul_Pedant! Thank you for your response, I will check it. Just a few things: the input file will not have a header (just in case) and should look similar to the example provided (if that helps). Actually, the overlap is not a problem at all; I would even say it is very likely to happen. One important point here is that it should be a selection without replacement (each 200-character extraction has only one chance to be selected). – GSQ Dec 25 '20 at 15:29
  • The 200-character chunks could be repeated along the main file, but what I mean is that the random number won't pick the same coordinates twice for the 200 characters. So I won't end up with 2 identical 200-character sequences from the same position. For example, the same 1,200 and 1,200 characters. Does that make sense? – GSQ Dec 25 '20 at 15:30
  • @GSQ For both shuf and gawk, I am certain the cycle period of rand() is higher than 10000, so you would never get any exact repeats on that many samples. (I might set up tests for both to set a lower limit on cycle length, but I am guessing both will actually be longer than your 8 billion file size too.) My posted implementation delivers 10,000 samples such that (a) each sample will contain only text from a single row, and (b) no sample will overlap any other sample. – Paul_Pedant Dec 26 '20 at 12:00
  • Researched duplication in RNGs. [A] `shuf` explicitly excludes dups (unless you allow --repeat). The penalty for this is that it keeps a list of used values in RAM: 8 billion throws `shuf: memory exhausted`. [B] Standard `GNU/awk` outputs doubles between 0 and 1. Asking for a million random values with 16-digit decimal precision output gives me around 250 dups, but it does not appear to be periodic. For 10000 values, 100 tests never gave me a duplicate. [C] `GNU/awk` using the Cliff RNG does not give dups for a million values with 16-digit decimal precision (in 100 trials). – Paul_Pedant Dec 26 '20 at 21:52
  • Awesome!! this is great!! @Paul_Pedant – GSQ Dec 27 '20 at 04:28
  • Dear @Paul_Pedant. I love your script, it's super cool! However, I want to compare my results with a script that does sampling with replacement. Therefore, the idea is to use both scripts. Could you kindly advise me how to modify your script to get this? Thanks in advance – GSQ Jan 20 '21 at 17:09
  • @GSQ I have been working on a version that would do both with and without replacement, but I need to clean it up. I tried some test cases, and found it broke with an exact overlap, and looped if it ran out of random selections. I can post a full update tomorrow. Meanwhile, your edit has unbalanced the syntax: you need to remove the `} else` that follows, too. I'm not clear there is any statistical difference in extract with replacement if we are taking 10,000 x 200 bytes from 8GB: typically, I see only one or two overlaps in 10,000 samplings. – Paul_Pedant Jan 23 '21 at 00:33
  • Thank you @Paul_Pedant ! – GSQ Jan 23 '21 at 00:45
  • @GSQ I found you posted my first script as AndresS at community.unix.com without acknowledgement, where you also asked a very similar question as GuitosS. As you don't appear to have any interest in learning anything yourself, I consider doing your job for you would incur my usual daily fee of 475 GBP. – Paul_Pedant Mar 05 '21 at 19:13

This is an implementation of a solution in awk. The data is 8GB of pseudo-random hex digits (actually a hex conversion of about 12 man pages, duplicated 3300 times). It is about 11 million lines averaging 725 bytes per line.

This is a timed execution.

Paul--) ls -l tdHuge.txt
-rw-r--r-- 1 paul paul 8006529300 Dec 24 22:38 tdHuge.txt
Paul--) ./rndSelect
inFile ./tdHuge.txt; Size 8006529300; Count 10000; Lth 200; maxIter 50; Db 1;
Iteration   1 needs  10000
Iteration   2 needs   2712
Overlap   9561: 7663038508 to 7663038657
Iteration   3 needs    728
Iteration   4 needs    195
Iteration   5 needs     50
Iteration   6 needs     11
Iteration   7 needs      2
Required 7 iterations
Reporting 10000 samples

real    2m3.326s
user    0m3.496s
sys 0m10.340s
Paul--) wc Result.txt
  20000   20000 2068894 Result.txt
Paul--) head -n 8 Result.txt | cut -c 1-40
>1
5706C69636174656420696E666F726D6174696F6
>2
20737472696E672028696E207768696368206361
>3
20646F6573206E6F742067657420612068617264
>4
647320616E642073746F7265732E204966207468
Paul--) tail -n 8 Result.txt | cut -c 1-40
>9997
6F7374207369676E69666963616E7420646F7562
>9998
7472696E676F702D73747261746567793D616C67
>9999
865726520736F6D652066696C6573206D7573742
>10000
5726E65642E205768656E20746865202D66206F7
Paul--) 

It requires iterations because it makes random probes into the file. If a probe overlaps an adjacent one, or a newline, it is discarded and a smaller batch of new probes is made. With an average line length of 725 bytes and a sample length of 200, almost 30% of probes (199/725 ≈ 27%) will start too close to the end of a line to be acceptable. We don't know the average line length of the real data -- longer lines would improve the success ratio.

We also do not know whether the header lines (as noted in a previous related question of 04-Dec-2020) are still present in the file. But provided every header line is less than the sample length of 200, the header lines will be discarded (serendipity at its best).

The code is mainly GNU/awk (minimal bash) and has some comments. There is a lot of residual debug which can be hidden by setting Db=0 in the options.

#! /bin/bash

#.. Select random non-overlapping character groups from a file.

export LC_ALL="C"

#.. These are the optional values that will need to be edited.
#.. Command options could be used to set these from scripts arguments.

inFile="./tdHuge.txt"
outFile="./Result.txt"
Count=10000     #.. Number of samples.
Lth=200         #.. Length of each sample.
maxIter=50      #.. Prevents excessive attempts.

Size="$( stat -c "%s" "${inFile}" )"
Seed="$( date '+%N' )"
Db=1

#.. Extracts random non-overlapping character groups from a file.

Selector () {

    local Awk='
#.. Seed the random number generation, and show the args being used.
BEGIN {
    NIL = ""; NL = "\n"; SQ = "\047";
    Db7 = 0;    #.. Verbose debug trace; set to 1 by hand when needed.
    srand (Seed % PROCINFO["pid"]);
    if (Db) printf ("inFile %s; Size %d; Count %d; Lth %d; maxIter %s; Db %s;\n",
        inFile, Size, Count, Lth, maxIter, Db);
    fmtCmd = "dd bs=%d count=1 if=%s iflag=skip_bytes skip=%d status=none";
}
#.. Constructs an array of random file offsets, replacing overlaps.
#.. Existing offsets are indexed from 1 to Count, deleting overlaps.
#.. Additions are indexed from -1 down to -N to avoid clashes.

function Offsets (N, Local, Iter, nSeek, Seek, Text, j) {

    while (N > 0 && Iter < maxIter) {
        ++Iter;
        if (Db) printf ("Iteration %3d needs %6d\n", Iter, N);

        for (j = 1; j <= N; ++j) {
            Seek[-j] = int ((Size - Lth) * rand());
            Text[Seek[-j]] = getSample( Seek[-j], Lth);
            if (Db7) printf ("Added %10d: %s\n", Seek[-j], Text[Seek[-j]]);
        }
        #.. Reindex in numeric order for overlap checking.
        nSeek = asort (Seek);
        if (Db7) for (j in Seek) printf ("%6d: %10d\n", j, Seek[j]);

        #.. Discard offsets that overlap the next selection.
        N = 0; for (j = 1; j <= nSeek; ++j) {
            if (j < nSeek && Seek[j] + Lth > Seek[j+1]) {
                if (Db) printf ("Overlap %6d: %10d to %10d\n",
                    j, Seek[j], Seek[j+1]);
                ++N; delete Text[Seek[j]]; delete Seek[j];
            } else if (length (Text[Seek[j]]) < Lth) {
                if (Db7) printf ("Short   %6d: %10d\n",
                    j, Seek[j]);
                ++N; delete Text[Seek[j]]; delete Seek[j];
            }
        }
    }
    if (Iter >= maxIter) {
        printf ("Failed with overlaps after %d iterations\n", Iter);
    } else {
        printf ("Required %d iterations\n", Iter);
        Samples( nSeek, Seek, Text);
    }
}
#.. Returns n bytes from the input file from position p.
function getSample (p, n, Local, cmd, tx) {

    cmd = sprintf (fmtCmd, n, SQ inFile SQ, p);
    if (Db7) printf ("cmd :%s:\n", cmd);
    cmd | getline tx; close (cmd);
    return (tx);
}
#.. Send samples to the output file.
function Samples (nSeek, Seek, Text, Local, j) {

    printf ("Reporting %d samples\n", nSeek);
    for (j = 1; j <= nSeek; ++j) {
        printf (">%d\n%s\n", j, Text[Seek[j]]) > outFile;
    }
    close (outFile);
}
END { Offsets( Count); }
'
    echo | awk -v Size="${Size}" -v inFile="${inFile}" \
        -v outFile="${outFile}" -v Count="${Count}" -v Lth="${Lth}" \
        -v maxIter="${maxIter}" \
        -v Db="${Db}" -v Seed="${Seed}" -f <( printf '%s' "${Awk}" )
}

#.. Test.

    time Selector
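The dd incantation that fmtCmd builds can be sanity-checked on its own. A tiny demo (demo.txt, the offset, and the length are illustrative; iflag=skip_bytes is GNU dd):

```shell
# getSample() builds exactly this kind of command: read n bytes starting
# at byte offset p, without reading the rest of the file.
printf 'ABCDEFGHIJKLMNOPQRSTUVWXYZ' > demo.txt
p=10; n=5
dd bs="$n" count=1 if=demo.txt iflag=skip_bytes skip="$p" status=none
# prints KLMNO
```

skip_bytes makes skip= count bytes rather than blocks, which is what lets the script seek to arbitrary random offsets while reading a single bs-sized block.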
Paul_Pedant