This is similar to "Shuffle two parallel text files".

I have:

  • two large CSV files with parallel lines (they represent 'before' and 'after' states for particular items). The fields are sometimes strings, sometimes numbers.

  • a sufficiently long random data file to use with shuf

To get a matching random sample from both files, I thought of:

shuf -n10 --random-source="random.csv" "file1" 
shuf -n10 --random-source="random.csv" "file2" 

but the two samples no longer match: the lines picked from file1 do not correspond to the lines picked from file2.

However, if I put line-numbers in front, it solves the problem:

shuf -n10 --random-source="random.csv" <(cat -n "file1") 
shuf -n10 --random-source="random.csv" <(cat -n "file2")

Can someone explain why?

Here is a sample of random.csv:

0.293076138
0.446732207
0.552989654
0.16141527
0.099383023
...

Here is a snippet from the two files:

VA,DEFAULT,72.8027,11.9534.....
VA,DEFAULT,61.8356,11.9342....
VA,DEFAULT,61.8356,....

Note that the first two fields are identical in most of the rows in both files. Maybe this is the issue? I don't know shuf well enough.
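
For reference, one way to get a matched sample without relying on two separate shuf runs making identical choices is to shuffle once and split afterwards. This is only a minimal sketch: it assumes GNU shuf, assumes the data contains no tab characters, and uses hypothetical demo files (before.csv, after.csv, random.src) in place of the real ones.

```shell
# Hypothetical demo data: line N of before.csv pairs with line N of after.csv.
tmp=$(mktemp -d)
seq 1 100 | sed 's/^/VA,DEFAULT,before,/' > "$tmp/before.csv"
seq 1 100 | sed 's/^/VA,DEFAULT,after,/'  > "$tmp/after.csv"

# Stand-in for random.csv: any fixed, sufficiently long file of bytes.
seq 1 10000 > "$tmp/random.src"

# Glue the parallel lines together with a tab (paste's default delimiter),
# draw ONE sample of 10 combined lines, then split the columns apart again.
paste "$tmp/before.csv" "$tmp/after.csv" \
  | shuf -n10 --random-source="$tmp/random.src" > "$tmp/sample.tsv"

cut -f1 "$tmp/sample.tsv" > "$tmp/before.sample"   # sampled 'before' lines
cut -f2 "$tmp/sample.tsv" > "$tmp/after.sample"    # the matching 'after' lines
```

Because each output line carries both halves of a pair through the shuffle, the two sample files stay parallel regardless of how shuf consumes the random source.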

Tim
  • can't reproduce. Please paste a snippet of `random.csv` – iruvar Dec 05 '19 at 01:28
  • I also can not reproduce this behaviour. Can we assume that `random.csv` is not changing between the invocations of `shuf`? – Kusalananda Dec 05 '19 at 10:52
  • @Kusalananda Certainly. It is simply a list of random numbers that I saved as a .csv file. It remains fixed. – Tim Dec 05 '19 at 21:29
  • Note: in my Debian 9 `shuf (GNU coreutils) 8.26` generates different sequences for seekable vs. unseekable input, when `-n` is lower or equal to the number of lines. I.e. ` – Kamil Maciorowski Dec 05 '19 at 22:26
  • a good suggestion, but no, all files are plain csv files, no pipes, no redirections. – Tim Dec 12 '19 at 03:29
