
I want to extract the numbers common to all files. I have 1000 files in a folder, and I want to compare all of them and find the numbers present in every file. I have used the code below:

for ((i=2;i<=10000;i++))  
do
comm -12 --nocheck-order a.txt "$i".txt > final.txt
mv final.txt file.txt
done

But it only overwrites, comparing just the last file with a.txt. I want the numbers common to all files.

Let's say a.txt is:

1
3
47
8
6
7

1.txt is:

2
3
6
7
8

2.txt is:

3
5
6
7
9

And so on for 3.txt, 4.txt, ..., 1000.txt. If this works for these three files, it should work for all of them. So the common numbers here are:

3
7

while it is giving me

3
8
3

Please let me know how I can proceed.

  • Where did file.txt come from? Should it not be a.txt instead? We need to update a.txt every iteration; post iteration, this file would hold the final answer. – Rakesh Sharma Jan 26 '20 at 16:32
  • (1) If you have 1000 (10³) files, why are you running the loop up to 10000 (10⁴)?  (2) If you have a `1.txt`, why are you starting your loop at `i=2`?  (3) If you have an `a.txt`, and also `1.txt` through `1000.txt`, then you have 1001 files, don’t you?  (4) The number “6” is present in the three files you presented (`a.txt`, `1.txt` and `2.txt`); why do you not expect it to be in your output? – G-Man Says 'Reinstate Monica' Mar 16 '20 at 19:19

5 Answers


First, know that for comm to report the lines common to multiple files correctly, its input files must be sorted; sort them first if they are not already.

Second, you need to change your mv command to mv final.txt a.txt in order to check the next file against the result of the previous step. Here I took a backup of a.txt and iterated over a copy, common.txt, in the for loop instead.

So your final script will be as below:

cp a.txt common.txt
for ((i=1; i<=1000; i++)); do
    comm -12 <(sort common.txt) <(sort "$i.txt") > temp.txt
    mv temp.txt common.txt
done

Finally, cat common.txt will show the lines that are common to all 1000 files.
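As a quick sanity check (my addition, not part of the answer), the same loop can be run by hand on the three sample files from the question, with the loop bound reduced to 2:

```shell
# Recreate the sample files from the question
printf '%s\n' 1 3 47 8 6 7 > a.txt
printf '%s\n' 2 3 6 7 8 > 1.txt
printf '%s\n' 3 5 6 7 9 > 2.txt

cp a.txt common.txt
for i in 1 2; do
    comm -12 <(sort common.txt) <(sort "$i.txt") > temp.txt
    mv temp.txt common.txt
done
cat common.txt    # prints 3, 6 and 7 (note that 6 is also common to all three)
```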

αғsнιη
  • You can also do the loop iteration like `for i in {1..10000}; do ...; done` – annahri Jan 25 '20 at 09:48
  • @annahri that way will fail if there are millions of files, because the shell must expand the brace first; check with `printf '%s\n' {1..1000000000}` → `bash: brace expansion: failed to allocate memory for 1000000000 elements`. Another issue is that it is slow, since the shell has to expand the whole brace before it can loop over the resulting elements. – αғsнιη Jan 25 '20 at 10:25

comm only works on sorted files; from its man page:

Compare sorted files FILE1 and FILE2 line by line.

source: https://linux.die.net/man/1/comm

So the algorithm will not work for unsorted files. This works:

#!/bin/sh

sort a.txt > tmp.txt    # comm needs input sorted in the same (lexical) order

END=1000

for i in $(seq 1 $END)
do
    sort "$i.txt" | comm -12 tmp.txt - > tmp2.txt
    mv tmp2.txt tmp.txt
done
cp tmp.txt final.txt

Note that writing the output straight back to tmp.txt (whether with > or |tee) would truncate the file while comm is still reading it, which is why the result goes to tmp2.txt first and is then moved into place.

ralf htp

Your issue is with overwriting your output in each iteration. The > redirection will truncate (empty) the file that you redirect into.
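A minimal illustration of that truncation (my addition; the file name is hypothetical):

```shell
printf 'first\n'  > out.txt
printf 'second\n' > out.txt   # the second > truncates out.txt before writing
cat out.txt                   # prints only: second
```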

Another approach: run each file through sort -u so that every entry appears at most once per file, concatenate the results, and pass them through sort | uniq -c to count in how many files each entry appears. Then go through that result and pick out the entries whose count equals the number of files; these are the entries that occur in all files.

set -- ./*.txt

for file do
    sort -u "$file"
done | sort | uniq -c |
awk -v c="$#" '$1 == c { print $2 }'

I'm making use of the positional parameters here, setting them to the list of files that we'd like to iterate over with set -- ./*.txt. I do this so that we can use $# later, which is the number of files in the list.
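A quick sketch of what those positional parameters give us (my addition; run in a hypothetical scratch directory with three .txt files):

```shell
dir=$(mktemp -d) && cd "$dir"   # hypothetical scratch directory
touch a.txt 1.txt 2.txt
set -- ./*.txt
echo "$#"    # number of matched files: 3
echo "$@"    # ./1.txt ./2.txt ./a.txt
```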

For the first three datasets that you show, this would output

3
6
7
Kusalananda

Assuming each number can only appear once in a file:

$ awk '{c[$1]++} END{for (i in c) if (c[i] == (ARGC-1)) print i}' a.txt {1..2}.txt
3
6
7
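If a number may repeat inside a file, a variant (my addition, not part of the answer above) that counts each value at most once per file by keying the seen array on FILENAME could look like:

```shell
# Sample files from the question
printf '%s\n' 1 3 47 8 6 7 > a.txt
printf '%s\n' 2 3 6 7 8 > 1.txt
printf '%s\n' 3 5 6 7 9 > 2.txt

awk '!seen[FILENAME, $1]++ { c[$1]++ }        # count each value once per file
     END { for (i in c) if (c[i] == ARGC-1) print i }' a.txt 1.txt 2.txt
```

The output order of `for (i in c)` is unspecified; pipe through sort -n if a sorted result is needed.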
Ed Morton

Besides other issues, your script loop calls comm 10000 times, making it very slow. A faster alternative is to sort all the files together and count the repeats: the lines whose count equals the number of files exist in all of them (provided values are not repeated inside each file):

set -- ./*.txt
sort -n "$@" | uniq -c | awk -v count="$#" '$1 == count { print $2 }'

I am using the positional parameters to get both the list of files ("$@") and the count of files ($#).

The sort is numeric (-n) since you are working with numbers.

You could check (and sort) that no file has repeated numbers with:

set -- ./*.txt

for f; do
    sort -n "$f" > "$f.tempfile"
    mv "$f.tempfile" "$f"
    if [ "$(uniq -d "$f")" != "" ]; then echo "$f"; fi
done

This will list all the files that have repeated numbers, and it sorts each individual file in place as a side effect.
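As a small sanity check of the uniq -d test (my addition; the file names are hypothetical):

```shell
printf '%s\n' 3 3 7 > dup.txt      # contains a repeated number
printf '%s\n' 3 7   > nodup.txt
for f in dup.txt nodup.txt; do
    sort -n "$f" > "$f.tempfile"
    mv "$f.tempfile" "$f"
    if [ "$(uniq -d "$f")" != "" ]; then echo "$f"; fi
done                               # prints only: dup.txt
```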