
I have two files on my Linux machine. The first, "list.txt", contains a list of 2649 objects. The second, "list_interactors.txt", contains a shorter list of 719 of those objects, with some associated variables in additional columns. I would like to obtain a list of all 2649 objects, with the associated variables added for the objects that appear in "list_interactors.txt".

Example:

file list.txt

6tyr_A_002__________
7yer_2_009__________
3erf_1_001__________
2dr5_D_2-3__________

file list_interactors.txt

6tyr_A_002__________    6tyr1_B    QRT54R   AAAAA
3erf_1_001__________    3erf2_B    QAEF6R   XXXXX

output.txt

6tyr_A_002__________    6tyr1_B    QRT54R   AAAAA
7yer_2_009__________
3erf_1_001__________    3erf2_B    QAEF6R   XXXXX
2dr5_D_2-3__________

I'm not very experienced with programming. I tried to use grep with this command:

grep -f list.txt list_interactors.txt

but the output is the same as the file "list_interactors.txt".

Could you help me please?

terdon
Tommaso
    Probably the tool you are looking for is `join`, not `grep`. Check the [man page](https://linux.die.net/man/1/join) – Francesco May 26 '20 at 08:37
    The behavior of `grep` you see is because the `-f` option takes _matching rules_ (=filtering rules) from the file. In the end, your command says "print all lines in `list_interactors.txt` that contain one of the strings in `list.txt`" (which in your case is _every_ line in `list_interactors.txt`). – AdminBee May 26 '20 at 09:13
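
To see the effect described in the comment above, here is a small sketch that recreates the two example files from the question in a temporary directory and runs the same `grep -f` command. The temporary directory and the variable names `workdir` and `grep_out` are made up for illustration.

```shell
#!/bin/sh
# Recreate the sample files from the question in a scratch directory.
workdir=$(mktemp -d)

printf '%s\n' \
    '6tyr_A_002__________' '7yer_2_009__________' \
    '3erf_1_001__________' '2dr5_D_2-3__________' > "$workdir/list.txt"

# The format string is reused for each group of four arguments.
printf '%s    %s    %s   %s\n' \
    '6tyr_A_002__________' '6tyr1_B' 'QRT54R' 'AAAAA' \
    '3erf_1_001__________' '3erf2_B' 'QAEF6R' 'XXXXX' > "$workdir/list_interactors.txt"

# Each line of list.txt is used as a pattern; grep prints the lines of
# list_interactors.txt that match at least one of those patterns.
grep_out=$(grep -f "$workdir/list.txt" "$workdir/list_interactors.txt")
printf '%s\n' "$grep_out"
```

Only lines of `list_interactors.txt` can ever be printed this way, so the objects that appear solely in `list.txt` never show up in the output.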

4 Answers

$ join -a 1  <( sort list.txt ) <( sort list_interactors.txt )
2dr5_D_2-3__________
3erf_1_001__________ 3erf2_B QAEF6R XXXXX
6tyr_A_002__________ 6tyr1_B QRT54R AAAAA
7yer_2_009__________

This uses `join` to do a relational JOIN operation between the two files. The first field will be used as the join key by default.

The `-a 1` option makes `join` output all lines in the first file, even if there is no match in the second file (it does a "left join").

The input data to `join` needs to be sorted, and we do this by calling `sort` on each file individually in two process substitutions on the command line. You could also opt for pre-sorting the files.

If your data is tab-delimited, you may want to add `-t $'\t'` to the start of the `join` command's arguments. This would make the output retain the existing tab delimiters.

Redirect the output by adding `>output.txt` to the end of the command if you want to store it in a file.
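
As a sketch of the pre-sorting variant mentioned above, the following avoids process substitution, so it should also run in a plain POSIX shell. It assumes tab-delimited data; the temporary directory, the `workdir` and `tab` variables, and the `*.sorted` file names are made up for illustration.

```shell
#!/bin/sh
# Scratch directory and a literal tab character for the -t options.
workdir=$(mktemp -d)
tab=$(printf '\t')

# Sample data modelled on the question, written tab-delimited (an assumption).
printf '%s\n' \
    '6tyr_A_002__________' '7yer_2_009__________' \
    '3erf_1_001__________' '2dr5_D_2-3__________' > "$workdir/list.txt"
printf '%s\t%s\t%s\t%s\n' \
    '6tyr_A_002__________' '6tyr1_B' 'QRT54R' 'AAAAA' \
    '3erf_1_001__________' '3erf2_B' 'QAEF6R' 'XXXXX' > "$workdir/list_interactors.txt"

# Pre-sort both inputs on the join field, using the same locale as join.
LC_ALL=C sort -t "$tab" -k 1,1 "$workdir/list.txt" > "$workdir/list.sorted"
LC_ALL=C sort -t "$tab" -k 1,1 "$workdir/list_interactors.txt" > "$workdir/interactors.sorted"

# Left join on the first field, keeping tabs as the output delimiter,
# and store the result in output.txt.
LC_ALL=C join -t "$tab" -a 1 \
    "$workdir/list.sorted" "$workdir/interactors.sorted" > "$workdir/output.txt"

cat "$workdir/output.txt"
```

Running `sort` and `join` under the same locale (here `LC_ALL=C`) keeps `join` from complaining that its input is not sorted.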

Kusalananda

If you want to keep the original order of list.txt, you can use awk:

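awk '
    # First file (list_interactors.txt): store each full line, keyed by its first field
    FNR==NR {s[$1]=$0}
    # Second file (list.txt): print the stored line if the key is known, else the line itself
    FNR!=NR {if($1 in s) print s[$1]; else print $0}
' list_interactors.txt list.txt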

Output:

6tyr_A_002__________    6tyr1_B    QRT54R   AAAAA
7yer_2_009__________
3erf_1_001__________    3erf2_B    QAEF6R   XXXXX
2dr5_D_2-3__________
pLumo
$ awk 'NR==FNR{a[$1]=$0; next} {print ($1 in a ? a[$1] : $0)}' list_interactors.txt list.txt
6tyr_A_002__________    6tyr1_B    QRT54R   AAAAA
7yer_2_009__________
3erf_1_001__________    3erf2_B    QAEF6R   XXXXX
2dr5_D_2-3__________
Ed Morton

A Perl one-liner can also do this (note that the output is sorted by key rather than kept in the order of list.txt):

$ perl -ane ' { chomp;$s{$F[0]}=$_; } END { print "$s{$_}\n" for sort(keys(%s))  }' list.txt list_interactors.txt 
2dr5_D_2-3__________
3erf_1_001__________    3erf2_B    QAEF6R   XXXXX
6tyr_A_002__________    6tyr1_B    QRT54R   AAAAA
7yer_2_009__________