
I am using Solaris 10 and so grep options involving -f don't work.

I have two pipe-separated files:

file1:

abc|123|BNY|apple|
cab|234|cyx|orange|
def|kumar|pki|bird|

file2:

abc|123|
kumar|pki|
cab|234

I would like to compare the first two columns of file2 against file1 (searching the entire contents of file1 for the first two columns of each file2 line). If they match, print the matching line of file1, then move on to the next line of file2, and so on.

Expected Output:

abc|123|BNY|apple|
cab|234|cyx|orange|

The files I have are huge, containing about 400,000 lines, so I would like to make the execution fast.

user68365
  • I removed the leading spaces from your examples, if you want it, please roll the edit back. Remember that spaces are significant, you should only have them if they exist in your actual files. – terdon Jun 06 '14 at 10:29
  • Try using the GNU version of `grep`, it's under `/usr/sfw/bin/ggrep`. http://stackoverflow.com/questions/15259882/grep-e-in-solaris-complaining-usr-xpg4-bin-grep – slm Jun 06 '14 at 10:55

4 Answers


This is what awk was designed for:

$ awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0' file2 file1
abc|123|BNY|apple|
cab|234|cyx|orange|

Explanation

  • -F'|' : sets the field separator to |.
  • NR==FNR : NR is the current input line number and FNR the current file's line number. The two will be equal only while the 1st file is being read.
  • c[$1$2]++; next : if this is the 1st file, use the concatenation of the first two fields as a key into the c array and increment its count. Then skip to the next line so that this block only runs on the 1st file.

  • c[$1$2]>0 : this second pattern is only reached for the second file (the next above skips it for the first), so here we check whether fields 1 and 2 of the current line were already seen in file2 (c[$1$2]>0). In awk, the default action is to print the line, so if c[$1$2]>0 is true, the line will be printed.
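One caveat, noted here as an aside rather than as part of the original answer: concatenating `$1$2` without a separator means that, say, `ab` + `c123` and `abc` + `123` produce the same key `abc123`. Putting the field separator back into the key (`$1 FS $2`) avoids such collisions. A minimal sketch with the sample data:

```shell
# Hypothetical variant: keying on $1 FS $2 so "ab|c123" and "abc|123"
# cannot collapse into the same key "abc123"; the `in` test also avoids
# creating spurious array entries for unseen keys.
printf 'abc|123|BNY|apple|\ncab|234|cyx|orange|\ndef|kumar|pki|bird|\n' > file1
printf 'abc|123|\nkumar|pki|\ncab|234\n' > file2
awk -F'|' 'NR==FNR{c[$1 FS $2]; next} ($1 FS $2) in c' file2 file1
# prints:
# abc|123|BNY|apple|
# cab|234|cyx|orange|
```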


Alternatively, since you tagged with Perl:

perl -e 'open(A, "file2"); while(<A>){/.+?\|[^|]+/ && $k{$&}++};
         while(<>){/.+?\|[^|]+/ && do{print if defined($k{$&})}}' file1

Explanation

The first line will open file2, read everything up to the 2nd | (.+?\|[^|]+) and save that (the $& is the result of the last match operator) in the %k hash.

The second line processes file1, uses the same regex to extract the 1st two columns and print the line if those columns are defined in the %k hash.


Both of the above approaches need to hold the first two columns of file2 in memory. That shouldn't be a problem with only a few hundred thousand lines, but if it is, you could do something like

cut -d'|' -f 1,2 file2 | while read pat; do grep "^$pat" file1; done

But that will be slower.
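A small caveat with that loop (an observation, not part of the original answer): since the pattern is not anchored on the right, `^abc|123` would also match a line starting with `abc|1234|`. Appending the separator to the pattern tightens the match; the `zzz|plum` line below is invented to illustrate:

```shell
# Sketch: "^$pat|" requires file1's second field to end exactly where
# file2's second field ends, so abc|1234 no longer matches abc|123.
printf 'abc|123|BNY|apple|\nabc|1234|zzz|plum|\n' > file1
printf 'abc|123|\n' > file2
cut -d'|' -f 1,2 file2 | while IFS= read -r pat; do
    grep "^$pat|" file1
done
# prints only: abc|123|BNY|apple|
```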

terdon
  • But won't this load all (the first two columns) of `file2` into memory? – Joseph R. Jun 06 '14 at 10:44
  • @terdon: `awk -F'|' 'NR==FNR{c[$1$2]++;next};c[$1$2] > 0'` is shorter version. – cuonglm Jun 06 '14 at 10:46
  • it doesn't work… – user68365 Jun 06 '14 at 10:47
  • @user68365: Does `file2` have duplicate rows? – cuonglm Jun 06 '14 at 10:48
  • No, it doesn't have any duplicate rows – user68365 Jun 06 '14 at 10:52
  • @user68365 please explain _how_ it doesn't work. Also please show us an example of your _exact_ input file. I am guessing what you have is different from what you've shown. – terdon Jun 06 '14 at 10:55
  • @Gnouc damn! Thanks, I should have thought of `next`. – terdon Jun 06 '14 at 10:55
  • @JosephR. yes but that's by far the fastest way. The `cut` approach avoids that but it will be slower. Anyway, a few hundred thousand lines is not that much. – terdon Jun 06 '14 at 10:58
  • @terdon The perl one works well! It gives the desired output. Can you explain it? – user68365 Jun 06 '14 at 11:02
  • @terdon: when I run the awk it prints nothing – user68365 Jun 06 '14 at 11:02
  • @user68365 I added an explanation of the Perl one. I can't tell you why the awk failed unless you show me your actual data. They should both give the same output. – terdon Jun 06 '14 at 11:09
  • Is there an easy way to edit the awk command to print lines that do not match? – cosmictypist Jan 22 '16 at 13:27
  • Is it simply `!c[$1$2]` at the end? – cosmictypist Jan 22 '16 at 13:30
  • @stellar01 the trick here is that we're using the values of the 1st and 2nd field concatenated as the key of an associative array. Each time a particular combination of 1st and 2nd field is seen, the value associated with them in the array is incremented by one. Therefore, to get the ones that only appear in one of the two files (assuming you can never have the same 1st and 2nd field on multiple lines of the same file), you want the ones whose value is `0`. So, replace `c[$1$2] > 0` with `c[$1$2] == 0`. – terdon Jan 22 '16 at 13:47

I think

grep -Ff file2 file1

is what you're looking for. It should be efficient, but I'm not sure it will be as accurate as you want. If abc|123 (for example) is found in a line in file1 in different columns, that line will be printed as well. If you can guarantee that this will never happen, the above line should work.
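To make that caveat concrete (the sample line here is invented for illustration): if a file2 value happens to appear in later columns of a file1 line, the fixed-string match still prints it:

```shell
# Demo of the false positive: "abc|123|" occurs mid-line, not in the
# first two columns, yet grep -F matches it anywhere on the line.
printf 'xyz|abc|123|pear|\n' > f1
printf 'abc|123|\n' > f2
grep -Ff f2 f1
# prints: xyz|abc|123|pear|
```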

Joseph R.
  • Grep wouldn't be sufficient, since abc|123 may be present elsewhere in the file. Moreover, I am using Solaris 10 and I am unable to use that grep option. – user68365 Jun 06 '14 at 10:29
  • @user68365 please clarify all this in your question. You need to tell us your OS and specify that you only want to match the first 2 columns. – terdon Jun 06 '14 at 10:32

If you would like to think about the problem in an SQL-like way, you should definitely try a tool named 'q':

$ q -d '|' "select f1.* from file1 f1 join file2 f2 on (f1.c1 = f2.c1 and f1.c2 = f2.c2)"

It is clearer and easier to understand if you're familiar with SQL queries.

Vincent
$ sed 's/^/^/' file2 > temp.txt; grep -f temp.txt file1
abc|123|BNY|apple|
cab|234|cyx|orange|
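For reference, an end-to-end sketch of this approach with the question's sample data; it assumes a grep that supports -f (GNU grep, or /usr/xpg4/bin/grep on Solaris 10, since the default Solaris grep does not):

```shell
# Anchor every file2 line to the start of a file1 line, then match
# all patterns in a single grep pass.
printf 'abc|123|BNY|apple|\ncab|234|cyx|orange|\ndef|kumar|pki|bird|\n' > file1
printf 'abc|123|\nkumar|pki|\ncab|234\n' > file2
sed 's/^/^/' file2 > temp.txt
grep -f temp.txt file1
# prints:
# abc|123|BNY|apple|
# cab|234|cyx|orange|
```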
mr_tron