6

I have two files: 'file1' has employee ID numbers, 'file2' has the complete database of the employees. Here is what they look like:

  • file1
    123123
    222333
    
  • file2
    111222 Jones Sally
    111333 Johnson Roger
    123123 Doe John
    444555 Richardson George
    222333 Smith Jane
    223456 Alexander Philip
    

I want to compare the two files and eliminate the entries from file2 that have ID numbers in file1.

I found this awk command which works perfectly:

awk 'FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3

The result:

  • file3
    111222 Jones Sally
    111333 Johnson Roger
    444555 Richardson George
    223456 Alexander Philip
    

So this works as expected.

My problem is that the files are actually simplified .csv files, and I must use a comma as a separator rather than a space. I have tried everything I can think of to make this work (i.e -F, , -F',' , -F"," everywhere in the command) and no success.

How do I get this to work with .csv files?

By the way, I am on MacBook Pro, OSX Lion!

AdminBee
  • 21,637
  • 21
  • 47
  • 71
pgrason
  • 61
  • 1
  • 1
  • 5

4 Answers4

5

I'm assuming your csv files are something like:

File1

123123,,
222333,,

File2

111222,Jones,Sally
111333,Johnson,Roger
123123,Doe,John
444555,Richardson,George
222333,Smith,Jane
223456,Alexander,Philip

You could try using the join command, like so:

# join -t, -v 2 <(sort file1) <(sort file2)
111222,Jones,Sally
111333,Johnson,Roger
223456,Alexander,Philip
444555,Richardson,George

More information about the command can be found here: man join

join [OPTION]... FILE1 FILE2

-t CHAR
    use CHAR as input and output field separator 
-v FILENUM
    like -a FILENUM, but suppress joined output lines 
devnull
  • 5,331
  • 21
  • 36
  • "join" works, thanks. However, sometimes I want to use a different field in the files. So maybe the "awk" is better. – pgrason Feb 13 '15 at 17:28
  • @pgrason Define 'fields', if there is a common field in both, join should always work. – devnull Feb 13 '15 at 19:10
2

Try this:

awk 'BEGIN{FS=","};FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3
Stephen Kitt
  • 411,918
  • 54
  • 1,065
  • 1,164
R Netzlof
  • 21
  • 1
  • The idea is okay, but a code snippet-only answer is not. – peterh Feb 13 '15 at 06:18
  • This works the way I want. I can chose which field to use as the key. Thanks. Will there be any problems with very large file1 & file2? – pgrason Feb 13 '15 at 17:32
  • I just tried this command on two large .csv files and it worked just as I wanted. Thanks! – pgrason Feb 13 '15 at 18:42
  • @pgrason Is this the way you solved your problem? Then please accept the answer, so others know what worked for you. – Matthias B Aug 23 '18 at 09:30
1

You can also try the following Python2 solution:

#!/usr/bin/env python2
import csv
with open('file_1') as f1:
    file_1_list = [line[0] for line in csv.reader(f1)]
with open('file_2') as f2:
    for line in csv.reader(f2):
        if line[0] not in file_1_list:
            print ' '.join(line)
heemayl
  • 54,820
  • 8
  • 124
  • 141
0

Add the -F',' which tells awk that there is a comma delimiter or if you are using |. use -F'|'

so your script should look like this now

awk -F',' 'FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3
Anthon
  • 78,313
  • 42
  • 165
  • 222