Remove entries from one CSV file that are already present in another

Question

I have two files: 'file1' has employee ID numbers, 'file2' has the complete database of the employees. Here is what they look like:

file1
```
123123
222333
```

file2

111222 Jones Sally
111333 Johnson Roger
123123 Doe John
444555 Richardson George
222333 Smith Jane
223456 Alexander Philip

I want to compare the two files and eliminate the entries from file2 that have ID numbers in file1.

I found this awk command which works perfectly:

awk 'FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3

The result:

file3

111222 Jones Sally
111333 Johnson Roger
444555 Richardson George
223456 Alexander Philip

So this works as expected.

My problem is that the files are actually simplified .csv files, and I must use a comma as a separator rather than a space. I have tried everything I can think of to make this work (i.e -F, , -F',' , -F"," everywhere in the command) and no success.

How do I get this to work with .csv files?

By the way, I am on MacBook Pro, OSX Lion!

Did you have a space after `-F`? – Sparhawk Feb 13 '15 at 03:40 — Sparhawk, Feb 13 '15 at 03:40
Do you want commas in the input or output, or both? – Sparhawk Feb 13 '15 at 04:51 — Sparhawk, Feb 13 '15 at 04:51

score 5 · Answer 1 · answered Feb 13 '15 at 03:56

5

I'm assuming your csv files are something like:

File1

123123,,
222333,,

File2

111222,Jones,Sally
111333,Johnson,Roger
123123,Doe,John
444555,Richardson,George
222333,Smith,Jane
223456,Alexander,Philip

You could try using the join command, like so:

# join -t, -v 2 <(sort file1) <(sort file2)
111222,Jones,Sally
111333,Johnson,Roger
223456,Alexander,Philip
444555,Richardson,George

More information about the command can be found here: man join

join [OPTION]... FILE1 FILE2

-t CHAR
    use CHAR as input and output field separator 
-v FILENUM
    like -a FILENUM, but suppress joined output lines

answered Feb 13 '15 at 03:56

devnull

5,331
21
36

"join" works, thanks. However, sometimes I want to use a different field in the files. So maybe the "awk" is better. – pgrason Feb 13 '15 at 17:28
@pgrason Define 'fields', if there is a common field in both, join should always work. – devnull Feb 13 '15 at 19:10

score 2 · Answer 2 · edited Feb 13 '15 at 06:06

2

Try this:

awk 'BEGIN{FS=","};FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3

edited Feb 13 '15 at 06:06

Stephen Kitt

411,918
54
1,065
1,164

answered Feb 13 '15 at 05:29

R Netzlof

21
1

The idea is okay, but a code snippet-only answer is not. – peterh Feb 13 '15 at 06:18
This works the way I want. I can chose which field to use as the key. Thanks. Will there be any problems with very large file1 & file2? – pgrason Feb 13 '15 at 17:32
I just tried this command on two large .csv files and it worked just as I wanted. Thanks! – pgrason Feb 13 '15 at 18:42
@pgrason Is this the way you solved your problem? Then please accept the answer, so others know what worked for you. – Matthias B Aug 23 '18 at 09:30

score 1 · Answer 3 · answered Feb 13 '15 at 04:37

You can also try the following Python2 solution:

#!/usr/bin/env python2
import csv
with open('file_1') as f1:
    file_1_list = [line[0] for line in csv.reader(f1)]
with open('file_2') as f2:
    for line in csv.reader(f2):
        if line[0] not in file_1_list:
            print ' '.join(line)

score 0 · Answer 4 · edited Apr 20 '16 at 22:20

0

Add the -F',' which tells awk that there is a comma delimiter or if you are using |. use -F'|'

so your script should look like this now

awk -F',' 'FNR==NR{a[$1];next};!($1 in a)' file1 file2 > file3

edited Apr 20 '16 at 22:20

Anthon

78,313
42
165
222

answered Apr 20 '16 at 22:09

Frank Lu

1

Remove entries from one CSV file that are already present in another

4 Answers4

File1

File2