Deduplication with awk on command line and script

Question

I have a file that has the following format:

487422,Potenza
487386,Forlì-Cesena
487399,Grosseto
487425,Catanzaro
487409,Napoli
487446,Prato
495498,Fermo
487425,Catanzaro
487389,Macerata
487442,Biella
487351,Asti
487424,Cosenza
487404,Roma
487359,Como
487404,Roma
487401,Terni
487420,Brindisi
487397,Arezzo
487348,Vercelli
487382,Modena
487356,Genova
487365,Cremona
487369,Verona
487386,Forlì-Cesena

As you can see, it is a comma-separated text with duplicates. I would like to deduplicate the text with respect to column 1 using awk.

Command line

If I run awk directly on the command line, this is what I get:

487422,Potenza
487386,Forlì-Cesena
487399,Grosseto
487425,Catanzaro
487409,Napoli
487446,Prato
495498,Fermo
487389,Macerata
487442,Biella
487351,Asti
487424,Cosenza
487404,Roma
487359,Como
487401,Terni
487420,Brindisi
487397,Arezzo
487348,Vercelli
487382,Modena
487356,Genova
487365,Cremona
487369,Verona

which is what I would expect from the following command:

awk -F"," '!a[$1]++' filename.csv

Awk script

If I instead put the same logic in an awk script, written as follows:

#!/bin/awk -f

BEGIN {
    FS=","
}
{
    {!a[$1]++}
}

I do not get any output. Is there something wrong with the script? Why is the behaviour different between the script and the command line?

@EdMorton I am aware of the fact that I could use sort, but I would like a solution that preserves the order and that can also be run in optimal time over thousands of records. Thanks anyways! — shoyip, Oct 24 '21 at 21:44
Do you mean that actually what awk is also doing under the hood is the same kind of operation? Because somehow doing this wrt piping sort and uniq takes a bit less time. — shoyip, Oct 24 '21 at 21:59
No, it's a different operation but sort is designed to efficiently handle large amounts of data, using demand paging as necessary, while awk is storing every key value in memory to be able to do a hash lookup of every key read and so getting slower and more likely to fail as the number of keys increases. Awk will be fine if you only have thousands of records but once you get into millions a decorate/sort/undecorate approach would probably be faster and when you get into the billions you might find you need that to be able to do the processing at all. — Ed Morton, Oct 24 '21 at 22:09
I added a decorate/sort/undecorate answer so you can see what that looks like. — Ed Morton, Oct 24 '21 at 22:21
@EdMorton, I've tried looking up "unix sort demand paging" but nothing comes up. I'd like to read up on how this works, can you point to some resources? — Zach Young, Nov 05 '21 at 03:24
@ZachYoung Google "Unix sort large files" and you'll see results like http://vkundeti.blogspot.com/2008/03/tech-algorithmic-details-of-unix-sort.html and https://unix.stackexchange.com/a/279099/133219. — Ed Morton, Nov 05 '21 at 16:01

score 4 · Accepted Answer · answered Oct 24 '21 at 19:36

Outside of braces, !a[$1]++ is a condition, which triggers the default action {print} if it evaluates true (non-zero).

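Inside braces, {{!a[$1]++}} is an action: it runs for every record and still increments a[$1], but the value of the expression is discarded and nothing is printed. Remove the braces so that the expression is a pattern again: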

#!/bin/awk -f

BEGIN {
    FS=","
}

!a[$1]++
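To make the pattern/action difference concrete, here is a minimal sketch that runs the same test three ways. The three-record sample and the shell variable names are made up for illustration; they are not from the question.

```shell
# Hypothetical sample: the third record repeats the key of the first.
input='1,a
2,b
1,a'

# Bare pattern, no action: awk falls back to the default action { print }.
as_pattern=$(printf '%s\n' "$input" | awk -F, '!a[$1]++')

# Same test inside an action: the print has to be written explicitly.
as_action=$(printf '%s\n' "$input" | awk -F, '{ if (!a[$1]++) print }')

# Expression alone inside braces: a[$1] is incremented, nothing is printed.
no_output=$(printf '%s\n' "$input" | awk -F, '{ !a[$1]++ }')

printf 'pattern:\n%s\n' "$as_pattern"
printf 'action:\n%s\n' "$as_action"
printf 'braces only: [%s]\n' "$no_output"
```

The first two forms print the same deduplicated records, while the third prints nothing, which matches the behaviour of the script in the question.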

steeldriver


Thanks, that works! Once the unique values in column 1 have been found, how would I print only columns 1, 3 and 4? How do I combine this with the print statement, and with an if condition on a column (for example, $2 contains the string IT)? – shoyip Oct 24 '21 at 21:50
@shoyip something like `!a[$1]++ {print $1,$3,$4}` or `!a[$1]++ && $2 ~ /IT/ {print $1,$3,$4}` – steeldriver Oct 25 '21 at 00:21
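Two details are worth checking when trying the one-liners from the comment above. With the default output separator, print $1,$3,$4 joins the fields with spaces, so OFS needs to be set to get commas back. The order of the two tests also matters, because !a[$1]++ marks the key as seen even when the $2 test then fails. A small sketch follows, assuming a made-up four-column layout (id,country,city,region) that is not in the original question:

```shell
# Hypothetical 4-column sample: id,country,city,region.
input='1,FR,Paris,IDF
1,IT,Roma,Lazio
2,IT,Napoli,Campania
2,IT,Napoli,Campania'

# Dedup test first: id 1 is marked as seen by the FR row, so the
# later IT row with the same id is never printed.
dedup_first=$(printf '%s\n' "$input" |
  awk -F, -v OFS=, '!a[$1]++ && $2 ~ /IT/ { print $1, $3, $4 }')

# Filter test first: && short-circuits, so only rows whose $2
# contains IT ever mark a key as seen.
filter_first=$(printf '%s\n' "$input" |
  awk -F, -v OFS=, '$2 ~ /IT/ && !a[$1]++ { print $1, $3, $4 }')

printf 'dedup first:\n%s\n' "$dedup_first"
printf 'filter first:\n%s\n' "$filter_first"
```

Which order is right depends on whether "unique" should mean unique among all records or unique among the records that pass the filter.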

score 2 · Answer 2 · answered Oct 24 '21 at 22:21

@steeldriver's awk answer is correct and is probably all you need. If your input gets massive, though, it may run out of memory and/or get relatively slow; in that case here's a decorate/sort/undecorate approach that will continue to work:

nl -w1 -s, file |       # Decorate by prefixing with line numbers
sort -ut, -k2,2 |       # Sort uniquely by the real key field
sort -nt, -k1,1 |       # Sort what's left by the line numbers we added
cut -d, -f2-            # Undecorate by removing the line numbers
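As a quick sanity check, the pipeline can be run next to the awk one-liner on a small sample. This is only a sketch: the six records are a made-up subset in the same format as the question's file, and the variable names are illustrative. Note that POSIX does not say which line sort -u keeps from a group with equal keys, so an exact match in order relies on the sort implementation keeping the first one; the sketch reports whether the two results agree rather than assuming it.

```shell
# Hypothetical sample in the question's id,name format, with two repeated keys.
input='487422,Potenza
487425,Catanzaro
487404,Roma
487425,Catanzaro
487351,Asti
487404,Roma'

# In-memory approach: one array entry per distinct key.
awk_out=$(printf '%s\n' "$input" | awk -F, '!a[$1]++')

# Decorate/sort/undecorate: after nl, the real key is field 2.
dsu_out=$(printf '%s\n' "$input" |
  nl -w1 -s, |
  sort -ut, -k2,2 |
  sort -nt, -k1,1 |
  cut -d, -f2-)

# Report whether both approaches produced the same lines in the same order.
if [ "$awk_out" = "$dsu_out" ]; then
  echo "same records, same order"
else
  echo "outputs differ; compare them by hand"
fi
```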

Thanks! This would be a very nice trick in order to preserve order while selecting unique entries. — shoyip, Nov 01 '21 at 18:42
