
I have a file that has entries in key: value format, as shown below:

cat data.txt

name: 'tom'
tom_age: '31'
status_tom_mar: 'yes'
school: 'anne'
fd_year_anne: '1987'
name: 'hmz'
hmz_age: '21'
status_hmz_mar: 'no'
school: 'svp'
fd_year_svp: '1982'
name: 'toli'
toli_age: '41'

and likewise ...

I need to print the key: value pairs such that each duplicated key appears as a single entry.

The command below gets me the duplicate keys:

awk '{ print $1 }' data.txt | sort | uniq -d
name:
school:

However, I want output in which the values of duplicate keys are concatenated on a single line.

Expected output:

name: ['tom', 'hmz', 'toli']
school: ['anne', 'svp']
tom_age: '31'
status_tom_mar: 'yes'
fd_year_anne: '1987'
hmz_age: '21'
status_hmz_mar: 'no'
fd_year_svp: '1982'
toli_age: '41'

Can you please suggest?

jubilatious1
Ashar
  • Is this a YAML file that you are manipulating? If so use a syntax aware parser like yq – Inian Mar 23 '22 at 10:33
  • Yes, yaml. Can you share a command for yq that will help meet the requirement? Never heard of it before – Ashar Mar 23 '22 at 10:38
  • But this is not valid `yaml`. – pLumo Mar 23 '22 at 10:39
  • Correct, but I need to convert this to a valid yaml. The file data.txt is not in our control – Ashar Mar 23 '22 at 10:40
  • but your output is also not valid yaml. – pLumo Mar 23 '22 at 10:40
  • It's a variable file that is read by ansible and works alright – Ashar Mar 23 '22 at 10:42
  • Your `hmz_age` and `fd_year_anne` are duplicated in the output. – pLumo Mar 23 '22 at 10:53
  • @pLumo that was an oversight. I made the corrections in the original post – Ashar Mar 23 '22 at 10:59
  • I'm not supposed to care about what you want, but it's hard to ignore that `_age`, `status__mar` and `fd_year_` cry out for a different record structure. – Philippos Mar 23 '22 at 13:10
  • @Philippos what makes you think you're "not supposed to care about what you want"? Sometimes (often, even) the right answer to a problem involves pointing out **better** ways of doing things.....and the data samples shown here are, as you say, crying out for [normalisation](https://en.wikipedia.org/wiki/Database_normalization). And if whatever generates these pseudo-yaml files can't be fixed, then any tool that has to process them absolutely **should** attempt to do that. Even if the OP explicitly asks for garbage input data to be converted into another equally-garbage output format. – cas Mar 24 '22 at 11:16
  • I feel embarrassed reading the last line :) @cas – Ashar Mar 24 '22 at 12:43
  • Sorry, I don't mean to cause embarrassment, but it's true: the output format you asked for is no better than the input. It should be converted into some kind of record format, where each record has the same fields - name, age, marital status, fd year, etc instead of having unique keys (like hmz_age or fd_year_anne) randomly scattered throughout the data (probably grown into a horrible mess with random additions rather than designed). Depending on what you're going to use the data for, each record should probably also have some kind of unique identifier field (and names certainly aren't unique) – cas Mar 24 '22 at 13:00
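The record format cas describes might look like this as proper YAML (a sketch only; the field names are illustrative, not taken from the original data):

```yaml
# Hypothetical normalised form: one record per person,
# with every record carrying the same fields.
people:
  - name: 'tom'
    age: '31'
    married: 'yes'
    school: 'anne'
    fd_year: '1987'
  - name: 'hmz'
    age: '21'
    married: 'no'
    school: 'svp'
    fd_year: '1982'
  - name: 'toli'
    age: '41'
```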

5 Answers


In awk:

$ awk -F': ' '
{
    count[$1]++; 
    data[$1] = $1 in data ? data[$1]", "$2 : $2 
} 
END { 
    for (id in count) { 
        printf "%s: ",id; 
        print (count[id]>1 ? "[ "data[id]" ]" : data[id])
    }
}' data.txt 
hmz_age: '21'
tom_age: '31'
fd_year_anne: '1987'
school: [ 'anne', 'svp' ]
name: [ 'tom', 'hmz', 'toli' ]
toli_age: '41'
fd_year_svp: '1982'
status_hmz_mar: 'no'
status_tom_mar: 'yes'

A Perl approach:

$ perl -F: -lane 'push @{$k{$F[0]}},$F[1]; 
        END{ 
            for $key (keys(%k)){ 
                $data=""; 
                if(scalar(@{$k{$key}})>1){ 
                    $data="[" . join(",",@{$k{$key}}) . "]"; 
                } 
                else{
                    $data=${$k{$key}}[0];
                }
                print "$key: $data"
            }
        }' data.txt 
status_tom_mar:  'yes'
fd_year_anne:  '1987'
tom_age:  '31'
toli_age:  '41'
fd_year_svp:  '1982'
hmz_age:  '21'
school: [ 'anne', 'svp']
name: [ 'tom', 'hmz', 'toli']
status_hmz_mar:  'no'

Or, perhaps a bit easier to understand:

perl -F: -lane '@fields=@F; 
                push @{$key_hash{$fields[0]}},$fields[1]; 
                END{ 
                    for $key (keys(%key_hash)){ 
                        $data=""; 
                        @key_data=@{$key_hash{$key}};
                        if(scalar(@key_data)>1){ 
                           $data="[" . join(",", @key_data) . "]"; 
                        } 
                        else{
                            $data=$key_data[0]
                        }
                        print "$key: $data"
                    }
                }' data.txt 
U. Windl
terdon
  • I don't have perl. Can I have a non-perl solution please? – Ashar Mar 23 '22 at 11:46
  • @Ashar I added an awk approach. But you don't have perl? What operating system is this? AIX? Please always mention your OS in the question because we need to know what tools and what versions of the tools are available. – terdon Mar 23 '22 at 11:55

A short awk program will achieve this for you

awk -F': ' '
    # Every line of input; fields split at colon+space
    {
        # Append a comma if we have previous items
        if (h[$1] > "") { h[$1] = h[$1] ", " };

        # Append the item and increment the count
        h[$1] = h[$1] $2;
        i[$1]++
    }

    # Finally
    END {
        # Iterate across all the keys we have found
        for (k in h) {
            if (i[k] > 1) { p = "[%s]" } else { p = "%s" };
            printf "%s: " p "\n", k, h[k]
        }
    }
' data.txt

Output

hmz_age: '21'
tom_age: '31'
fd_year_anne: '1987'
school: ['anne', 'svp']
name: ['tom', 'hmz', 'toli']
status_hmz_mar: 'no'
status_tom_mar: 'yes'
fd_year_svp: '1982'
toli_age: '41'
roaima

In awk (requires GNU awk, for its arrays of arrays):

awk '
    # Store every value under its key, in order of appearance
    { arr[$1][length(arr[$1]) + 1] = $2 }

    END {
        for (i in arr) {
            printf "%s", i
            if (length(arr[i]) > 1) {
                xc = " ["
                for (rr in arr[i]) {
                    printf "%s%s", xc, arr[i][rr]
                    xc = ","
                }
                print "]"
            } else {
                print arr[i][length(arr[i])]
            }
        }
    }
' data.txt

Output:

hmz_age:'21'
fd_year_svp:'1982'
fd_year_anne:'1987'
name: ['tom','hmz','toli']
school: ['anne','svp']
status_tom_mar:'yes'
tom_age:'31'
toli_age:'41'
status_hmz_mar:'no'
K-attila-
    I would suggest that your answer would be greatly improved if you were to explain how it works. Not everyone will be able to understand your code – roaima Mar 23 '22 at 22:59
  • Only _GNU_ awk has true arrays of arrays (not the traditional [i,j,...] using SUBSEP) – dave_thompson_085 Mar 24 '22 at 03:13
  • Never do `printf foo` for any input data as it'll fail when that data contains printf formatting chars like `%s`, use `printf "%s", foo` instead. – Ed Morton Mar 26 '22 at 21:29
  • @dave_thompson_085 I think that `gawk` was implicit. This and a few others can apply `length` to an array. – DanieleGrassini Apr 08 '22 at 22:05
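Ed Morton's warning above is easy to demonstrate with a value that happens to contain a % sequence (a hypothetical value; the sample data here has none):

```shell
# Unsafe: the data itself becomes the printf format string,
# so any %s (etc.) inside it is interpreted rather than printed.
echo '100%s off' | awk '{ printf $0; print "" }'

# Safe: pass the data as an argument to a fixed format string.
echo '100%s off' | awk '{ printf "%s\n", $0 }'
# prints: 100%s off
```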

Using Raku (formerly known as Perl_6):

raku -e 'my %h; for lines() {%h.=append: .split(":").map(*.trim).hash}; .say for %h;' 

OR

raku -e 'my %h.=append: .split(":").map(*.trim).hash for lines; .say for %h;' 

With Raku, you have hash functionality built in (see the docs pages at bottom). Briefly, the code above reads lines, splits each on the ":" colon, trims whitespace from the two resulting elements, and generates a hash (i.e. a key-value pair). Each line's hash is then appended to the named %h (hash) object, and values are appropriately added to their respective keys.

Sample Input:

name: 'tom'
tom_age: '31'
status_tom_mar: 'yes'
school: 'anne'
fd_year_anne: '1987'
name: 'hmz'
hmz_age: '21'
status_hmz_mar: 'no'
school: 'svp'
fd_year_svp: '1982'
name: 'toli'
toli_age: '41'

Sample Output:

hmz_age => '21'
fd_year_svp => '1982'
status_tom_mar => 'yes'
fd_year_anne => '1987'
school => ['anne' 'svp']
status_hmz_mar => 'no'
tom_age => '31'
name => ['tom' 'hmz' 'toli']
toli_age => '41'

Once your data is in the %h object you can manipulate the output. Substituting .put for .say in the code above gives tab-separated (rather than =>-separated) output. Furthermore, you can pull out the values associated with an individual key like so (add the below as a final statement):

say %h<name>;'
['tom' 'hmz' 'toli']

https://docs.raku.org/language/hashmap
https://docs.raku.org/language/101-basics#Hashes

jubilatious1

Step 1

for i in $(awk -F ":" '{a[$1]++} END {for (x in a) print x, a[x]}' file.txt |
           awk '$NF > 1 {print $1}' | tac)
do
    if grep -q "^$i" file.txt; then
        awk -v i="$i" -F ":" '$1 == i {print $2}' file.txt |
            awk 'END {print "\n"} ORS=","' |
            sed 's/^,//; s/,$//' |
            awk -v i="$i" '{print i":["$0"]"}'
    else
        grep -v "^$i" file.txt
    fi
done > output.txt

Step 2

for i in $(awk -F ":" '{a[$1]++} END {for (x in a) print x, a[x]}' file.txt |
           awk '$NF == 1 {print $1}')
do
    awk -v i="$i" -F ":" '$1 == i' file.txt
done >> output.txt

Output

name: ['tom', 'hmz', 'toli']
school: ['anne', 'svp']
tom_age: '31'
status_tom_mar: 'yes'
fd_year_anne: '1987'
hmz_age: '21'
status_hmz_mar: 'no'
fd_year_svp: '1982'
toli_age: '41'
Praveen Kumar BS