
I have a file that has entries in key: value format, as shown below:

cat data.txt

name: 'tom'
tom_age: '31'
status_tom_mar: 'yes'
school: 'anne'
fd_year_anne: '1987'
name: 'hmz'
hmz_age: '21'
status_hmz_mar: 'no'
school: 'svp'
fd_year_svp: '1982'
name: 'toli'
toli_age: '41'

and likewise ...

I need to print the key: value pairs such that each duplicated key appears as a single entry.

The command below gets me the duplicate keys:

awk '{ print $1 }' data.txt | sort | uniq -d
name:
school:

However, I want output in which the values of duplicate keys are concatenated on a single line.

Expected output:

name: ['tom', 'hmz', 'toli']
school: ['anne', 'svp']
tom_age: '31'
status_tom_mar: 'yes'
fd_year_anne: '1987'
hmz_age: '21'
status_hmz_mar: 'no'
fd_year_svp: '1982'
toli_age: '41'

Can you please suggest?

jubilatious1
Ashar
  • Is this a YAML file that you are manipulating? If so use a syntax aware parser like yq – Inian Mar 23 '22 at 10:33
  • Yes, yaml. Can you share a command for yq that will help meet the requirement? Never heard of it before – Ashar Mar 23 '22 at 10:38
  • But this is not valid `yaml`. – pLumo Mar 23 '22 at 10:39
  • Correct, but I need to convert this to a valid yaml. The file data.txt is not in our control – Ashar Mar 23 '22 at 10:40
  • but your output is also not valid yaml. – pLumo Mar 23 '22 at 10:40
  • It's a variable file that is read by ansible and works alright – Ashar Mar 23 '22 at 10:42
  • Your `hmz_age` and `fd_year_anne` are duplicated in the output. – pLumo Mar 23 '22 at 10:53
  • @pLumo that was an oversight. I made the corrections in the original post – Ashar Mar 23 '22 at 10:59
  • I'm not supposed to care about what you want, but it's hard to ignore that `_age`, `status__mar` and `fd_year_` cry out for a different record structure. – Philippos Mar 23 '22 at 13:10
  • @Philippos what makes you think you're "not supposed to care about what you want"? Sometimes (often, even) the right answer to a problem involves pointing out **better** ways of doing things.....and the data samples shown here are, as you say, crying out for [normalisation](https://en.wikipedia.org/wiki/Database_normalization). And if whatever generates these pseudo-yaml files can't be fixed, then any tool that has to process them absolutely **should** attempt to do that. Even if the OP explicitly asks for garbage input data to be converted into another equally-garbage output format. – cas Mar 24 '22 at 11:16
  • I feel embarrassed reading the last line :) @cas – Ashar Mar 24 '22 at 12:43
  • Sorry, I don't mean to cause embarrassment, but it's true: the output format you asked for is no better than the input. It should be converted into some kind of record format, where each record has the same fields - name, age, marital status, fd year, etc instead of having unique keys (like hmz_age or fd_year_anne) randomly scattered throughout the data (probably grown into a horrible mess with random additions rather than designed). Depending on what you're going to use the data for, each record should probably also have some kind of unique identifier field (and names certainly aren't unique) – cas Mar 24 '22 at 13:00
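The record format cas describes might look like this as proper YAML (a sketch only; the field names are illustrative, not taken from the original data):

```yaml
# Hypothetical normalised form: one record per person,
# with every record carrying the same fields.
people:
  - name: 'tom'
    age: '31'
    married: 'yes'
    school: 'anne'
    fd_year: '1987'
  - name: 'hmz'
    age: '21'
    married: 'no'
    school: 'svp'
    fd_year: '1982'
  - name: 'toli'
    age: '41'
```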

5 Answers


In awk:

$ awk -F': ' '
{
    count[$1]++; 
    data[$1] = $1 in data ? data[$1]", "$2 : $2 
} 
END { 
    for (id in count) { 
        printf "%s: ",id; 
        print (count[id]>1 ? "[ "data[id]" ]" : data[id])
    }
}' data.txt 
hmz_age: '21'
tom_age: '31'
fd_year_anne: '1987'
school: [ 'anne', 'svp' ]
name: [ 'tom', 'hmz', 'toli' ]
toli_age: '41'
fd_year_svp: '1982'
status_hmz_mar: 'no'
status_tom_mar: 'yes'

A Perl approach:

$ perl -F: -lane 'push @{$k{$F[0]}},$F[1]; 
        END{ 
            for $key (keys(%k)){ 
                $data=""; 
                if(scalar(@{$k{$key}})>1){ 
                    $data="[" . join(",",@{$k{$key}}) . "]"; 
                } 
                else{
                    $data=${$k{$key}}[0];
                }
                print "$key: $data"
            }
        }' data.txt 
status_tom_mar:  'yes'
fd_year_anne:  '1987'
tom_age:  '31'
toli_age:  '41'
fd_year_svp:  '1982'
hmz_age:  '21'
school: [ 'anne', 'svp']
name: [ 'tom', 'hmz', 'toli']
status_hmz_mar:  'no'

Or, perhaps a bit easier to understand:

perl -F: -lane '@fields=@F; 
                push @{$key_hash{$fields[0]}},$fields[1]; 
                END{ 
                    for $key (keys(%key_hash)){ 
                        $data=""; 
                        @key_data=@{$key_hash{$key}};
                        if(scalar(@key_data)>1){ 
                           $data="[" . join(",", @key_data) . "]"; 
                        } 
                        else{
                            $data=$key_data[0]
                        }
                        print "$key: $data"
                    }
                }' data.txt 
U. Windl
terdon
  • I don't have perl. Can I have a non-perl solution please? – Ashar Mar 23 '22 at 11:46
  • @Ashar I added an awk approach. But you don't have perl? What operating system is this? AIX? Please always mention your OS in the question because we need to know what tools and what versions of the tools are available. – terdon Mar 23 '22 at 11:55

A short awk program will achieve this for you

awk -F': ' '
    # Every line of input; fields split at colon+space
    {
        # Append a comma if we have previous items
        if (h[$1] > "") { h[$1] = h[$1] ", " };

        # Append the item and increment the count
        h[$1] = h[$1] $2;
        i[$1]++
    }

    # Finally
    END {
        # Iterate across all the keys we have found
        for (k in h) {
            if (i[k] > 1) { p = "[%s]" } else { p = "%s" };
            printf "%s: " p "\n", k, h[k]
        }
    }
' data.txt

Output

hmz_age: '21'
tom_age: '31'
fd_year_anne: '1987'
school: ['anne', 'svp']
name: ['tom', 'hmz', 'toli']
status_hmz_mar: 'no'
status_tom_mar: 'yes'
fd_year_svp: '1982'
toli_age: '41'
roaima

In awk (requires GNU awk, for its arrays of arrays):

awk '
    # Store every value under its key, in order of appearance
    { arr[$1][length(arr[$1]) + 1] = $2 }

    END {
        for (i in arr) {
            printf "%s", i
            if (length(arr[i]) > 1) {
                xc = " ["
                for (rr in arr[i]) {
                    printf "%s%s", xc, arr[i][rr]
                    xc = ","
                }
                print "]"
            } else {
                print arr[i][length(arr[i])]
            }
        }
    }
' data.txt

Output:

hmz_age:'21'
fd_year_svp:'1982'
fd_year_anne:'1987'
name: ['tom','hmz','toli']
school: ['anne','svp']
status_tom_mar:'yes'
tom_age:'31'
toli_age:'41'
status_hmz_mar:'no'
K-attila-
    I would suggest that your answer would be greatly improved if you were to explain how it works. Not everyone will be able to understand your code – roaima Mar 23 '22 at 22:59
  • Only _GNU_ awk has true arrays of arrays (not the traditional [i,j,...] using SUBSEP) – dave_thompson_085 Mar 24 '22 at 03:13
  • Never do `printf foo` for any input data as it'll fail when that data contains printf formatting chars like `%s`, use `printf "%s", foo` instead. – Ed Morton Mar 26 '22 at 21:29
  • @dave_thompson_085 I think that `gawk` was implicit. This and a few others can apply `length` to an array. – DanieleGrassini Apr 08 '22 at 22:05
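Ed Morton's warning above is easy to demonstrate with a value that happens to contain a % sequence (a hypothetical value; the sample data here has none):

```shell
# Unsafe: the data itself becomes the printf format string,
# so any %s (etc.) inside it is interpreted rather than printed.
echo '100%s off' | awk '{ printf $0; print "" }'

# Safe: pass the data as an argument to a fixed format string.
echo '100%s off' | awk '{ printf "%s\n", $0 }'
# prints: 100%s off
```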

Using Raku (formerly known as Perl_6):

raku -e 'my %h; for lines() {%h.=append: .split(":").map(*.trim).hash}; .say for %h;' 

OR

raku -e 'my %h.=append: .split(":").map(*.trim).hash for lines; .say for %h;' 

With Raku, you have hash functionality built in (see the docs pages at bottom). Briefly, the code above reads lines, splits each on the ":" colon, trims whitespace from the two resulting elements, and generates a hash (i.e. a key-value pair). Each line's hash is then appended to the named %h (hash) object, and values are appropriately added to their respective keys.

Sample Input:

name: 'tom'
tom_age: '31'
status_tom_mar: 'yes'
school: 'anne'
fd_year_anne: '1987'
name: 'hmz'
hmz_age: '21'
status_hmz_mar: 'no'
school: 'svp'
fd_year_svp: '1982'
name: 'toli'
toli_age: '41'

Sample Output:

hmz_age => '21'
fd_year_svp => '1982'
status_tom_mar => 'yes'
fd_year_anne => '1987'
school => ['anne' 'svp']
status_hmz_mar => 'no'
tom_age => '31'
name => ['tom' 'hmz' 'toli']
toli_age => '41'

Once your data is in the %h object you can manipulate the output. Substituting .put for .say in the code above gives tab-separated (rather than =>-separated) output. Furthermore, you can pull out the values associated with an individual key like so (add the below as a final statement):

say %h<name>;'
['tom' 'hmz' 'toli']

https://docs.raku.org/language/hashmap
https://docs.raku.org/language/101-basics#Hashes

jubilatious1

Step 1

for i in $(awk -F ":" '{a[$1]++} END {for (x in a) print x, a[x]}' file.txt |
           awk '$NF > 1 {print $1}' | tac)
do
    if grep -q "^$i" file.txt; then
        awk -v i="$i" -F ":" '$1 == i {print $2}' file.txt |
            awk 'END {print "\n"} ORS=","' |
            sed 's/^,//; s/,$//' |
            awk -v i="$i" '{print i":["$0"]"}'
    else
        grep -v "^$i" file.txt
    fi
done > output.txt

Step 2

for i in $(awk -F ":" '{a[$1]++} END {for (x in a) print x, a[x]}' file.txt |
           awk '$NF == 1 {print $1}')
do
    awk -v i="$i" -F ":" '$1 == i' file.txt
done >> output.txt

Output

name: ['tom', 'hmz', 'toli']
school: ['anne', 'svp']
tom_age: '31'
status_tom_mar: 'yes'
fd_year_anne: '1987'
hmz_age: '21'
status_hmz_mar: 'no'
fd_year_svp: '1982'
toli_age: '41'
Praveen Kumar BS