I have a table-formatted file that looks something like this:

abc00000000     1   643301  643374  Ile AAT 0   0   80.6    
abc00000000     2   1278112 1278193 Ser GCT 0   0   86.2    
abc00000000     3   1278382 1278463 Ser GCT 0   0   87.4    
abc00000000     4   1282753 1282824 Glu TTC 0   0   70.9    
abc00000001     1   138441  138512  Glu TTC 0   0   70.9    
abc00000001     2   186490  186571  Leu AAG 0   0   71.6
abc00000002     1   1342954 1343060 Tyr GTA 1342991 1343024 78.3    
abc00000002     2   1359693 1359620 Val AAC 0   0   75.1    
abc00000002     3   943029  942957  Val CAC 0   0   73.2

I just care about the first two columns.

The first column represents names of scaffolds of DNA and the second column is the number of times something different occurs in these scaffolds (let's say a mutation, that is different every time).

I am trying to find a command that gives me the number of mutations per scaffold: in scaffold "abc00000000" there are 4 mutations, in scaffold "abc00000001" there are 2 mutations, and so on.

Maybe something with "awk" works, but I couldn't find the right command. Thank you

Jeff Schaller
  • Possible duplicate of [Count distinct values of a field in a file](https://unix.stackexchange.com/questions/28845/count-distinct-values-of-a-field-in-a-file) – muru Jul 29 '19 at 13:26
  • Or maybe https://unix.stackexchange.com/questions/360147/count-the-number-of-occurrences-of-a-column-value-in-a-tsv-file-with-awko – muru Jul 29 '19 at 13:29
  • Not knowing anything about the data, I notice that `abc00000000` appears four times in the file; you care about the value `4` because that's the highest value in column 2 for abc00000000, or because it's the *last* value for it, or something else? I might have simplistically added 1+2+3+4, so I'm curious what the reasoning is in selecting 4. – Jeff Schaller Jul 29 '19 at 13:36
  • Yeah, sorry, you are correct: I care about the highest number for each scaffold. The data was generated so that the mutations in each scaffold are counted (the third and fourth columns give the position on the DNA in nucleotides). – Max Mustermann Jul 29 '19 at 13:41
  • So is it the number of times the entry shows up, or the highest value of any of those entries? They're accidentally the same here, but the logic should be correct. – Jeff Schaller Jul 29 '19 at 13:45
  • It's the highest value. For example, in scaffold "abc00000000" there are 4 mutations. The program generates a row for each mutation, and I want to count the number of mutations in each scaffold. – Max Mustermann Jul 29 '19 at 13:49
  • @MaxMustermann the confusion is because column 2 seems to be incrementing. So it looks like there is a total of 4 + 3 + 2 + 1 = 10 mutations for abc00000000, not 4. – terdon Jul 29 '19 at 14:32

3 Answers

1

It sounds like you just want to count the number of times each scaffold's name appears in the first column. If so, you could do:

$ sort file | awk '{print $1}' | uniq -c
4 abc00000000
2 abc00000001
3 abc00000002

Or, if the file is enormous and you don't want to sort it:

$ awk '{a[$1]++}END{for(i in a){print i, a[i]}}' file 
abc00000000 4
abc00000001 2
abc00000002 3
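
One caveat: the order in which `for (i in a)` visits keys is unspecified in awk, so the output is not guaranteed to come out sorted as shown above. A minimal sketch that prints scaffolds in first-seen order instead, assuming whitespace-separated input; the inlined sample rows and the `count`, `order` and `n` names are illustrative only:

```shell
# Count rows per scaffold while remembering the order in which names first appear.
# The here-document stands in for the real file (first four columns of the question's data).
result=$(awk '
    !($1 in count) { order[++n] = $1 }   # first time this name is seen
    { count[$1]++ }                      # one more row for this scaffold
    END { for (k = 1; k <= n; k++) print order[k], count[order[k]] }
' <<'EOF'
abc00000000 1 643301 643374
abc00000000 2 1278112 1278193
abc00000000 3 1278382 1278463
abc00000000 4 1282753 1282824
abc00000001 1 138441 138512
abc00000001 2 186490 186571
abc00000002 1 1342954 1343060
abc00000002 2 1359693 1359620
abc00000002 3 943029 942957
EOF
)

printf '%s\n' "$result"
```

With a real file, the here-document would be replaced by the file name as the last argument to awk.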
terdon
0

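This awk method keeps the highest value of column 2 for each scaffold, which is what the comments say you are after (summing column 2 with += would give 10 for abc00000000, not 4):

awk '$2+0 > max[$1]+0 { max[$1] = $2 } END { for (i in max) print i, max[i] }' file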
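
Since the comments distinguish between the number of rows per scaffold and the highest value in column 2, a sketch that reports both side by side may help confirm that they agree for a given file. It assumes whitespace-separated input with positive integers in column 2; the sample rows are inlined from the question, and the `rows` and `max` array names are illustrative:

```shell
# Print "name row_count highest_column_2" for each scaffold, sorted by name.
# The here-document stands in for the real file.
result=$(awk '
    { rows[$1]++ }                           # number of lines per scaffold
    $2 + 0 > max[$1] + 0 { max[$1] = $2 }    # highest column-2 value per scaffold
    END { for (name in rows) print name, rows[name], max[name] }
' <<'EOF' | sort
abc00000000 1 643301 643374
abc00000000 2 1278112 1278193
abc00000000 3 1278382 1278463
abc00000000 4 1282753 1282824
abc00000001 1 138441 138512
abc00000001 2 186490 186571
abc00000002 1 1342954 1343060
abc00000002 2 1359693 1359620
abc00000002 3 943029 942957
EOF
)

printf '%s\n' "$result"
```

If the second and third output columns ever differ, the numbering in column 2 has gaps or repeats, and the two interpretations are no longer interchangeable.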
0

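Using standard command-line tools:

$ cut -d" " -f 1 file.txt | sort | uniq -c

$ cut -f 1 file.txt | sort | uniq -c

-d" " : if the table is separated by spaces

no -d option : if the table is separated by TAB, which is cut's default delimiter. Note that -d"\t" does not work, because cut receives the two characters \ and t rather than a tab; in bash, -d$'\t' is equivalent to the default.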
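
If it is unclear whether the file uses tabs, spaces, or a mix of both, one option (a sketch, not part of the original answer) is to normalise the separators with `tr` before cutting. Here `tr -s ' ' '\t'` turns spaces into tabs and squeezes each run of tabs into one; the sample rows are made up to mix both separators:

```shell
# Normalise blanks to single tabs, take column 1, then count occurrences.
# The final awk step only rewrites the "count name" output of uniq -c as "name count".
result=$(printf 'abc00000000\t1\t643301\nabc00000000   2   1278112\nabc00000001 1 138441\n' |
    tr -s ' ' '\t' |
    cut -f 1 |
    sort | uniq -c |
    awk '{ print $2, $1 }')

printf '%s\n' "$result"
```

This assumes the scaffold name starts at the beginning of each line; with leading blanks, the first tab-separated field would be empty.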