Extract subsequence corresponding to n:th pattern from a file

Question

I have multiple data blocks like the one below:

chr1.trna4 (17188416-17188486)  Length: 71 bp
Type: Gly   Anticodon: CCC at 33-35 (17188448-17188450) Score: 78.3
HMM Sc=56.60    Sec struct Sc=21.70
         *    |    *    |    *    |    *    |    *    |    *    |    *    |
Seq: GCATTGGTGGTTCAGTGGTAGAATTCTCGCCTCCCACGCGGGAGaCCCGGGTTCAATTCCCGGCCAATGCA
Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.

For each block, I need to find the 8th pattern on the last line of the block, which starts with Str:. In the above case, the 8th pattern is ....... (7 periods). This is because the first set of > symbols makes one pattern, the following set of periods makes the second pattern, and so on.

Now I need to extract those 7 characters from the Seq line directly above the pattern line. In the example, this corresponds to the subsequence CTCCCAC.

The output should be: Seq is CTCCCAC and Anticodon: CCC

Is this possible in bash or any shell?

More examples of the data blocks

chr19.trna11 (4724719-4724647)  Length: 73 bp
Type: Val   Anticodon: CAC at 34-36 (4724686-4724684)   Score: 79.2
HMM Sc=49.10    Sec struct Sc=30.10
         *    |    *    |    *    |    *    |    *    |    *    |    *    |  
Seq: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA
Str: >>>>>>>..>>>..........<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.


chr19.trna12 (1383433-1383361)  Length: 73 bp
Type: Phe   Anticodon: GAA at 34-36 (1383400-1383398)   Score: 88.9
HMM Sc=68.40    Sec struct Sc=20.50
         *    |    *    |    *    |    *    |    *    |    *    |    *    |  
Seq: GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGtCCCTGGTTCGATCCCGGGTTTCGGCA
Str: >>>>>>>..>>>>........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.


chr21.trna1 (18827177-18827107) Length: 71 bp
Type: Gly   Anticodon: GCC at 33-35 (18827145-18827143) Score: 80.9
HMM Sc=60.10    Sec struct Sc=20.80
         *    |    *    |    *    |    *    |    *    |    *    |    *    |
Seq: GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA
Str: >>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.



chrX.trna4 (18693101-18693029)  Length: 73 bp
Type: Val   Anticodon: TAC at 34-36 (18693068-18693066) Score: 82.9
HMM Sc=54.70    Sec struct Sc=28.20
         *    |    *    |    *    |    *    |    *    |    *    |    *    |  
Seq: GGTTCCATAGTGTAGTGGTtATCACGTCTGCTTTACACGCAGAAGGtCCTGGGTTCGAGCCCCAGTGGAACCA
Str: >>>>>>>..>>>..........<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.


chrX.trna6 (3833344-3833271)    Length: 74 bp
Type: Ile   Anticodon: GAT at 35-37 (3833310-3833308)   Score: 75.5
HMM Sc=50.20    Sec struct Sc=25.30
         *    |    *    |    *    |    *    |    *    |    *    |    *    |   
Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.


chrX.trna8 (3794915-3794842)    Length: 74 bp
Type: Ile   Anticodon: GAT at 35-37 (3794881-3794879)   Score: 75.5
HMM Sc=50.20    Sec struct Sc=25.30
         *    |    *    |    *    |    *    |    *    |    *    |    *    |   
Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.



chrX.trna10 (3756491-3756418)   Length: 74 bp
Type: Ile   Anticodon: GAT at 35-37 (3756457-3756455)   Score: 75.5
HMM Sc=50.20    Sec struct Sc=25.30
         *    |    *    |    *    |    *    |    *    |    *    |    *    |   
Seq: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA
Str: >>>>>>>..>>>>.........<<<<.>>>>>.......<<<<<.....>>>>>.......<<<<<<<<<<<<.

chr19.trna8 (45981945-45981859) Length: 87 bp
Type: SeC   Anticodon: TCA at 36-38 (45981910-45981908) Score: 146.9
HMM Sc=0.00 Sec struct Sc=0.00
         *    |    *    |    *    |    *    |    *    |    *    |    *    |    *    |    * 
Seq: GCCCGGATGATCCTCAGTGGTCTGGGGTGCAGGCTTCAAACCTGTAGCTGTCTAGCGACAGAGTGGTTCAATTCCACCTTTCGGGCG
Str: >>>>>>>.>..>>>>>>....<<<<<<<<<<<<.......<<<<<<.>>>>>....<<<<<.>>>>.......<<<<<.<<<<<<<.

a pattern in this case is defined by anything that starts with a character and stops with same character. so if I take the line starting with Str: then I see there are 14 patterns (>>>>>>>(1) ..(2) >>>>(3) .......(4) <<<<(5) .(6) >>>>>(7) .......(8) <<<<<(9) ....(10) >>>>>(11) .......(12) <<<<<<<<<<<<(13) .(14)) — MO12, Nov 27 '19 at 17:50
Can we treat `<`, `>`, `.` as fixed values that compose the pattern? Could there be other variations, perhaps, you could also post them — RomanPerekhrest, Nov 27 '19 at 17:59
Yes, you can treat them fixed values for pattern. And yes , the pattern variation is different on different blocks of the data. some examples added in the question — MO12, Nov 27 '19 at 18:14
Sorry for formatting error when I copy pasted here. Its not inconsistent. I am editing my question — MO12, Nov 27 '19 at 18:27
Is the sequence you want _always_ the anticodon with two additional basepairs on either side? — Kusalananda, Nov 27 '19 at 18:29
@Kusalananda, my apologies. You are right. I was trying to allude that the base pairs could be different. And yes, the seq I want is always the anticodon with additional base pairs, and you will also see that the position of the anticodon is mentioned, like Anticodon: CCC at 33-35. So the sequence will be characters 31-37 on the Seq line. — MO12, Nov 27 '19 at 18:45
@glenn jackman , yes it would. I just added one more example to end of the question — MO12, Nov 27 '19 at 18:51
Can you count on the length of the 8th pattern being 7? Given `Anticodon: CCC at 33-35` will the seq you want to extract always be from indices 31-37 ? — glenn jackman, Nov 27 '19 at 19:04
Yes, the length of the 8th pattern can be counted on being 7. It changes very rarely, but I can adjust it — MO12, Nov 27 '19 at 19:09
I originally voted up, it seemed to be an interesting question. Now I'm revoking my vote because of the "shortcut" that was revealed in comments but not edited into the question body. I wasted my time building a solution that actually parses these `>>...<<<`. — Kamil Maciorowski, Nov 27 '19 at 21:05
@KamilMaciorowski - I was working on this based on the original question. I just realize the Anticodon pattern after @ Kusalananda questioned it. Sorry if I misled you. — MO12, Nov 27 '19 at 21:09
I published my code anyway. The wasted time not so wasted after all. — Kamil Maciorowski, Nov 27 '19 at 23:05
Oh! The "shortcut" *should* be edited into the question body. Without it few answers seem not to make sense until one reads the right comment (if ever). — Kamil Maciorowski, Nov 27 '19 at 23:34

Kusalananda · Answer 1 · 2019-11-28T14:35:50.357

Using awk:

$ awk -f script.awk file
Sequence: CTCACAC, Anticodon: CAC, Type: Val
Sequence: CTGAAGA, Anticodon: GAA, Type: Phe
Sequence: CTGCCAC, Anticodon: GCC, Type: Gly
Sequence: TTTACAC, Anticodon: TAC, Type: Val
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTTCAAA, Anticodon: TCA, Type: SeC

Where script.awk is the following awk program:

/^Type:/ {
        type = $2
        anticodon = $4
        split($6, pos, "-")
}

/^Seq:/ {
        seq = substr($2, pos[1]-2, length(anticodon) + 4)
        # or: seq = substr($2, pos[1]-2, pos[2]-pos[1]+5)
        printf "Sequence: %s, Anticodon: %s, Type: %s\n", seq, anticodon, type
}

The first block is triggered by any line starting with the string Type: and it picks out the type and anticodon sequence from the 2nd and 4th whitespace-delimited fields and splits the 6th such field on - to produce the start and end coordinates in the sequence.

The second block is triggered by a line starting with the string Seq: and it picks out the sequence from the 2nd whitespace-delimited field using the start position of the anticodon and the anticodon's length read from the latest Type: line, making sure to get a couple of base-pairs on either side.

The output is then produced.
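
As a worked check of the arithmetic, here is a minimal sketch that applies the same substr() call to the first block from the question (chr1.trna4, "Anticodon: CCC at 33-35"). The shell variable names seq and result and the awk variable range are only for illustration and are not part of script.awk. Since substr() counts from 1, the wanted range is positions 31-37.

```shell
# Seq value of the chr1.trna4 block from the question.
seq='GCATTGGTGGTTCAGTGGTAGAATTCTCGCCTCCCACGCGGGAGaCCCGGGTTCAATTCCCGGCCAATGCA'

# "Anticodon: CCC at 33-35": start two bases before position 33 and take
# the anticodon length (3) plus two bases on either side (4).
result=$(awk -v seq="$seq" -v anticodon=CCC -v range=33-35 'BEGIN {
    split(range, pos, "-")
    print substr(seq, pos[1] - 2, length(anticodon) + 4)
}')
echo "$result"   # CTCCCAC
```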

The following sed script uses the 8th "pattern" from the Str: line to extract the wanted sequence rather than the numerical positions for the anticodon given on the Type: line.

/^Type:[[:blank:]]*/ {
        s/.*Type: \([^[:blank:]]*\)[[:blank:]]*Anticodon: \([^[:blank:]]*\).*/ Anticodon: \2, Type: \1/
        h
}

/^Seq:[[:blank:]]*/ {
        s//Sequence: /
        G
        y/\n/,/
        w data.tmp
}

/^Str:[[:blank:]]*/ {
        s///
        s,\(\(\([<>.]\)\3*\)\{7\}\)\(\([<>.]\)\5*\).*,s/: \1\\(\4\\)[^\,]*/: \\1/;n,
        y/<>/../
        w pass2.sed
}

d

(the trailing d is not a typo).

It does so in two passes.

In the first pass, two new files are created, data.tmp and pass2.sed.

$ sed -f script.sed file

(there is no terminal output from this)

For the given data, data.tmp will look like

Sequence: GTTTCCGTAGTGTAGCGGTtATCACATTCGCCTCACACGCGAAAGGtCCCCGGTTCGATCCCGGGCGGAAACA, Anticodon: CAC, Type: Val
Sequence: GCCGAAATAGCTCAGTTGGGAGAGCGTTAGACTGAAGATCTAAAGGtCCCTGGTTCGATCCCGGGTTTCGGCA, Anticodon: GAA, Type: Phe
Sequence: GCATGGGTGGTTCAGTGGTAGAATTCTCGCCTGCCACGCGGGAGGCCCGGGTTCGATTCCCGGCCCATGCA, Anticodon: GCC, Type: Gly
Sequence: GGTTCCATAGTGTAGTGGTtATCACGTCTGCTTTACACGCAGAAGGtCCTGGGTTCGAGCCCCAGTGGAACCA, Anticodon: TAC, Type: Val
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GGCCGGTTAGCTCAGTTGGTaAGAGCGTGGTGCTGATAACACCAAGGtCGCGGGCTCGACTCCCGCACCGGCCA, Anticodon: GAT, Type: Ile
Sequence: GCCCGGATGATCCTCAGTGGTCTGGGGTGCAGGCTTCAAACCTGTAGCTGTCTAGCGACAGAGTGGTTCAATTCCACCTTTCGGGCG, Anticodon: TCA, Type: SeC

while pass2.sed is a sed script that post-processes this:

s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ..............................\(.......\)[^,]*/: \1/;n
s/: ...............................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: ................................\(.......\)[^,]*/: \1/;n
s/: .................................\(.......\)[^,]*/: \1/;n
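
To see what one of these generated commands does, here is a small sketch that applies an equivalent command to a single record. It uses the chr1.trna4 block from the question, which is an assumption on my part since that block is not in the listing above. The variable names line, skip, keep and result are illustrative only.

```shell
# One record in the format written to data.tmp, here for chr1.trna4.
line='Sequence: GCATTGGTGGTTCAGTGGTAGAATTCTCGCCTCCCACGCGGGAGaCCCGGGTTCAATTCCCGGCCAATGCA, Anticodon: CCC, Type: Gly'

# The first seven patterns of its Str line span 7+2+4+7+4+1+5 = 30
# characters, so the command skips 30 characters and keeps the next 7.
skip=$(printf '%30s' '' | tr ' ' '.')   # 30 dots
keep='.......'                          # the 8th pattern, 7 dots

result=$(printf '%s\n' "$line" | sed "s/: $skip\\($keep\\)[^,]*/: \\1/")
echo "$result"   # Sequence: CTCCCAC, Anticodon: CCC, Type: Gly
```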

Applying pass2.sed to data.tmp gives you the final result:

$ sed -f pass2.sed data.tmp
Sequence: CTCACAC, Anticodon: CAC, Type: Val
Sequence: CTGAAGA, Anticodon: GAA, Type: Phe
Sequence: CTGCCAC, Anticodon: GCC, Type: Gly
Sequence: TTTACAC, Anticodon: TAC, Type: Val
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTGATAA, Anticodon: GAT, Type: Ile
Sequence: CTTCAAA, Anticodon: TCA, Type: SeC

Note: I'm not sure how the second step performs on very large datasets.

can I also print 'Type' and its associated variable like below Sequence is CTCACAC and Anticodon is CAC , Type: Gly From your script, I see you based your logic on the anticode, rather than the 8th pattern position, is that right? — MO12, Nov 27 '19 at 20:52
@MO12 That is correct. Getting a substring at a known numerical location is less fiddly than having to calculate the position using some regular expression. To get the type, just say `type = $2` in the first block and make sure to incorporate that in the output. — Kusalananda, Nov 27 '19 at 20:54
@MO12 I wrote a `sed` thing that will use counting of the patterns too. — Kusalananda, Nov 28 '19 at 07:53

glenn jackman · Accepted Answer · 2019-11-27T19:20:03.063

Given that we can extract the starting index together with the anticodon:

len=7
prior=2

while IFS= read  -r line; do
    if [[ $line =~ Anticodon:" "([[:alpha:]]+)" at "([0-9]+) ]]; then
        anticodon=${BASH_REMATCH[1]}
        start=$(( BASH_REMATCH[2] - 1))  # string indexing is zero-based
    elif [[ $line == "Seq: "* ]]; then
        seq=${line#Seq: }
        printf "Seq: %s, Anticodon: %s\n" "${seq:start-prior:len}" "$anticodon"
    fi
done < file

A more complex solution that parses the "Str:" line each time, but does not hardcode the length as 7 (it does hardcode the "nth" pattern):

8thSeq() {
    local seq=$1 str=$2
    local last=${str:0:1}
    local nth=8 n=1 start

    for (( i=1; i < ${#str}; i++)); do
        if [[ "${str:i:1}" != "$last" ]]; then
            ((n++))
            if ((n == nth)); then
                start=$i
            elif ((n == nth+1)); then
                echo "${seq:start:i-start}"
                break
            fi
        fi
        last=${str:i:1}
    done
}

while IFS= read  -r line; do
    if [[ $line =~ Anticodon:" "([[:alpha:]]+) ]]; then
        anticodon=${BASH_REMATCH[1]}
    elif [[ $line == "Seq: "* ]]; then
        seq=${line#Seq: }
    elif [[ $line == "Str: "* ]]; then
        str=${line#Str: }
        printf "Seq: %s, Anticodon: %s\n" "$(8thSeq "$seq" "$str")" "$anticodon"
    fi
done < file

Using the "more" data, both solutions output

Seq: CTCACAC, Anticodon: CAC
Seq: CTGAAGA, Anticodon: GAA
Seq: CTGCCAC, Anticodon: GCC
Seq: TTTACAC, Anticodon: TAC
Seq: CTGATAA, Anticodon: GAT
Seq: CTGATAA, Anticodon: GAT
Seq: CTGATAA, Anticodon: GAT
Seq: CTTCAAA, Anticodon: TCA

You'll get more terse programs with awk or perl. I don't think sed is good choice here: I'm sure it can be done, but IMO complex sed programs are hard to grok. — glenn jackman, Nov 27 '19 at 19:27
I will try this soon. I dont need this to be in sed. I am open to pretty much anything. — MO12, Nov 27 '19 at 19:51
can I also print 'Type' and its associated variable along with the output. Seq: CTCACAC, Anticodon: CAC Type: Gly — MO12, Nov 27 '19 at 20:47
Sure, you just need to alter the regex pattern in the first `if` statement, and the indices for the BASH_REMATCH array will need to update accordingly (in that array, the elements at index 1, 2, etc are the contents of the capturing parentheses) — glenn jackman, Nov 27 '19 at 21:10

score 2 · Answer 3 · answered Nov 27 '19 at 22:51

Assuming that you need to parse the repetitions of the Str string:

start and end

Since the sequence of patterns could change for each block we need a way to find the 8th pattern.

It is possible to extract each repeated "pattern" (from your description, anything that starts with a character and stops with the same character) from the str with (GNU) grep:

$ str='>>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.'

$ grep -Eo '(.)\1*' <<<"$str"
>>>>>>>
..
>>>>
.......
<<<<
.
>>>>>
.......
<<<<<
....
>>>>>
.......
<<<<<<<<<<<<
.

So, the start and length of the 8th pattern (using the shell) are:

pattern=8
splitstr=( $(grep -Eo '(.)\1*' <<<"$str") )
start=0
for ((i=0; i<pattern-1; i++)); do
    start=$((start+${#splitstr[i]}))
done
len=${#splitstr[pattern-1]}

This works for any pattern number, provided the string has at least that many patterns.

Or, shorter, start and end:

start=$(echo "$str" | grep -Eo '^((.)\2+|.){7}'); start=${#start}
  end=$(echo "$str" | grep -Eo '^((.)\2+|.){8}');   end=${#end}

blocks

In AWK: It is possible (and simple) to break the file into blocks (lines separated by an empty line) by setting RS to empty "".

fields

If RS is "", awk also divides each block into fields automatically, treating newlines as field separators. The last field ($NF in awk parlance) is then the str that contains the repeated characters.
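
A minimal illustration of paragraph mode, using two abbreviated blocks that I made up for the purpose (they are shortened from the question's format, not real data):

```shell
# Two abbreviated, illustrative blocks separated by an empty line.
blocks=$(printf '%s\n' \
    'Type: Gly   Anticodon: CCC at 33-35' \
    'Seq: GCATTG' \
    'Str: >>>..<' \
    '' \
    'Type: Val   Anticodon: CAC at 34-36' \
    'Seq: GTTTCC' \
    'Str: >>..<<')

# With RS="" each block is one record, and fields are split on blanks
# and newlines alike, so $NF is the Str value and $(NF-2) the Seq value.
result=$(printf '%s\n' "$blocks" |
    awk -v RS="" '{ print NR, $2, $4, $(NF-2), $NF }')
echo "$result"
```

This prints one line per block: the record number, the type, the anticodon, the sequence and the structure string.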

So, in awk:

$ awk -vRS="" '{str=$NF; pat=8
cmd1="echo \"" str "\" | grep -Eo '\''^((.)\\2+|.){" pat-1 "}'\''";
cmd2="echo \"" str "\" | grep -Eo '\''^((.)\\2+|.){" pat   "}'\''";
cmd1 | getline start ; close(cmd1) ; start=length(start)
cmd2 | getline end   ; close(cmd2) ;   end=length(end)
print "Start:",start,"End:",end,"Sequence:",substr($(NF-2),start,end-start),"Anticodon:",$9,"Type:",$7
}' biopattern.txt


Start: 30 End: 37 Sequence: CCTCCCA Anticodon: CCC Type: Gly
Start: 31 End: 38 Sequence: CCTCACA Anticodon: CAC Type: Val
Start: 31 End: 38 Sequence: ACTGAAG Anticodon: GAA Type: Phe
Start: 30 End: 37 Sequence: CCTGCCA Anticodon: GCC Type: Gly
Start: 31 End: 38 Sequence: CTTTACA Anticodon: TAC Type: Val
Start: 32 End: 39 Sequence: GCTGATA Anticodon: GAT Type: Ile
Start: 32 End: 39 Sequence: GCTGATA Anticodon: GAT Type: Ile
Start: 32 End: 39 Sequence: GCTGATA Anticodon: GAT Type: Ile
Start: 33 End: 40 Sequence: GCTTCAA Anticodon: TCA Type: SeC

These are not the same results as those of the other answers, which are based on the number after "at".

Maybe: Is this what you meant?
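
The one-character shift appears to come from substr() in awk being 1-based: start is the combined length of the first seven patterns, so the 8th pattern begins at start+1. A small sketch under that assumption, with start and end hard-coded from the first output line above (the variable names are illustrative):

```shell
# Seq value of the chr1.trna4 block; start/end as reported above.
seq='GCATTGGTGGTTCAGTGGTAGAATTCTCGCCTCCCACGCGGGAGaCCCGGGTTCAATTCCCGGCCAATGCA'
start=30
end=37

result=$(awk -v seq="$seq" -v start="$start" -v end="$end" 'BEGIN {
    print substr(seq, start, end - start)       # begins one character early
    print substr(seq, start + 1, end - start)   # begins at the 8th pattern
}')
echo "$result"
```

The first line is CCTCCCA, as in the output above; the second is CTCCCAC, which matches what the question asks for.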

score 2 · Answer 4 · answered Nov 27 '19 at 23:04

This is my approach that actually uses these >>.......<< to find the desired sequence, as it was requested in the original question. I started working on it before this shortcut was found. It turned out to be a pleasant exercise with sed (although the approach may be far from optimal). I'm posting it here as an example that sed can do this.

<data sed -E '
/^Type:|^Seq:|^Str:/ ! d
/^Type:/ {
   s/.*(Anticodon: [CGAT]*).*/\1/
   p; d
   }
/^Seq:/ {
   s/[^CGATcgat]*([CGATcgat]*).*/-\1-/
   h; d
   }
/^Str:/ {
   s/[^><\.]*([><\.]*).*/\1/
   s/([^>])(>)/\1X\2/g
   s/([^<])(<)/\1X\2/g
   s/([^.])(\.)/\1X\2/g
   s/X./X/g
   s/./X/
   s/((X[^X]*){7})X/\1M/

   : deleting
   x; s/.//
   t forget
   : forget
   x; s/^[^M]//
   t deleting

   s/M//

   : moving
   x; s/(.)(.*)/\2\1/
   t forget2
   : forget2
   x; s/^[^X]//
   t moving

   g; s/.*-//
   }
' | sed -E '
/^Anticodon:/ { h; d}
/^Anticodon:/ ! {
   s/^/Seq: /
   s/$/, /
   G
   s/\n//
   }
'

It works like this:

The first sed
- Lines not starting with Type:, Seq: or Str: are deleted.
- For a line starting with Type: the Anticodon: information is extracted and printed.
- For a line starting with Seq: the useful string like CGAT… gets extracted and embedded with - characters (they will be useful later). The result is stored in the hold space.
- For a line starting with Str::
  - The useful string of >>...<<… gets extracted.
  - X is inserted whenever the sequence changes; consecutive characters are deleted; X replaces the first character. In the result there is X in place of the first character of every "pattern".
  - 8th X is replaced with M.
  - The deleting loop switches between the two lines and deletes leading characters one by one until M is encountered. When M is encountered, the other string has already been reduced one time too many. To compensate for this, the leading - was added earlier.
  - The moving loop switches between the two lines as well. It moves CGAT characters to the end one by one, so they pop up after the trailing -. The same loop removes characters from the "pattern" line one by one until X is encountered. As with M earlier, when X is encountered, the other string has already been shifted one time too many. To compensate for this, M was removed with a command outside of the loop (s/M//).
  - The desired string is now after - (which used to be the trailing -) in the hold space. We copy it to the pattern space, remove everything up to -. The result gets printed.
The second sed
- It's there to post-process data: to add a label, to format, to assemble one line per record.
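
The marking step is perhaps the least obvious part, so here is a small sketch that runs just those five substitutions (copied from the first sed above) on a toy string. The input '>>..<' is made up for illustration and has three patterns.

```shell
# Toy Str value with three patterns: ">>", ".." and "<".
str='>>..<'

# Insert X at every change of character, drop the character after each X,
# then replace the very first character with X.
marked=$(printf '%s\n' "$str" | sed -E '
   s/([^>])(>)/\1X\2/g
   s/([^<])(<)/\1X\2/g
   s/([^.])(\.)/\1X\2/g
   s/X./X/g
   s/./X/
')
echo "$marked"   # X>X.X
```

Each X now sits where a pattern begins, and the string keeps its original length, which is what lets the later loops walk both lines in step.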

score 1 · Answer 5 · answered Nov 28 '19 at 16:01

Run perl in paragraph mode (-00) and loop over all the paragraphs one by one (-n). First we fill the type, anticodon, sequence, and str variables by matching their properties in the current paragraph, i.e. $_.

$ perl -n00e '
   my($type, $anticodon, $seq, $str) = 
      / (?= .*\nType:      \h+ (\S+)  )
        (?= .*\hAnticodon: \h+ (\S+)  )
        (?= .*\nSeq:       \h+ (\S+)$ )
        (?= .*\nStr:       \h+ (\S+)$ )
      /xms;
   $str =~ /^((.)\2*){7}((.)\4*)/g;
   my($pos_codon, $len_codon) = (pos($str), length($3));
   my $codon = substr($seq, $pos_codon-$len_codon, $len_codon);
   print "Codon:[$codon] Anticodon:[$anticodon] Type:[$type]\n";
' file

Results:

Codon:[CTCACAC] Anticodon:[CAC] Type:[Val]
Codon:[CTGAAGA] Anticodon:[GAA] Type:[Phe]
Codon:[CTGCCAC] Anticodon:[GCC] Type:[Gly]
Codon:[TTTACAC] Anticodon:[TAC] Type:[Val]
Codon:[CTGATAA] Anticodon:[GAT] Type:[Ile]
Codon:[CTGATAA] Anticodon:[GAT] Type:[Ile]
Codon:[CTGATAA] Anticodon:[GAT] Type:[Ile]
Codon:[CTTCAAA] Anticodon:[TCA] Type:[SeC]

schrodingerscatcuriosity · Answer 6 · 2019-11-27T23:34:40.377

My two cents

while IFS= read -r line; do
  [[ "$(echo $line | grep "Type:")" ]] && codon="$(echo $line | cut -d' ' -f4)" && start="$(echo $line | cut -d' ' -f6)" && type="$(echo $line | cut -d' ' -f2)" && continue
  [[ "$(echo $line | grep "Seq:")" ]] && seq="${line##Seq: }" && continue
  if [[ -n "$seq" && -n "$codon" ]]; then 
    pos="${start%-*}" && pos="$((${pos}-3))"
    echo Seq: "${seq:$pos:7}" Codon: "$codon" Type: "$type"
    seq=
    codon=
    continue
  fi
done < file

Output:

Seq: CTCACAC Codon: CAC Type: Val
Seq: CTGAAGA Codon: GAA Type: Phe
Seq: CTGCCAC Codon: GCC Type: Gly
Seq: TTTACAC Codon: TAC Type: Val
Seq: CTGATAA Codon: GAT Type: Ile
Seq: CTGATAA Codon: GAT Type: Ile
Seq: CTGATAA Codon: GAT Type: Ile
Seq: CTTCAAA Codon: TCA Type: SeC

Find the line that starts with "Type" and extract "Type", "codon" and "position (from 33-35 to just 33)"

[[ "$(echo $line | grep "Type:")" ]] && codon="$(echo $line | cut -d' ' -f4)" && start="$(echo $line | cut -d' ' -f6)" && type="$(echo $line | cut -d' ' -f2)" && continue

Grab the line that starts with "Seq:" and extract the dna sequence:

[[ "$(echo $line | grep "Seq:")" ]] && seq="${line##Seq: }" && continue

If both $seq and $codon variables are set, print the result, then unset the variables, and then continue:

if [[ -n $seq && -n $codon ]]; then 
  pos="${start%-*}" && pos="$((${pos}-3))"
  echo Seq: ${seq:$pos:7} Codon: $codon Type: $type
  seq=
  codon=       
  continue
fi

Rakesh Sharma · Answer 7 · 2019-11-28T03:37:16.220

One approach is to use perl in paragraph mode and then split the paragraph on newlines. The last line of the paragraph is used to determine the positions and lengths of the codons, and these numbers are then used to grab the data from the line just before it.

$ perl -F\\n -l -00 -nae '
    $F[-1] =~ /^Str:\s+((.)\2*){7}((.)\4*)/g;
    my $c = substr($F[-2],pos($F[-1])-length($3),length($3));
    my $a = substr($c, 2, 3);
    print "seq:$c anti:$a";
' file.gene

seq:CTCCCAC anti:CCC

Brief Explanation:

Each record is a paragraph. That record is then split on newlines and the resulting pieces are stored in a zero-indexed array @F. The last element, $F[-1], is then scanned for repeating sequences.

((.)\2*) is a regex for one set of consecutive identical characters. Applying the braces {7} to it gives 7 such sequences. The next one is the 8th, which is what we want.
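
To make the counting concrete, here is a rough sketch (in awk rather than perl, and not part of the one-liner above) that lists every run of identical characters in the Str value of the question's first block, with its 1-based start and length. The names runs, first and len are illustrative.

```shell
# Str value of the chr1.trna4 block from the question.
str='>>>>>>>..>>>>.......<<<<.>>>>>.......<<<<<....>>>>>.......<<<<<<<<<<<<.'

runs=$(awk -v str="$str" 'BEGIN {
    n = 0; prev = ""
    for (i = 1; i <= length(str); i++) {
        c = substr(str, i, 1)
        if (c != prev) {      # a new run starts at position i
            n++
            first[n] = i
            prev = c
        }
        len[n]++
    }
    for (k = 1; k <= n; k++)
        printf "%d: start=%d length=%d\n", k, first[k], len[k]
}')
printf '%s\n' "$runs" | sed -n '8p'   # 8: start=31 length=7
```

There are 14 runs in total, and the 8th covers positions 31-37 of the Seq line.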

can you explain the first line please? — MO12, Nov 28 '19 at 03:18
Check the brief working. — Rakesh Sharma, Nov 28 '19 at 03:32

score 0 · Answer 8 · answered Dec 08 '19 at 07:48

Here is an awk script which does not depend on the Type: shortcut:

function getpos( str, seqndx )
{
   ndx = 1
   i = 1
   strlen = length(str)

   split(str, chars, "")
   mchar = chars[1]

   for (; i <= strlen; i++) {
       if (mchar != chars[i])  {
           mchar = chars[i]
           if (++ndx == seqndx)
               break
       }
   }

   seqstart = i
   for (; i <= strlen; i++) {
       if (mchar != chars[i])
           break
   }

   return seqstart " " --i
}

/^Type:/ {
    anticodon = $4
}

/^Seq:/ {
    seqstr = $2
}

/^Str:/ {
    posstr = getpos( $2, 8 )
    split(posstr, pos)
    seq = substr(seqstr, pos[1], pos[2] - pos[1] + 1)
    printf "Sequence: %s, Anticodon: %s\n", seq, anticodon
}

Here is the output produced by this script:

$ awk -f script.awk infile
Sequence: CTCACAC, Anticodon: CAC
Sequence: CTGAAGA, Anticodon: GAA
Sequence: CTGCCAC, Anticodon: GCC
Sequence: TTTACAC, Anticodon: TAC
Sequence: CTGATAA, Anticodon: GAT
Sequence: CTGATAA, Anticodon: GAT
Sequence: CTGATAA, Anticodon: GAT
Sequence: CTTCAAA, Anticodon: TCA
$

score -1 · Answer 9 · answered Nov 27 '19 at 19:52

A short example, if I understood the question correctly.

If FILE.txt contains the data you provided, run the command below; its output follows it:

for I in $(cat FILE.txt|egrep '^Seq:\s|^Str:\s'|sed ':a;N;$!ba;s/Seq:\s*//g;s/\nStr:\s*/|/g;'); do B=$(echo "$I"|cut -d'|' -f1); A=$(echo "$I"|cut -d'|' -f2); [ ${#A} ] && s=${A:0:1} && s1=0 && s2=1 && n=1 && i=1 && while [ $i -lt ${#A} ] ; do [ "${A:$i:1}" = "$s" ] && s2=$(($s2+1)) || { n=$(($n+1)); if [ $n -lt 9 ]; then s=${A:$i:1}; s1=$i; s2=1; else echo "${B:$s1:$s2}"; i=${#A}; fi; }; i=$(($i+1)); done ; done

CTGCCAC

TTTACAC

CTGATAA

Yurko

This is unreadable. Please edit and apply some formatting – glenn jackman Nov 27 '19 at 20:06
well... this is one line command to execute... sorry, everybody has his own style :) – Yurko Dec 02 '19 at 14:23
