2

I have the following input:

cat moldata
>species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT

>species_2
CCAAGGTCATCCATGACAACTTTGAGATCATTGAAGGCCTGATGACCACTGTACACGCCACCACCGCTACTCAGAAGA
GTCGACGGACCTTCCGGTAAACTCTGGCGTGATGGTCGTGGCGCTCAACAAAACATCATTCCCGCCTCTACTGGTGCT

>species_3
CAAAGCCGTAGGCAAAGTCATTCCTGCTCTCAACGGTAAACTGACTGGCATGGCCTTCCGTGTTCCCGTTCCAAATGT
CGGTTGTGGATCTTACTGTTCGCyTGGGAAAACCAGCCTCTTATGACrCCATTAAACAGAAGGTCAAGGAGGCTGCTG

>species_4
GGTCCTTTGAAGGGTATTCTTGGATACACCGAAGATCAAGTTGTGTCCACCGACTTTGTTGGAGACACACACTCTTCA
CTTTGACGCTGCTGCTGGTATCTCCCTCAACGATAACTTCGTCAAACTTATCAGCTGGTACGACAATGAATATGGATA

>species_5
GTTCCGCAAAGCTCAATGCCCTATTGTTGAGCGTCTGACCAATTCTCTCATGATGCATGGCCGCAACAACGGCAAGAA
TGATGGCAGTGCGAATTGTTAAGCATGCCTTTGAAATCATCCACCTTCTGACTGGAGAGAATCCTCTTCAAGTACTCG

and I would like to retrieve from moldata the species record (species name + the block of lines following it until the next species record) for the species listed here:

cat species_list
species_1
species_3
species_5

to obtain the following output:

cat output
>species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT

>species_3
CAAAGCCGTAGGCAAAGTCATTCCTGCTCTCAACGGTAAACTGACTGGCATGGCCTTCCGTGTTCCCGTTCCAAATGT
CGGTTGTGGATCTTACTGTTCGCyTGGGAAAACCAGCCTCTTATGACrCCATTAAACAGAAGGTCAAGGAGGCTGCTG

>species_5
GTTCCGCAAAGCTCAATGCCCTATTGTTGAGCGTCTGACCAATTCTCTCATGATGCATGGCCGCAACAACGGCAAGAA
TGATGGCAGTGCGAATTGTTAAGCATGCCTTTGAAATCATCCACCTTCTGACTGGAGAGAATCCTCTTCAAGTACTCG

I tried to make awk work in a while loop:

while read line; 
do 
    if grep -q "$line" moldata; 
    then echo $line |  awk -v line=${line} 'BEGIN {RS=">"} /line/ {print $0}' moldata >> output; 
    else echo "$line not found"; 
    fi; 
done < species_list

I read about the getline option with awk but I couldn't make it work.

Kusalananda
  • 320,670
  • 36
  • 633
  • 936
jibbah
  • 21
  • 1
  • 2
  • 4
  • 2
    See [Extracting subset from fasta file](http://unix.stackexchange.com/q/253499) and [Getting matched fasta file](http://unix.stackexchange.com/q/156783) - if your SEQs are always split on two lines simply run `paste -d'>' /dev/null species_list | grep -A2 -Fxf - moldata` – don_crissti Dec 19 '16 at 22:22

5 Answers5

1

If you insist to do it with awk:

( echo -n '/^>('; paste -sd\| - <species_list | tr -d '\n'; echo -n ')/,/^$/' ) | \
    awk -f - moldata
Satō Katsura
  • 13,138
  • 2
  • 31
  • 48
1

For single record you could do something like this:

awk '/species_1/{print;while (getline line){if(line !~/species/) print line; else break} }' input.txt                  
>species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT

With multiple items , you might want to something along these lines:

$ while IFS= read -r line                                                                                                
> do
>   awk -v spec="$line" '$0~spec{print;while (getline line){if(line !~/species/) print line; else break} }' input.txt    
> done < species.txt                                                                                                     
>species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT

>species_3
CAAAGCCGTAGGCAAAGTCATTCCTGCTCTCAACGGTAAACTGACTGGCATGGCCTTCCGTGTTCCCGTTCCAAATGT
CGGTTGTGGATCTTACTGTTCGCyTGGGAAAACCAGCCTCTTATGACrCCATTAAACAGAAGGTCAAGGAGGCTGCTG

>species_5
GTTCCGCAAAGCTCAATGCCCTATTGTTGAGCGTCTGACCAATTCTCTCATGATGCATGGCCGCAACAACGGCAAGAA
TGATGGCAGTGCGAATTGTTAAGCATGCCTTTGAAATCATCCACCTTCTGACTGGAGAGAATCCTCTTCAAGTACTCG

Alternative solution using two files as arguments

Knowing that data in OP's example is always two lines after the species name, we can load the species names in species.txt into array, and then use for loop to read a line twice and print it out.

$ awk 'FNR==NR{species[$0]}; NR!=FNR{ sub(/>/,"");if ($0 in species){ print $0; for(i=0;i<=1;i++) {getline data;print data}  }}' species.txt input.txt
species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT
species_3
CAAAGCCGTAGGCAAAGTCATTCCTGCTCTCAACGGTAAACTGACTGGCATGGCCTTCCGTGTTCCCGTTCCAAATGT
CGGTTGTGGATCTTACTGTTCGCyTGGGAAAACCAGCCTCTTATGACrCCATTAAACAGAAGGTCAAGGAGGCTGCTG
species_5
GTTCCGCAAAGCTCAATGCCCTATTGTTGAGCGTCTGACCAATTCTCTCATGATGCATGGCCGCAACAACGGCAAGAA
TGATGGCAGTGCGAATTGTTAAGCATGCCTTTGAAATCATCCACCTTCTGACTGGAGAGAATCCTCTTCAAGTACTCG
Sergiy Kolodyazhnyy
  • 16,187
  • 11
  • 53
  • 104
0
declare -a list
readarray list < /path/to/species_list
for species in ${list[@]}; do
   sed -ne "/$species/,/^\$/p" moldata
done

Ordinarily, one would single-quote an inline sed script, but we are using a shell variable to seek out your data, and therefore have to escape the $ to keep the shell from trying to parse it.

DopeGhoti
  • 73,792
  • 8
  • 97
  • 133
  • Two problems here: 1) What if some species names contain spaces ? (even if that's not possible with the given data doing `for f in $(cat stuff)` is almost always _bad_). 2) These guys usually work with huge files. If the input had 1 million lines and you wanted to extract the data for 200 species your code would process 200 million lines. – don_crissti Dec 19 '16 at 20:56
  • 1) Not given as example input or output, so irrelevant, but addressable. 1a) fixed. 2) Not addressed in question, so irrelevant, but could easily be rewritten to assemble a `sed` script which would only need to run once on the given input file, and then run it. – DopeGhoti Dec 19 '16 at 21:25
  • Thank you, that worked well when my sequences are interspaced. But what if there are no spaces between the sequences (it is often the case)? how do you use sed to cut the text from the sequence wanted to the next ">" (i. e. the beginning of the next sequence)? – jibbah Dec 20 '16 at 09:57
  • You can probably do that with changing the end-marker from `/^\$/` to `/^>/`. What that does is change the end-marker from 'blank line' to 'any line starting with a `>`'. You may need to escape the `>` with `/^\>/`. – DopeGhoti Dec 20 '16 at 15:59
0
$ awk '
    NR==FNR{ sp[$0]; next }        # this is for reading species data into sp array
    /species/ {                    # from this point for moldata's lines
        if (substr($0,2) in sp) {  # if the read line match with data in sp array
            print                  # first print title
            while(getline > 0) {   # and then read until blank line and print
                if (NF == 0) { print ""; break }
                print 
            } 
        }
    }
' species_list moldata 
>species_1
?????????CACTTGGArGGTGGAGCCAAGAAGGTTATTATTTCTGCTCCCAGTGCTGACGCGCCCATGTACGTGGTC
TGTCAACCTCGATTCTTATGACCCATCTGCTAAGGTCATTTCGAATGCTTCCTGCACCACCAACTGCCTCGCTCCCCT

>species_3
CAAAGCCGTAGGCAAAGTCATTCCTGCTCTCAACGGTAAACTGACTGGCATGGCCTTCCGTGTTCCCGTTCCAAATGT
CGGTTGTGGATCTTACTGTTCGCyTGGGAAAACCAGCCTCTTATGACrCCATTAAACAGAAGGTCAAGGAGGCTGCTG

>species_5
GTTCCGCAAAGCTCAATGCCCTATTGTTGAGCGTCTGACCAATTCTCTCATGATGCATGGCCGCAACAACGGCAAGAA
TGATGGCAGTGCGAATTGTTAAGCATGCCTTTGAAATCATCCACCTTCTGACTGGAGAGAATCCTCTTCAAGTACTCG
mug896
  • 957
  • 9
  • 12
0

"By a special dispensation, an empty string as the value of RS indicates that records are separated by one or more blank lines", so

awk '    BEGIN { OFS=FS="\n"; RS=""; ORS="\n\n" }
         FNR==NR { for (k=0;k<NF;) ++species[">"$++k] }
         $1 in species
' species_list moldata
jthill
  • 2,671
  • 12
  • 15