How to parse a file and write to separate columns

Question

I am trying to parse the output from antismash to count the number of BGC. I found the scripts that people have with python which I know nothing about, so I am trying to figure this out using bash script.

The gene bank format files with predicted clusters look like this:

head -30 sca_11_chr8_3_0.region001.gbk
LOCUS       sca_11_chr8_3_0        45390 bp    DNA     linear   UNK 01-JAN-1980
DEFINITION  sca_11_chr8_3_0.
ACCESSION   sca_11_chr8_3_0
VERSION     sca_11_chr8_3_0
KEYWORDS    .
SOURCE
  ORGANISM
            .
COMMENT     ##antiSMASH-Data-START##
            Version      :: 6.0.1-a859617(changed)
            Run date     :: 2021-10-31 18:00:02
            NOTE: This is a single cluster extracted from a larger record!
            Orig. start  :: 169481
            Orig. end    :: 214871
            ##antiSMASH-Data-END##
FEATURES             Location/Qualifiers
     protocluster    1..45390
                     /aStool="rule-based-clusters"
                     /contig_edge="False"
                     /core_location="join{[194767:194871](-),
                     [194650:194652](-), [191596:194619](-),
                     [189481:191503](-)}"
                     /cutoff="20000"
                     /detection_rule="cds(Condensation and (AMP-binding or
                     A-OX))"
                     /neighbourhood="20000"
                     /product="NRPS"
                     /protocluster_number="1"
                     /tool="antismash"
     proto_core      complement(join(20001..22022,22116..25138,25170..25171,

tail sca_11_chr8_3_0.region001.gbk

    44881 ggagcttgtg gagagaagtg agacgtatcg cacgaatgct cttcagcaga tgctgggcag
    44941 ttagaggatt tgcactttag tttcatagag ttgatgtgtc gaggagataa tttgagatac
    45001 cagtatatgt aatttaccta cctacctagt cgagattgga cattgtacaa gagaaataac
    45061 aactaactat acgagacaag cctgatgtgt tgatagtttc attcatgtct ggtgtttgtg
    45121 gcatgtttat gttggagtag ctgtacagaa gataccgcgc tattcccagt gatcatggcc
    45181 cccacgcctc caactcggca cctgaccttg atcccctttg ggaagcatgt ctcagtgtct
    45241 cagccgtgag ccgtagaggc tgcacagcat ggagaagctg tcctgtcaat tcaggggatt
    45301 tgcccacggg ggctatcata tgatgaatct cggacaccct acacgttgtt accgcctttc
    45361 ttagctcctg ctggtagccg tcccctgaac
//

At first I concatenated the gbk files into one so it contains all the predicted clusters and then grep the characters that gave me the locus id, start and end of cluster and cluster type.

cat sca_*.gbk > Necha2_SMclusters.gbk
grep "DEFINITION\|Orig\|product=" Necha2_SMclusters.gbk > Necha2_SMclusters_filtered.txt

which gives me a file like this

DEFINITION  sca_32_chr11_3_0.
            Orig. start  :: 381231
            Orig. end    :: 428233
                     /product="T1PKS"
                     /product="T1PKS"
                     /product="T1PKS"
                     /product="T1PKS"
DEFINITION  sca_32_chr11_3_0.
            Orig. start  :: 464307
            Orig. end    :: 486217
                     /product="terpene"
                     /product="terpene"
                     /product="terpene"
                     /product="terpene"
DEFINITION  sca_33_chr6_1_0.
            Orig. start  :: 140267
            Orig. end    :: 227928
                     /product="NRPS-like"
                     /product="T1PKS"
                     /product="NRPS-like"
                     /product="T1PKS"
                     /product="NRPS-like"
                     /product="NRPS-like"
                     /product="NRPS-like"
                     /product="T1PKS"
                     /product="T1PKS"
                     /product="T1PKS"
DEFINITION  sca_39_chr11_5_0.
            Orig. start  :: 270154
            Orig. end    :: 324310
                     /product="NRPS"
                     /product="NRPS"
                     /product="NRPS"
                     /product="NRPS"

From this file I want to obtain a file which looks like this.

Locus name  start   end ClusterType
sca_9_chr7_10_0.    369577  421460  T1PKS,NRPS
sca_33_chr6_1_0.    140267  227928  NRPS-like, T1PKS
sca_32_chr11_3_0    381231  428233  T1PKS

For now this is what I need a file with all predicted clusters in it.

Thank you so much!!

What have you written or attempted so far, script-wise? Please [edit] and add your script. — Greenonline, Nov 01 '21 at 15:40
Don't write a bash (or any other shell) script just to manipulate text as the result will be relatively slow, fragile, non-portable, complicated, etc. (see https://unix.stackexchange.com/questions/169716/why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some of the issues). The people who invented shell also invented awk for shell to call to manipulate text. You can learn awk from the book Effective AWK Programming, 5th Edition, by Arnold Robbins. — Ed Morton, Nov 01 '21 at 15:43
So far you've provided 1 input record (a block of lines) to map to 1 output record (a single line). If your real data contains multiple records then show at least 2 in your sample input/output so we can help you. Whatever separates records is every bit as important as the contents of each record. You also say "I have multiple files so eventually" so a) show us at least 2 input files and b) don't say "eventually..." - just show the output you want given those multiple input files as the solution for N is almost certainly not going to be "call the solution for 1 N times". — Ed Morton, Nov 01 '21 at 15:47
Are you trying to parse a genbank entry file? There are dedicated tools for this, are you sure you want to make one yourself? If not, we would need to see what else is in the file in order to be able to help you. Those are just the lines you are interested in, but without knowing what else, we cannot figure out how to separate the lines you want from the lines you do not want. Please [edit] your question and clarify. — terdon, Nov 01 '21 at 16:16
Also, remember that this isn't a bioinformatics site. People here will have no clue what "antismash" or "BGC" or "gbk output file" are. You need to explain. For what it's worth, I am a professional bioinformatician and have been working in the field for almost 20 years and I only guessed that "gbk" meant genbank but have no clue what the other two are. — terdon, Nov 01 '21 at 16:18
@terdon I justr edited my question I hope I am much more clear now. If not could you pleas ealso direct me to the programs that can parse gbk files. — ambika, Nov 01 '21 at 16:19
Thanks for the edit, that helps! But why isn't there an output line where `ClusterType` is `terpene`? How can we know to skip that? I have written scripts using [SwissKnife](http://swissknife.sourceforge.net/) in Perl for this and I am sure there will be dedicated parses included in both bioPython and bioPerl. — terdon, Nov 01 '21 at 16:20
@EdMorton Thank you for your suggestions I changed my question as per your suggestions. I hope I am more clear now. — ambika, Nov 01 '21 at 16:21
@terdon I need everything I dont want to skip anything, I just showed 3 scaffolds as an example on how it should look like. — ambika, Nov 01 '21 at 16:23
But you show the output of `cat sca_*.gbk > Necha2_SMclusters.gbk grep "DEFINITION\|Orig\|product=" Necha2_SMclusters.gbk` as the sample input while you don't need to do any of that if you're using awk and it's probably making what you need to do with the awk script harder than it has to be as that's stripping whatever separates the blocks and/or combining them into one stream, you can just run awk directly on `sca_*.gbk` so THAT is the sample input you should provide, not the output of running some other tools on them. — Ed Morton, Nov 01 '21 at 16:26
@EdMorton Thank you for bringing that up. Yes I realized I can just avoid that step if I could figure out how to write the results I need, that's why I just posted a single input file before. I don't know how to move forward from here. Could you please help. — ambika, Nov 01 '21 at 16:33
It's not just that you **can** avoid that step, you **must** avoid that step or you make the next step (the one you're asking for help with) harder. Just create 2 or 3 small files that follow the format of your `sca_*.gbk` files in terms of what each record contains (the 3 fields you want plus a couple you don't), 1 or more records per file, and separators between records to be the sample input and show the ouput you expect given that input. — Ed Morton, Nov 01 '21 at 16:37
@EdMorton see http://scikit-bio.org/docs/0.5.2/generated/skbio.io.format.genbank.html. Records are separated by `//` which always (at least in my experience) appear by themselves and as the only characters on the line. — terdon, Nov 01 '21 at 16:38
Thanks @terdon but I'd rather get the OP to just provide a [MCVE](https://stackoverflow.com/help/minimal-reproducible-example) with concise, testable sample input and expected output so they learn how to ask their next question. — Ed Morton, Nov 01 '21 at 16:39
ambika you're REALLY missing the point of how to create a MCVE. I'm not sure how else to say it..... Maybe I'll just create my own guess at one given @terdon's comment and post an answer with that and then we can take it from there. — Ed Morton, Nov 01 '21 at 16:50
@EdMorton you are assuming the OP understands the format well enough to provide one. That is probably not the case. These files can be huge (as in gigabytes) and very varied with the name of the fields changing within the same file for different records. It isn't actually trivial to provide an MCVE here. Although it might be easier for you or me to start from the original file, that isn't actually part of the question or necessary. The OP is happy with the two step parsing they're doing, so we can take it from there. — terdon, Nov 01 '21 at 16:53
None of that matters. To create a MCVE they only need to understand the format enough to know that it has lines that want matched by `grep "DEFINITION\|Orig\|product="` (if that's not true the existing approach falls apart), other lines they don't want in between of whatever content, and blocks are separated by lines that are just `//` per your comment. Given that, it's trivial to create a MCVE. I'm just going to do it... — Ed Morton, Nov 01 '21 at 16:59
Is there a way I can upload a file, I just edited my question that shows the head and tail part of the file if that helps. — ambika, Nov 01 '21 at 17:00
@ambika uploading a file would be the wrong thing to do. We don't want to see your actual input file, we want to see a **minimal representation of your input** that you create to show us the kinds of blocks of text you need to handle and separators between those blocks. See the sample input I posted in my answer for example. — Ed Morton, Nov 01 '21 at 17:14

Ed Morton · Accepted Answer · 2021-11-01T17:18:20.033

Given this sample input:

$ cat file1.gbk
DEFINITION  sca_32_chr11_3_0.
foo
            Orig. start  :: 381231
            Orig. end    :: 428233
                     /product="T1PKS"
                     /product="T1PKS"
bar
                     /product="T1PKS"
                     /product="T1PKS"
//
stuff
DEFINITION  sca_32_chr11_3_0.
            Orig. start  :: 464307
            Orig. end    :: 486217
                     /product="terpene"
nonsense
                     /product="terpene"
                     /product="terpene"
                     /product="terpene"
//
DEFINITION  sca_33_chr6_1_0.
            Orig. start  :: 140267
            Orig. end    :: 227928
                     /product="NRPS-like"
                     /product="T1PKS"
whatever
                     /product="NRPS-like"
                     /product="T1PKS"
                     /product="NRPS-like"
                     /product="NRPS-like"
                     /product="NRPS-like"
                     /product="T1PKS"
                     /product="T1PKS"
                     /product="T1PKS"

$ cat file2.gbk
here we go
DEFINITION  sca_39_chr11_5_0.
            Orig. start  :: 270154
more irrelevant text
            Orig. end    :: 324310
                     /product="NRPS"
                     /product="NRPS"
                     /product="NRPS"
                     /product="NRPS"

This script:

$ cat tst.awk
BEGIN { OFS="\t" }
$1 == "DEFINITION"    {
    if ( ++cnt == 1 ) {
        print "Locus name", "start", "end", "ClusterType"
    }
    prt()
    locus = $2
}
/Orig\. start/        { start = $NF }
/Orig\. end/          { end = $NF }
sub(".*/product=","") { gsub(/"/,""); types[$NF] }
END { prt() }

function prt(   ct, type) {
    if ( locus != "" ) {
        for (type in types) {
            ct = (ct=="" ? "" : ct ",") type
        }
        print locus, start, end, ct
    }
    delete types
    locus = ""
}

will produce this output:

$ awk -f tst.awk *.gbk
Locus name      start   end     ClusterType
sca_32_chr11_3_0.       381231  428233  T1PKS
sca_32_chr11_3_0.       464307  486217  terpene
sca_33_chr6_1_0.        140267  227928  T1PKS,NRPS-like
sca_39_chr11_5_0.       270154  324310  NRPS

thank you so much. It worked. – ambika Nov 01 '21 at 18:18 — ambika, Nov 01 '21 at 18:18

How to parse a file and write to separate columns

1 Answers1