
I am currently working with a large FASTA file (3.7 GB) that contains scaffolds. Each scaffold has a unique identifier starting with > on its own line, followed by the DNA sequence on the following lines, like this:

>9999992:0-108
AAAGAATTGTATTCCCTCCAGGTAGGGGGGATAGTTGAGGGGATACATAG
TGGGAAGGCTTTTCATGCGGAGGGACTAGAATGTGCTCCCGACTGACAAA
GCAGCTTG
>9999993:0-118
AGGGACTAGAAATGAGATTAAAAAGAGTAAAAGCACTGATACAAGTACAA
AAACAAATTGCTTCACCTCCAAAACCCCAGAAACTGCCCCACTTGGCTCC
CATTTAACCTACCTTCAA
>9999994:0-113
CCATCCTCATCCTTTCCTCCCCATATCTTCCTCTGACCCCAAAGCTCAGG
TTTCCTGTCTTGTTTCCCAGAATCTGTACCTCATGGTAGTTAAACCTTCC
CCTCTGGCAGCCA
>9999997:0-87
AACATCCCTGTGGCCTGAGAGACTGCCAGCCACAGCGGTGACAGTCCCTG
CGAGAGGCTGCTGCAAAAAGACTGGAGAGAAAGCAGA
>9999998:0-100
AAACATCAGCGCCAAGTCCCCGAAACCAGCAGGGTCACTGGGCGGCCGGC
CTGAAATACCCCAGCAGGCCAGCAGTGCCGGGTGCCTGGGGAGGTGTCCT
>9999999:0-94
AAGAAACTTTTCCCTTAACCAATGAAGAGTTTTATGTAAAGGAAATTTAG
TAATTTTTTAAAAAATGGTAATGACAGATTTAAGTAATTTAATT

I want to split the file into smaller files, preferably of the same length, to process it, but I need to keep each ID together with its sequence, and obtain something like this:

file1.fa
>9999992:0-108
AAAGAATTGTATTCCCTCCAGGTAGGGGGGATAGTTGAGGGGATACATAG
TGGGAAGGCTTTTCATGCGGAGGGACTAGAATGTGCTCCCGACTGACAAA
GCAGCTTG
>9999993:0-118
AGGGACTAGAAATGAGATTAAAAAGAGTAAAAGCACTGATACAAGTACAA
AAACAAATTGCTTCACCTCCAAAACCCCAGAAACTGCCCCACTTGGCTCC
CATTTAACCTACCTTCAA

file2.fasta
>9999994:0-113
CCATCCTCATCCTTTCCTCCCCATATCTTCCTCTGACCCCAAAGCTCAGG
TTTCCTGTCTTGTTTCCCAGAATCTGTACCTCATGGTAGTTAAACCTTCC
CCTCTGGCAGCCA
>9999997:0-87
AACATCCCTGTGGCCTGAGAGACTGCCAGCCACAGCGGTGACAGTCCCTG
CGAGAGGCTGCTGCAAAAAGACTGGAGAGAAAGCAGA

file3.fasta
>9999998:0-100
AAACATCAGCGCCAAGTCCCCGAAACCAGCAGGGTCACTGGGCGGCCGGC
CTGAAATACCCCAGCAGGCCAGCAGTGCCGGGTGCCTGGGGAGGTGTCCT
>9999999:0-94
AAGAAACTTTTCCCTTAACCAATGAAGAGTTTTATGTAAAGGAAATTTAG
TAATTTTTTAAAAAATGGTAATGACAGATTTAAGTAATTTAATT

Please help me. I have tried using csplit and grep, but I get the wrong output.

  • please show what you tried and show the results you got – jsotola Oct 13 '22 at 02:30
  • *split the file into small files preferably of the same length*? What is the length that you want for each file? 100MB perhaps? – Edgar Magallon Oct 13 '22 at 02:56
  • Could you split the file into several files that each contain the same number of `unique identifiers` (with their respective DNA)? I think that's a possible solution without having to check the length of each file. – Edgar Magallon Oct 13 '22 at 03:04
  • Yes, I would like to split it into 100 MB files, but with each identifier and its sequence kept together. And that's a good idea, having the same number of unique identifiers; the character that identifies them, which they all have in common, is '>'. I have 3714529 scaffolds, so dividing the whole file into 10 files with 371,453 unique identifiers each would also be okay. – Nadia Tamayo Oct 13 '22 at 03:33
  • You may not be aware of the Stackexchange Bioinformatics site. It has multiple questions with answers relating to splitting fasta files: https://bioinformatics.stackexchange.com/search?q=split+fasta The users of that site would also be able to help you with specialised software for working with fasta files. – Kusalananda Oct 13 '22 at 06:08

2 Answers


The code below might help you, but it might be slow for huge files (depending on your computer's specs).

First, get the total number of unique identifiers. You can do that with grep -c:

total=$(grep -c "^>" largeFastaFile.txt)

The command above assigns the count of lines starting with > to the total variable. Next, work out how many unique identifiers each file should hold: if you want 10 files, divide total by 10:

max=$((total/10))
# If total is 3714529, max will be 371452.

Finally, with awk you can split your large file into 10 files (actually 11) with approximately 371,452 unique identifiers per file:

awk -v maxline="$max" '
  NR > 1 {
    if ((NR - 2) % maxline == 0) count++
    print ">" $0 > ("file" count ".fasta")
  }' RS='>' largeFastaFile.txt

The script should look like this:

#!/bin/bash

total=$(grep -c "^>" largeFastaFile.txt)
max=$((total/10)) # where 10 is the number of files you will get

awk -v maxline="$max" '
  NR > 1 {
    if ((NR - 2) % maxline == 0) count++
    print ">" $0 > ("file" count ".fasta")
  }' RS='>' largeFastaFile.txt
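To see the approach in action, here is a self-contained run on a tiny made-up file (the name tiny.fasta, the record IDs, and the 2-way split are all invented for this demo):

```shell
# Build a throwaway 4-record FASTA file (contents invented for the demo).
cat > tiny.fasta <<'EOF'
>seq1
ACGT
>seq2
GGCC
>seq3
TTAA
>seq4
CATG
EOF

total=$(grep -c "^>" tiny.fasta)   # 4 identifiers
max=$((total / 2))                 # 2 identifiers per file

# Same awk logic as above: with RS='>' every record is one identifier
# plus its sequence lines, so counting records counts scaffolds.
awk -v maxline="$max" '
  NR > 1 {
    if ((NR - 2) % maxline == 0) count++
    print ">" $0 > ("file" count ".fasta")
  }' RS='>' tiny.fasta

grep -c "^>" file1.fasta file2.fasta   # 2 identifiers in each file
```

Note that the first record awk sees (the empty text before the first >) is skipped by the NR > 1 condition, which is why the counting starts at NR-2.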

You will actually get 11 files, because if the total is 3714529 and you want 10 files with 371,452 unique identifiers each, a few records are left over and have to go into an extra file:

371,452 * 10 = 3,714,520

So the 11th file will hold the last 9 unique identifiers.
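If you would rather end up with exactly 10 files, one option (a small tweak, not part of the answer above) is to round the division up instead of down, so the leftover records shrink the last file instead of spilling into an 11th:

```shell
total=3714529
max=$(( (total + 10 - 1) / 10 ))   # ceiling division
echo "$max"                        # 371453: 10 files, the last one slightly smaller
```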

Edgar Magallon

I have been using a couple of scripts, written by an old colleague years ago, to convert between fasta and tbl (sequence identifier, TAB, sequence on a single line) formats. I have posted the scripts here. Using them, you can easily convert your file to one sequence per line, split it specifying how many lines you want in each file, and then convert the split files back to fasta:

FastaToTbl file.fa | split -l 2 - splitFile

Using your example input, that creates these files:

$ for f in splitFile*; do printf "\n===== File: %s ====\n" "$f"; cat "$f"; done

===== File: splitFileaa ====
9999992:0-108   AAAGAATTGTATTCCCTCCAGGTAGGGGGGATAGTTGAGGGGATACATAGTGGGAAGGCTTTTCATGCGGAGGGACTAGAATGTGCTCCCGACTGACAAAGCAGCTTG
9999993:0-118   AGGGACTAGAAATGAGATTAAAAAGAGTAAAAGCACTGATACAAGTACAAAAACAAATTGCTTCACCTCCAAAACCCCAGAAACTGCCCCACTTGGCTCCCATTTAACCTACCTTCAA

===== File: splitFileab ====
9999994:0-113   CCATCCTCATCCTTTCCTCCCCATATCTTCCTCTGACCCCAAAGCTCAGGTTTCCTGTCTTGTTTCCCAGAATCTGTACCTCATGGTAGTTAAACCTTCCCCTCTGGCAGCCA
9999997:0-87    AACATCCCTGTGGCCTGAGAGACTGCCAGCCACAGCGGTGACAGTCCCTGCGAGAGGCTGCTGCAAAAAGACTGGAGAGAAAGCAGA

===== File: splitFileac ====
9999998:0-100   AAACATCAGCGCCAAGTCCCCGAAACCAGCAGGGTCACTGGGCGGCCGGCCTGAAATACCCCAGCAGGCCAGCAGTGCCGGGTGCCTGGGGAGGTGTCCT
9999999:0-94    AAGAAACTTTTCCCTTAACCAATGAAGAGTTTTATGTAAAGGAAATTTAGTAATTTTTTAAAAAATGGTAATGACAGATTTAAGTAATTTAATT

Change the -l 2 to however many sequences you want per file. Then, iterate over the split files and convert each of them back to fasta:

c=0
for f in splitFile*; do 
    outFile=file.$((++c)).fa
    TblToFasta "$f" > "$outFile"
done

This results in the following files:

$ for f in file.*.fa; do printf "\n===== File: %s ====\n" "$f"; cat "$f"; done

===== File: file.1.fa ====
>9999992:0-108 
AAAGAATTGTATTCCCTCCAGGTAGGGGGGATAGTTGAGGGGATACATAGTGGGAAGGCT
TTTCATGCGGAGGGACTAGAATGTGCTCCCGACTGACAAAGCAGCTTG
>9999993:0-118 
AGGGACTAGAAATGAGATTAAAAAGAGTAAAAGCACTGATACAAGTACAAAAACAAATTG
CTTCACCTCCAAAACCCCAGAAACTGCCCCACTTGGCTCCCATTTAACCTACCTTCAA

===== File: file.2.fa ====
>9999994:0-113 
CCATCCTCATCCTTTCCTCCCCATATCTTCCTCTGACCCCAAAGCTCAGGTTTCCTGTCT
TGTTTCCCAGAATCTGTACCTCATGGTAGTTAAACCTTCCCCTCTGGCAGCCA
>9999997:0-87 
AACATCCCTGTGGCCTGAGAGACTGCCAGCCACAGCGGTGACAGTCCCTGCGAGAGGCTG
CTGCAAAAAGACTGGAGAGAAAGCAGA

===== File: file.3.fa ====
>9999998:0-100 
AAACATCAGCGCCAAGTCCCCGAAACCAGCAGGGTCACTGGGCGGCCGGCCTGAAATACC
CCAGCAGGCCAGCAGTGCCGGGTGCCTGGGGAGGTGTCCT
>9999999:0-94 
AAGAAACTTTTCCCTTAACCAATGAAGAGTTTTATGTAAAGGAAATTTAGTAATTTTTTA
AAAAATGGTAATGACAGATTTAAGTAATTTAATT

The two scripts used are:

  • FastaToTbl

      #!/usr/bin/awk -f
      {
          # Header line: print the ID followed by a TAB; from the second
          # record onwards, first terminate the previous sequence line.
          if (substr($1, 1, 1) == ">")
              if (NR > 1)
                  printf "\n%s\t", substr($0, 2, length($0) - 1)
              else
                  printf "%s\t", substr($0, 2, length($0) - 1)
          # Sequence line: append it without a newline.
          else
              printf "%s", $0
      }
      END { printf "\n" }
    
  • TblToFasta

      #!/usr/bin/awk -f
      {
          sequence = $NF
          ls = length(sequence)
          is = 1

          # Print all header fields, prefixed with ">" and separated by spaces.
          fld = 1
          while (fld < NF) {
              if (fld == 1) { printf ">" }
              printf "%s ", $fld
              if (fld == NF - 1) {
                  printf "\n"
              }
              fld = fld + 1
          }

          # Print the sequence wrapped to 60 characters per line.
          while (is <= ls) {
              printf "%s\n", substr(sequence, is, 60)
              is = is + 60
          }
      }
    

Save those in your $PATH and make them executable.
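As a quick sanity check of the round trip, the sketch below inlines simplified stand-ins for the two scripts as shell functions (simplified in that the TblToFasta stand-in assumes identifiers without spaces; the file names mini.fa and part* are invented for the demo):

```shell
# Simplified stand-ins, good enough for identifiers without spaces.
FastaToTbl() {
  awk '{ if (substr($1,1,1) == ">") { if (NR > 1) printf "\n"; printf "%s\t", substr($0,2) } else printf "%s", $0 } END { printf "\n" }' "$@"
}
TblToFasta() {
  awk -F '\t' '{ printf ">%s\n", $1; for (i = 1; i <= length($2); i += 60) print substr($2, i, 60) }' "$@"
}

printf '>id1\nACGT\nTTTT\n>id2\nGGCC\n' > mini.fa   # tiny made-up input
FastaToTbl mini.fa | split -l 1 - part              # one sequence per split file
for f in part*; do TblToFasta "$f"; done            # back to fasta on stdout
```

The loop prints one complete fasta record per split file, confirming that no ID is ever separated from its sequence.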

terdon