I am currently working with a large fasta file (3.7GB) that has scaffolds in it. Each scaffold has a unique identifier that starts with > on the first line and on the consecutive line it has the DNA sequence like this:
>9999992:0-108
AAAGAATTGTATTCCCTCCAGGTAGGGGGGATAGTTGAGGGGATACATAG
TGGGAAGGCTTTTCATGCGGAGGGACTAGAATGTGCTCCCGACTGACAAA
GCAGCTTG
>9999993:0-118
AGGGACTAGAAATGAGATTAAAAAGAGTAAAAGCACTGATACAAGTACAA
AAACAAATTGCTTCACCTCCAAAACCCCAGAAACTGCCCCACTTGGCTCC
CATTTAACCTACCTTCAA
>9999994:0-113
CCATCCTCATCCTTTCCTCCCCATATCTTCCTCTGACCCCAAAGCTCAGG
TTTCCTGTCTTGTTTCCCAGAATCTGTACCTCATGGTAGTTAAACCTTCC
CCTCTGGCAGCCA
>9999997:0-87
AACATCCCTGTGGCCTGAGAGACTGCCAGCCACAGCGGTGACAGTCCCTG
CGAGAGGCTGCTGCAAAAAGACTGGAGAGAAAGCAGA
>9999998:0-100
AAACATCAGCGCCAAGTCCCCGAAACCAGCAGGGTCACTGGGCGGCCGGC
CTGAAATACCCCAGCAGGCCAGCAGTGCCGGGTGCCTGGGGAGGTGTCCT
>9999999:0-94
AAGAAACTTTTCCCTTAACCAATGAAGAGTTTTATGTAAAGGAAATTTAG
TAATTTTTTAAAAAATGGTAATGACAGATTTAAGTAATTTAATT
I want to split the file into small files preferably of the same length to process it, but I need to respect the ID and the sequence together, and obtain something like this:
file1.fa
>9999992:0-108
AAAGAATTGTATTCCCTCCAGGTAGGGGGGATAGTTGAGGGGATACATAG
TGGGAAGGCTTTTCATGCGGAGGGACTAGAATGTGCTCCCGACTGACAAA
GCAGCTTG
>9999993:0-118
AGGGACTAGAAATGAGATTAAAAAGAGTAAAAGCACTGATACAAGTACAA
AAACAAATTGCTTCACCTCCAAAACCCCAGAAACTGCCCCACTTGGCTCC
CATTTAACCTACCTTCAA
file2.fasta
>9999994:0-113
CCATCCTCATCCTTTCCTCCCCATATCTTCCTCTGACCCCAAAGCTCAGG
TTTCCTGTCTTGTTTCCCAGAATCTGTACCTCATGGTAGTTAAACCTTCC
CCTCTGGCAGCCA
>9999997:0-87
AACATCCCTGTGGCCTGAGAGACTGCCAGCCACAGCGGTGACAGTCCCTG
CGAGAGGCTGCTGCAAAAAGACTGGAGAGAAAGCAGA
file3.fasta
>9999998:0-100
AAACATCAGCGCCAAGTCCCCGAAACCAGCAGGGTCACTGGGCGGCCGGC
CTGAAATACCCCAGCAGGCCAGCAGTGCCGGGTGCCTGGGGAGGTGTCCT
>9999999:0-94
AAGAAACTTTTCCCTTAACCAATGAAGAGTTTTATGTAAAGGAAATTTAG
TAATTTTTTAAAAAATGGTAATGACAGATTTAAGTAATTTAATT
Please help me. I have tried to use csplit and grep but I get the wrong outputs.