Here's some Python code you can use (it needs Biopython):
    import argparse

    from Bio import SeqIO


    def main():
        # argparse prints its own error and exits on bad options,
        # so no try/except is needed around the parsing.
        parser = argparse.ArgumentParser(
            description="Split each sequence in a FASTA file into chunks of up to a defined length."
        )
        parser.add_argument(
            "-i",
            "--input",
            required=True,
            help="The FASTA file to be processed.",
        )
        parser.add_argument(
            "-l",
            "--length",
            type=int,
            required=True,
            help="The maximum length of each chunk.",
        )
        args = parser.parse_args()

        for record in SeqIO.parse(args.input, "fasta"):
            # Step through the sequence in windows of args.length;
            # the final chunk may be shorter than the rest.
            chunks = [record.seq[i:i + args.length] for i in range(0, len(record.seq), args.length)]
            for index, chunk in enumerate(chunks):
                # Append the chunk index so that every header stays unique.
                print(">" + record.description + "_" + str(index) + "\n" + str(chunk))


    if __name__ == "__main__":
        main()
Use it as:
    python scriptname.py -i fastafile.fa -l 6
(or whatever length you want).
On your test input:
>Header_1_0
AAAAAA
>Header_1_1
BBBBBB
>Header_2_0
CCCCCC
>Header_2_1
DDDDDD
>Header_3_0
EEEEEE
>Header_3_1
FFFFFF
Note that I've added an additional number to each header because @Lieven makes a good point that having multiple sequences with identical headers is likely to cause problems. You can amend the loop to remove this quite easily if you wish.
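As a rough sketch of that change, the helper below works on plain strings rather than Biopython records. The name `format_chunks` and its `numbered` flag are made up for illustration and are not part of the script above; switching the flag off repeats the original header on every chunk.

```python
def format_chunks(header, sequence, length, numbered=True):
    """Return FASTA-formatted text for one sequence split into chunks.

    `numbered` is a hypothetical switch: True appends _0, _1, ... to the
    header, False repeats the original header unchanged.
    """
    lines = []
    for index, start in enumerate(range(0, len(sequence), length)):
        name = f"{header}_{index}" if numbered else header
        lines.append(f">{name}")
        lines.append(sequence[start:start + length])
    return "\n".join(lines)


# Numbered headers (the behaviour of the script above).
print(format_chunks("Header_1", "AAAAAABBBBBB", 6))
# Identical headers on every chunk.
print(format_chunks("Header_1", "AAAAAABBBBBB", 6, numbered=False))
```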
It should tolerate the 'leftovers' that are shorter than your specified length if your sequences aren't exact multiples of it, which I imagine they won't be: the final chunk of each sequence simply comes out shorter.
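Leftovers are handled because a Python slice that runs past the end of a sequence just returns whatever remains. A minimal illustration with a plain string (`split_sequence` is a made-up name; Biopython `Seq` objects slice the same way):

```python
def split_sequence(sequence, length):
    """Split a string into consecutive pieces of at most `length` characters."""
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]


# 14 residues split by 6: two full chunks plus a 2-residue leftover.
print(split_sequence("AAAAAABBBBBBCC", 6))  # ['AAAAAA', 'BBBBBB', 'CC']
```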
Just a word of caution: it's not good practice to create FASTA files with the exact same header for different proteins (i.e. avoid duplicate headers in a FASTA file). Many tools that parse FASTA files key the sequences on the header, so you will overwrite previous sequences when there are duplicated headers.
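A rough way to see the problem and to check a file for it, assuming a tool that keeps sequences in a dict keyed by header (`duplicated_headers` is a hypothetical helper, and the toy records are invented for the example):

```python
from collections import Counter


def duplicated_headers(fasta_lines):
    """Return the headers that appear more than once, in first-seen order."""
    counts = Counter(line[1:].strip() for line in fasta_lines if line.startswith(">"))
    return [header for header, n in counts.items() if n > 1]


fasta = [">Header_1", "AAAAAA", ">Header_1", "BBBBBB", ">Header_2", "CCCCCC"]

# Keying sequences by header silently keeps only the last one seen.
by_header = {}
current = None
for line in fasta:
    if line.startswith(">"):
        current = line[1:]
    else:
        by_header[current] = line

print(by_header)                  # {'Header_1': 'BBBBBB', 'Header_2': 'CCCCCC'}
print(duplicated_headers(fasta))  # ['Header_1']
```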
That's a very good tip! Thanks! In that case, my headers are NCBI taxonomic IDs and I am using Kaiju to align my reads with that protein catalogue. So, as I just want to assign taxonomy to the reads through this alignment, I think there will be no problems.
If the FASTA file is as simple as that, you can try the following simple code:
On what basis are you splitting the sequences? A particular length? Or some feature of the sequence itself?
A particular length. I would like to split them so they can be up to 10k aa long.