I have a scaffold with N's in it, and I want to split it into separate contigs. The first reason is the N's themselves; the other is that I have non-IUPAC characters in my sequence. As a first attempt, I split on N and removed the sequences shorter than 1.
Any suggestions using SeqIO or any other tool?
myfile.fasta
>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatcgatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca
My temporary output.fasta, produced with:
sed -i.bak 's/N/\n>N\n/g' myfile.fasta
>N
agtagatgatgatagatgatgatga
>N
>N
>N
>N
>N
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca
>N
>N
>N
tcgatcgatgtagctagctgacaNctagtcgatgca
Next I remove the empty sequences, or keep only those longer than 500, to obtain a reasonable set of sequences.
>N
agtagatgatgatagatgatgatga
>N
tgttgcatgctagctagctagtcgatcgatcgatcgtagctagca
>N
tcgatcgatgtagctagctgacaNctagtcgatgca
The problem, as you can imagine, is that all sequences have the same name.
How can I number them in this order?
's/N/\n>N\n/g' also acts on the FASTA headers, so you have to discard the sequence names first. You should also split on the regular expression '[Nn]+' instead of 'N', so that a whole run of Ns, in either case, counts as a single gap.
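To make that concrete, here is a minimal Python sketch of the idea, not a definitive implementation. It keeps the header apart from the sequence, splits the sequence on '[Nn]+', drops pieces shorter than a threshold, and numbers what is left in order. The helper names read_fasta and split_on_gaps, the "_contigN" naming scheme, and the min_length parameter are my own assumptions, not something from the original post.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # "unnamed" is a placeholder for an empty header (assumption).
            name, chunks = (header[0] if header else "unnamed"), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_on_gaps(name, sequence, min_length=1):
    """Split one sequence on runs of N/n and number the surviving pieces."""
    pieces = [p for p in re.split(r"[Nn]+", sequence) if len(p) >= min_length]
    # Hypothetical naming scheme: original name plus a running index.
    return [(f"{name}_contig{i}", p) for i, p in enumerate(pieces, start=1)]


# The example record from the question, inlined so the sketch runs as is.
fasta_text = """>mysequence
agtagatgatgatagatgatgatgaNNNNtgttgcatgctagctagctagtcgatcgatcgatcgtagctagcaNNNtcgatcgatgtagctagctgacaNctagtcgatgca
"""

for seq_name, seq in read_fasta(fasta_text.splitlines()):
    for contig_name, contig in split_on_gaps(seq_name, seq):
        print(f">{contig_name}")
        print(contig)
```

For a real file, pass an open file handle to read_fasta instead of the inlined text, and raise min_length (for example to 500) to apply the length filter in the same step. If you prefer Biopython, SeqIO.parse(handle, "fasta") can take the place of read_fasta, using record.id and str(record.seq).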