Hi,
I want to append a number to every header (gene name) in a FASTA file: not only to names that occur two or more times, but to unique ones as well. Does anyone have a solution? My file is pretty large, so I am looking for a script. Unfortunately, I cannot currently run R scripts on our institute server, so I would prefer Python, Perl, etc. It is similar to the previous one, but a bit different. Here is my FASTA:
>geneA
atcc
>geneB
aaat
>geneB
aaat
>geneB
aaat
>geneC
atgg
>geneC
atgg
>geneD
atcg
I need to append numbers like this:
>geneA#1
atcc
>geneB#1
aaat
>geneB#2
aaat
>geneB#3
aaat
>geneC#1
atgg
>geneC#2
atgg
>geneD#1
atcg
I have tried some Python code, but it only adds numbers to genes that occur more than once and doesn't add #1 to the unique genes.
from Bio import SeqIO
records = set()
of = open("output.fa", "w")
for record in SeqIO.parse("myfasta.fa", "fasta"):
    ID = record.id
    num = 1
    while ID in records:
        ID = "{}#{}".format(record.id, num)
        num += 1
    records.add(ID)
    record.id = ID
    record.name = ID
    record.description = ID
    SeqIO.write(record, of, "fasta")
of.close()
The output is:
>geneA
atcc
>geneB#1
aaat
>geneB#2
aaat
>geneB#3
aaat
>geneC#1
atgg
>geneC#2
atgg
>geneD
atcg
I would appreciate it if anyone can provide a solution or any ideas.
Thanks!
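One possible approach, offered as a minimal sketch rather than a definitive fix: in the attempt above, the while loop only runs when a name has already been seen, so the first occurrence of each gene is written without a suffix. Keeping a per-name counter and always appending it avoids that. The sketch below uses plain Python without Biopython and streams the file line by line, which should be fine for a large file. The function names number_headers and renumber_file are hypothetical, and it assumes the whole header line after ">" is the gene name (no extra description text).

```python
from collections import defaultdict


def number_headers(lines):
    """Yield FASTA lines, appending '#<n>' to every header line.

    Assumption: the full text after '>' is the gene name.
    """
    counts = defaultdict(int)  # occurrences seen so far, per gene name
    for line in lines:
        if line.startswith(">"):
            name = line[1:].strip()
            counts[name] += 1
            yield ">{}#{}\n".format(name, counts[name])
        else:
            # Sequence lines pass through unchanged.
            yield line


def renumber_file(in_path, out_path):
    """Stream in_path to out_path with numbered headers."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(number_headers(src))
```

Calling renumber_file("myfasta.fa", "output.fa") would then write the numbered file. Only the counter dictionary is held in memory, not the sequences.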
Thank you very much, everyone. I was able to convert my dataset.