I am sorry if this has been asked before. I have a genome assembly that I just converted from .bam to FASTA format in order to start annotation. I would like to run CEGMA on this assembly because I have concerns about its quality, but the default headers written when the FASTA file was created are not acceptable: the file contains 5237924 sequences whose FASTA headers either contain only digits or consist of digits followed by a space. E.g.
>1
>22 |
>333 xyz
I need headers that contain no spaces and include non-numeric characters (letters), as the current headers don't work with BLAST. Ideally, I would like to name each sequence "scaffold" followed by a numeric identifier, so that each header reads scaffoldN, where N is the number of that scaffold in the assembly. My coding experience is very limited, so any suggestions you might have would be very helpful.
Thanks,
Zach
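For readers with the same problem, here is a minimal Python sketch of the renaming described above. It is only one possible approach and not necessarily the one that was suggested in this thread. It assumes that numbering the sequences in file order is acceptable (the original numeric IDs are discarded, not preserved), and the function name `rename_headers` and the `scaffold` prefix are illustrative choices.

```python
import io


def rename_headers(src, dst, prefix="scaffold"):
    """Copy FASTA text from src to dst, replacing every header line
    with '>' + prefix + a running counter (1, 2, 3, ...).

    Assumption: sequences are numbered in file order; the original
    header text is dropped entirely.
    Returns the number of headers that were rewritten.
    """
    count = 0
    for line in src:
        if line.startswith(">"):
            count += 1
            dst.write(f">{prefix}{count}\n")
        else:
            # Sequence lines are copied through unchanged.
            dst.write(line)
    return count


# Small in-memory demonstration using the header styles from the question.
demo_in = io.StringIO(">1\nACGT\n>22 |\nGGCC\nTTAA\n>333 xyz\nNNNN\n")
demo_out = io.StringIO()
n_renamed = rename_headers(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

On a real assembly, the same function can be given two file objects opened with `open(...)`, one for reading the original FASTA and one for writing the renamed copy. Because it processes one line at a time, it should cope with millions of sequences without loading the whole file into memory. If the original numbers matter, it may be worth writing an old-to-new name table alongside the output.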
Thank you very much, Brian. It worked easily.