7.7 years ago
Rose
Hi, I still have a problem modifying the headers of my multi-FASTA file. I want to change:
>gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence
>gi|62945224|ref|NC_006976.1| Mannheimia haemolytica 3259 plasmid pCCK3259, complete sequence
>gi|63219713|ref|NC_006994.1| Pasteurella multocida 381 plasmid pCCK381, complete sequence
to
>gi|51039021|pDN571
>gi|62945224|pCCK3259
>gi|63219713|pCCK381
Your help would be very useful. Thanks.
Did you make two accounts to ask these similar questions (Blaise)? See this thread from today: Modifying Fasta file header
We are two researchers, each with our own problem, even though we work in the same group. There were two problems: one has been solved, and the second one still needs a solution.
I see. You are using the example data the other person had posted which led to my question.
There are multiple threads that deal with these types of header manipulations. Have you tried to search through them?
yes, but without success
If you show what you tried and didn't work people will be more eager to point out your mistake or help you further, because you demonstrate you have put effort in this problem, too.
It might be helpful to know why you want to modify your headers in this fashion and what some of your other headers look like.
I would like to run BLAT, and for that I need unique FASTA identifiers.
This should be reasonably straightforward using biopython SeqIO.
Won't Biopython shorten the headers automatically, thus being unable to parse the header?
I'm not aware of that, what makes you think it does?
Most methods that access FASTA entries using the offsets stored in a *.fai file will truncate the header name at the first whitespace. However, Bio.SeqIO does not use this scheme. Both samtools and pyfaidx do, but there's a method in pyfaidx:
FastaRecord.longname
will recover the entire header name from the FASTA file and won't use the truncated version stored in the index file.

Thanks, Matt. I did a quick test using the OP's first example header, and it truncated to >gi|51039021|ref|NC_006130.1|
Apparently Biopython uses the strict definition (if FASTA has any) of the ID as everything before the first space. See "Can Biopython Properly Import Fasta Headers With Spaces In Them?" To get the whole header you want
SeqRecord.description
not
SeqRecord.id
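To make the ID-versus-description point concrete, and to sketch the header rewrite asked for in the question, here is a minimal plain-Python example that does not need Biopython. It is only a sketch under assumptions: the names `HEADER_RE`, `fasta_id`, `shorten_header` and `rewrite_fasta` are made up for illustration, and the regular expression assumes every header looks like the three examples (a `gi|number|` prefix, and the plasmid name following the word "plasmid" and ending at a comma or whitespace). Headers that do not fit that pattern are passed through unchanged, so it is worth checking the output for any that slipped through.

```python
import re

# Assumed header layout (taken from the three examples in the question):
#   >gi|<number>|ref|<accession>| <organism> plasmid <name>, complete sequence
HEADER_RE = re.compile(r"^>(gi\|\d+)\|.*\bplasmid\s+([^,\s]+)")


def fasta_id(header):
    """Return the part of a header before the first whitespace.

    This mimics the 'ID is everything before the first space' convention
    discussed above; it is not a call into Biopython itself.
    """
    return header.lstrip(">").split(None, 1)[0]


def shorten_header(header):
    """Turn '>gi|N|ref|ACC| ... plasmid NAME, ...' into '>gi|N|NAME'."""
    match = HEADER_RE.match(header)
    if match is None:
        # Header does not fit the assumed layout: leave it untouched.
        return header
    return ">{}|{}".format(match.group(1), match.group(2))


def rewrite_fasta(lines):
    """Yield FASTA lines with shortened headers; sequence lines are kept as-is."""
    for line in lines:
        line = line.rstrip("\n")
        yield shorten_header(line) if line.startswith(">") else line


example = [
    ">gi|51039021|ref|NC_006130.1| Streptococcus pyogenes 71-724 plasmid pDN571, complete sequence\n",
    "ACGTACGT\n",  # placeholder sequence, not real data
    ">gi|62945224|ref|NC_006976.1| Mannheimia haemolytica 3259 plasmid pCCK3259, complete sequence\n",
    "TTGACA\n",  # placeholder sequence, not real data
]

print(fasta_id(example[0].rstrip("\n")))
for out_line in rewrite_fasta(example):
    print(out_line)
```

To run it on a real file, pass an open file handle to `rewrite_fasta` and write the yielded lines to a new file. If you prefer Biopython, the same `shorten_header` logic could be applied to the full header text taken from `SeqRecord.description` rather than `SeqRecord.id`.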