6.9 years ago
jack1120
I need to reformat headers in a fasta file with headers such as:
>Agaricus_chiangmaiensis|JF514531|SH174817.07FU|reps|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Agaricales;f__Agaricaceae;g__Agaricus;s__Agaricus_chiangmaiensis
TTGAATTATGTTTTCTAGATGGGTTGTAGCTGGCTCTTCGGAGCATGTGCACGCCTGCCTGGATTTCATTTTCATCCACCTGTGCACCTATTGTAGTCTCTGTCGGGTATTGAGGAAGTG
>Acarospora_laqueata|DQ842014|SH191965.07FU|refs|k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora_laqueata
TCGAGTTAGGGTCCCTCGGGCCCAACCTCCAACCCTTTGTGTACCTACTTTTGTTGCTTTGGCGGGCCCGCTGGGAAACTCCACCGGCGGCCACAGGCTGCCGAGCGCCCGTCAGA
>Ceratobasidiaceae_sp|DQ493566|SH185440.07FU|reps|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Cantharellales;f__Ceratobasidiaceae;g__unidentified;s__Ceratobasidiaceae_sp
TCGAACGAATGTAGAGTCGGTTGTCGCTGGCCCTCTCTGCTGGGCATGTGCACACCTTCTCTTTCATCCACACACACCTGTGCACTCGTGAAGACGGAAGGAGCGCCCTTGGGCGGCGTCC
So that they look like:
>SH174817.07FU Agaricus chiangmaiensis
TTGAATTATGTTTTCTAGATGGGTTGTAGCTGGCTCTTCGGAGCATGTGCACGCCTGCCTGGATTTCATTTTCATCCACCTGTGCACCTATTGTAGTCTCTGTCGGGTATTGAGGAAGTG
>SH191965.07FU Acarospora laqueata
TCGAGTTAGGGTCCCTCGGGCCCAACCTCCAACCCTTTGTGTACCTACTTTTGTTGCTTTGGCGGGCCCGCTGGGAAACTCCACCGGCGGCCACAGGCTGCCGAGCGCCCGTCAGA
>SH185440.07FU Ceratobasidiaceae sp
TCGAACGAATGTAGAGTCGGTTGTCGCTGGCCCTCTCTGCTGGGCATGTGCACACCTTCTCTTTCATCCACACACACCTGTGCACTCGTGAAGACGGAAGGAGCGCCCTTGGGCGGCGTCC
Is there relatively simple code that can isolate these specific elements and reorder them? I think I can get the first part with something like:
grep -o "SH[0-9.]*FU" file.fasta
But I am unsure how to isolate and reformat the genus and species names in addition to that.
This is the most asked question on BioStars. I'd suggest you start with the search box on this site.
My answer in this thread, for example, will do what you want (with a little tweaking, and assuming your FASTA files are linear, i.e. each sequence sits on a single line).
A: Fasta header trimming for multiple delimiters
That's fair. I understand the frustration and apologize for the poor etiquette. I did search some general programming sites beforehand, but lazily plopped my question here looking for a quick fix after that. I'll be better!
Not really a bioinformatics question, more of a programming one. Using your favorite scripting language, extract the header, split the content on the | separator and output what you need.
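As a rough sketch of that approach in Python: the snippet below assumes every header has at least three |-separated fields, with the Genus_species name first and the SH code third, as in the examples in the question. The function names reformat_header and reformat_fasta are made up for illustration.

```python
def reformat_header(header):
    """Turn '>Genus_species|ACC|SHcode|...' into '>SHcode Genus species'.

    Assumption: the name is field 1 and the SH code is field 3.
    """
    fields = header[1:].split("|")      # drop the leading '>' and split on '|'
    name = fields[0].replace("_", " ")  # Agaricus_chiangmaiensis -> Agaricus chiangmaiensis
    sh_code = fields[2]                 # e.g. SH174817.07FU
    return ">" + sh_code + " " + name


def reformat_fasta(lines):
    """Yield lines with headers rewritten and sequence lines left untouched."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            yield reformat_header(line)
        else:
            yield line


# Small demonstration on one record from the question (taxonomy string shortened).
example = [
    ">Agaricus_chiangmaiensis|JF514531|SH174817.07FU|reps|k__Fungi;p__Basidiomycota\n",
    "TTGAATTATGTTTTCTAGATGGGTTGTAGC\n",
]
for out in reformat_fasta(example):
    print(out)
```

To run it on a real file, pass an open file handle instead of the example list, e.g. reformat_fasta(open("file.fasta")), and write the yielded lines to a new file. Because only lines starting with > are changed, this works whether or not the sequences are wrapped across several lines.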