Reformatting fasta headers
6.9 years ago
jack1120 ▴ 30

I need to reformat headers in a fasta file with headers such as:

>Agaricus_chiangmaiensis|JF514531|SH174817.07FU|reps|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Agaricales;f__Agaricaceae;g__Agaricus;s__Agaricus_chiangmaiensis
TTGAATTATGTTTTCTAGATGGGTTGTAGCTGGCTCTTCGGAGCATGTGCACGCCTGCCTGGATTTCATTTTCATCCACCTGTGCACCTATTGTAGTCTCTGTCGGGTATTGAGGAAGTG
>Acarospora_laqueata|DQ842014|SH191965.07FU|refs|k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora_laqueata
TCGAGTTAGGGTCCCTCGGGCCCAACCTCCAACCCTTTGTGTACCTACTTTTGTTGCTTTGGCGGGCCCGCTGGGAAACTCCACCGGCGGCCACAGGCTGCCGAGCGCCCGTCAGA
>Ceratobasidiaceae_sp|DQ493566|SH185440.07FU|reps|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Cantharellales;f__Ceratobasidiaceae;g__unidentified;s__Ceratobasidiaceae_sp
TCGAACGAATGTAGAGTCGGTTGTCGCTGGCCCTCTCTGCTGGGCATGTGCACACCTTCTCTTTCATCCACACACACCTGTGCACTCGTGAAGACGGAAGGAGCGCCCTTGGGCGGCGTCC

So that they look like:

>SH174817.07FU Agaricus chiangmaiensis
TTGAATTATGTTTTCTAGATGGGTTGTAGCTGGCTCTTCGGAGCATGTGCACGCCTGCCTGGATTTCATTTTCATCCACCTGTGCACCTATTGTAGTCTCTGTCGGGTATTGAGGAAGTG
>SH191965.07FU Acarospora laqueata
TCGAGTTAGGGTCCCTCGGGCCCAACCTCCAACCCTTTGTGTACCTACTTTTGTTGCTTTGGCGGGCCCGCTGGGAAACTCCACCGGCGGCCACAGGCTGCCGAGCGCCCGTCAGA
>SH185440.07FU Ceratobasidiaceae sp
TCGAACGAATGTAGAGTCGGTTGTCGCTGGCCCTCTCTGCTGGGCATGTGCACACCTTCTCTTTCATCCACACACACCTGTGCACTCGTGAAGACGGAAGGAGCGCCCTTGGGCGGCGTCC

Is there a relatively simple command or script that can isolate these specific elements and reorder them? I think I can get the first part with something like:

grep -r -o "SH.*FU" file.fasta

But I am unsure how to isolate and reformat the genus and species names in addition to that.

next-gen sequencing fasta headers • 2.6k views

This is the most asked question on Biostars, so I'd suggest you start with the search box on this site.

My answer in this thread, for example, will do what you want (with a little tweaking, and assuming your FASTA files are linear, i.e. each sequence sits on a single line).

A: Fasta header trimming for multiple delimiters


That's fair. I understand the frustration and apologize for the poor etiquette. I did search some general programming sites beforehand, but lazily plopped my question here looking for a quick fix after that. I'll do better!


Not really a bioinformatics question, more of a programming one. Using your favorite scripting language, extract the header, split the content on the | separator and output what you need.
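The split-on-separator approach described above could look roughly like this in Python. This is only a sketch: the function names `reformat_header` and `reformat_fasta` are made up for illustration, and it assumes every header has at least three `|`-separated fields, with the species name first and the SH identifier third, as in the example headers in the question.

```python
def reformat_header(header: str) -> str:
    """Turn '>Genus_species|accession|SH_id|...' into '>SH_id Genus species'.

    Assumes (as in the question's examples) that field 1 is the name
    and field 3 is the SH identifier.
    """
    # Drop the trailing newline and the leading '>', then split on '|'.
    fields = header.rstrip("\n")[1:].split("|")
    name = fields[0].replace("_", " ")  # Agaricus_chiangmaiensis -> Agaricus chiangmaiensis
    sh_id = fields[2]                   # e.g. SH174817.07FU
    return f">{sh_id} {name}"


def reformat_fasta(lines):
    """Yield lines with headers rewritten and sequence lines left unchanged."""
    for line in lines:
        if line.startswith(">"):
            yield reformat_header(line)
        else:
            yield line.rstrip("\n")


# Small in-memory demo using one record from the question.
example = [
    ">Agaricus_chiangmaiensis|JF514531|SH174817.07FU|reps|k__Fungi;p__Basidiomycota\n",
    "TTGAATTATGTTTTCTAGATGG\n",
]
for out in reformat_fasta(example):
    print(out)
# prints:
# >SH174817.07FU Agaricus chiangmaiensis
# TTGAATTATGTTTTCTAGATGG
```

To run it on a real file, pass an open file handle to `reformat_fasta` instead of the example list. Because only lines starting with `>` are modified, the sequence lines pass through untouched whether or not they are wrapped.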

6.9 years ago

Given in.fa:

$ more in.fa
>Agaricus_chiangmaiensis|JF514531|SH174817.07FU|reps|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Agaricales;f__Agaricaceae;g__Agaricus;s__Agaricus_chiangmaiensis
TTGAATTATGTTTTCTAGATGGGTTGTAGCTGGCTCTTCGGAGCATGTGCACGCCTGCCTGGATTTCATTTTCATCCACCTGTGCACCTATTGTAGTCTCTGTCGGGTATTGAGGAAGTG
>Acarospora_laqueata|DQ842014|SH191965.07FU|refs|k__Fungi;p__Ascomycota;c__Lecanoromycetes;o__Acarosporales;f__Acarosporaceae;g__Acarospora;s__Acarospora_laqueata
TCGAGTTAGGGTCCCTCGGGCCCAACCTCCAACCCTTTGTGTACCTACTTTTGTTGCTTTGGCGGGCCCGCTGGGAAACTCCACCGGCGGCCACAGGCTGCCGAGCGCCCGTCAGA
>Ceratobasidiaceae_sp|DQ493566|SH185440.07FU|reps|k__Fungi;p__Basidiomycota;c__Agaricomycetes;o__Cantharellales;f__Ceratobasidiaceae;g__unidentified;s__Ceratobasidiaceae_sp
TCGAACGAATGTAGAGTCGGTTGTCGCTGGCCCTCTCTGCTGGGCATGTGCACACCTTCTCTTTCATCCACACACACCTGTGCACTCGTGAAGACGGAAGGAGCGCCCTTGGGCGGCGTCC

Here's one way:

$ awk '{ if ($0~/^>/) { n=split($0, a, "|"); gsub(/_/," ", a[1]); printf(">%s %s\n", a[3], substr(a[1], 2)); } else { print $0; } }' in.fa
>SH174817.07FU Agaricus chiangmaiensis
TTGAATTATGTTTTCTAGATGGGTTGTAGCTGGCTCTTCGGAGCATGTGCACGCCTGCCTGGATTTCATTTTCATCCACCTGTGCACCTATTGTAGTCTCTGTCGGGTATTGAGGAAGTG
>SH191965.07FU Acarospora laqueata
TCGAGTTAGGGTCCCTCGGGCCCAACCTCCAACCCTTTGTGTACCTACTTTTGTTGCTTTGGCGGGCCCGCTGGGAAACTCCACCGGCGGCCACAGGCTGCCGAGCGCCCGTCAGA
>SH185440.07FU Ceratobasidiaceae sp
TCGAACGAATGTAGAGTCGGTTGTCGCTGGCCCTCTCTGCTGGGCATGTGCACACCTTCTCTTTCATCCACACACACCTGTGCACTCGTGAAGACGGAAGGAGCGCCCTTGGGCGGCGTCC

This works perfectly. Thank you, Alex!

6.9 years ago
sed '/^>/s/>\([^|]*\)|[^|]*|\([^|]*\)|.*/>\2 \1/;/^>/s/_/ /g' in.fasta
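For readers who find the sed expression above hard to parse, here is a hedged Python sketch of the same two steps: first capture fields 1 and 3 and swap them, then replace underscores with spaces on header lines. The names `HEADER_RE` and `rewrite_header` are illustrative only, and the pattern assumes headers with at least four `|`-separated fields, as in the question.

```python
import re

# Verbose regex mirroring the sed pattern, one field per line.
HEADER_RE = re.compile(
    r"""
    ^>(?P<name>[^|]*)   # field 1: Genus_species, up to the first '|'
    \|[^|]*             # field 2: accession, matched but discarded
    \|(?P<sh>[^|]*)     # field 3: the SH identifier
    \|.*$               # everything after field 3, discarded
    """,
    re.VERBOSE,
)


def rewrite_header(line: str) -> str:
    """Rewrite a header line; return sequence lines unchanged."""
    line = line.rstrip("\n")
    if not line.startswith(">"):
        return line
    # Step 1: reorder to '>SH_id Genus_species' (like the first sed s///).
    line = HEADER_RE.sub(r">\g<sh> \g<name>", line)
    # Step 2: underscores to spaces on header lines (like the second sed s///g).
    return line.replace("_", " ")


print(rewrite_header(">Acarospora_laqueata|DQ842014|SH191965.07FU|refs|k__Fungi\n"))
# prints: >SH191965.07FU Acarospora laqueata
```

As with the sed command, a header that does not match the expected layout is left in its original order, but its underscores are still replaced.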

Your cat walked on your keyboard?
