Question

How To Remove The Comments From A List Of Fasta Sequences

13.1 years ago

Anima Mundi ★ 2.9k

Hello,

I would like to know how to remove the comments from a list of FASTA sequences. I think awk could provide a good solution, but I have not managed to get it to work for this purpose. All solutions are welcome, but Bash-based ones are preferred.

fasta awk bash • 10.0k views

ADD COMMENT • link updated 13.1 years ago by Frédéric Mahé ★ 3.2k • written 13.1 years ago by Anima Mundi ★ 2.9k

A little bit of history for those who are interested... The FASTA/Pearson sequence format, as described in the FASTA documentation, refers to both the contents of the commonly used header line (starting with '>') and additional comment lines (starting with ';') as "comments". In common usage only the header lines are used, and most programs don't support the comment lines. See the Wikipedia article (http://en.wikipedia.org/wiki/FASTA_format) for a description of the full format.

ADD REPLY • link 13.1 years ago by Hamish ★ 3.3k

To my knowledge, FASTA format doesn't include "comments": it has a header (ID + description), then sequence. Can you give an example of what you want to remove?

ADD REPLY • link 13.1 years ago by Neilfws 49k

Maybe I call "comment" what you call "description", sorry. An example could be:

>mm9knownGeneuc007aet.1 range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none

I would like to remove the part " range=chr1:3205714-3205863 5'pad=0 3'pad=0 strand=- repeatMasking=none". It would also be useful to know how to remove just the parts " 5'pad=0 3'pad=0 strand=- repeatMasking=none" and the ID (but I agree if you would prefer to discuss this in a separate question).

ADD REPLY • link 13.1 years ago by Anima Mundi ★ 2.9k

PS: Thanks for the edit, neilfws: in effect it is very hard to deal with a Phocoenidae using awk :D.

ADD REPLY • link 13.1 years ago by Anima Mundi ★ 2.9k

score 4 · Answer 1 · 2012-02-22

Use sed:

$ curl -s  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=25&rettype=fasta&retmode=text" | head -n 3

>gi|25|emb|X53813.1| Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGC
CATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATT

For example, to keep only the GI number: on lines starting with '>', keep only the number after '>gi|' and print it with the prefix '>gi_'.

$ curl -s  "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=25&rettype=fasta&retmode=text" |\
  sed '/^>/s/^>gi|\([0-9]*\)|.*/>gi_\1/' |head -n 3
>gi_25
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGC
CATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATT
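
The same pattern (an address that selects header lines, then a substitution with a captured group) can probably be adapted to the UCSC-style header quoted in the question. The sketch below is only one reading of that follow-up: it assumes every header looks like ">ID range=... more fields" with single spaces between fields. The function names and the short sequence line are made up for the illustration.

```shell
# Keep the ID and the range field, drop everything after the range.
keep_id_and_range() {
  sed '/^>/ s/^\(>[^ ]* range=[^ ]*\).*/\1/'
}

# Drop the ID too and keep only the value of the range field.
keep_range_only() {
  sed '/^>/ s/^>[^ ]* range=\([^ ]*\).*/>\1/'
}

# Hypothetical sample record: the header comes from the question,
# the sequence line is a placeholder.
sample() {
  printf '%s\n' \
    ">mm9knownGeneuc007aet.1 range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none" \
    "ACGTACGTAC"
}

sample | keep_id_and_range
sample | keep_range_only
```

With this sample, the first call should print the header as ">mm9knownGeneuc007aet.1 range=chr1:3815714-3825863" and the second as ">chr1:3815714-3825863", with the sequence line unchanged in both cases. Headers that do not contain " range=" are left as they are.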

score 4 · Answer 2 · 2012-02-22

13.1 years ago

Frédéric Mahé ★ 3.2k

If you want to remove the description, and your headers are structured like this: "> + id + space + description", then sed can help:

sed -e 's/^\(>[^[:space:]]*\).*/\1/' my.fasta > mymodified.fasta
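
To see what this does without touching a real file, here is a small sketch that feeds the header from the question through the same sed expression. The wrapper function name and the sequence line are placeholders invented for the example.

```shell
# Wrap the sed command above so it can be used in a pipeline.
strip_description() {
  # Keep '>' plus the run of non-whitespace characters that follows it.
  # Lines that do not start with '>' do not match and pass through unchanged.
  sed -e 's/^\(>[^[:space:]]*\).*/\1/'
}

# Header taken from the question; the sequence line is a placeholder.
printf '%s\n' \
  ">mm9knownGeneuc007aet.1 range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none" \
  "ACGTACGTAC" | strip_description
```

This should print ">mm9knownGeneuc007aet.1" followed by the untouched sequence line. Because the pattern is anchored on '>', sequence lines are safe even if they happen to contain whitespace.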

ADD COMMENT • link 13.1 years ago by Frédéric Mahé ★ 3.2k

I chose this answer because it is the one I used, but I also tried the others and they work perfectly. Thanks to you all!

ADD REPLY • link 13.1 years ago by Anima Mundi ★ 2.9k

score 3 · Answer 3 · 2012-02-22

Just to add to the sed comments here, the command I would use is:

sed 's/ .*//' myfile.fasta

If I were to use awk, I'd do

awk '{print $1}' myfile.fasta

Both of these act on every line, not just the headers, so they assume you don't have any spaces in your sequences.

I like to err on the side of simplicity when dealing with regexes - if it's hard for me to read/understand, it's hard for me to make sure it's working correctly. Pierre's solution is certainly easier to adapt to removing/modifying different parts, though.
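
If the sequence lines might contain spaces, one way to drop that assumption is to restrict both commands to lines starting with '>'. This is a minimal sketch rather than a tested pipeline; the function names and the sample record are made up for the illustration.

```shell
# awk variant: print only the first field of header lines,
# and print every other line exactly as it is.
strip_header_awk() {
  awk '/^>/ { print $1; next } { print }' "$@"
}

# sed variant: apply the same substitution, but only on header lines.
strip_header_sed() {
  sed '/^>/ s/ .*//' "$@"
}

# Hypothetical record whose sequence line contains a space.
printf '%s\n' ">seq1 some description" "ACGT ACGT" | strip_header_awk
printf '%s\n' ">seq1 some description" "ACGT ACGT" | strip_header_sed
```

Both calls should print ">seq1" followed by "ACGT ACGT". The functions also accept a file name, for example strip_header_awk myfile.fasta.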