I would like to know how to remove the comments from a list of FASTA sequences. I think awk could provide a good solution, but I have not managed to make it work for this purpose. All solutions are welcome, but Bash-based ones are preferred.
A little bit of history for those who are interested... The FASTA/Pearson sequence format, as described in the FASTA documentation, refers to both the contents of the commonly used header line ('>') and the additional comment lines (starting with ';') as "comments". In common usage only the header lines are used, and most programs don't support the comment lines. See the Wikipedia article (http://en.wikipedia.org/wiki/FASTA_format) for a description of the full format.
To my knowledge, FASTA format doesn't include "comments": it has a header (ID + description), then sequence. Can you give an example of what you want to remove?
Maybe I call "comment" what you call "description", sorry. An example could be: >mm9knownGeneuc007aet.1 range=chr1:3815714-3825863 5'pad=0 3'pad=0 strand=- repeatMasking=none. I would like to remove the part " range=chr1:3205714-3205863 5'pad=0 3'pad=0 strand=- repeatMasking=none". It could be also useful to know how to remove just the parts " 5'pad=0 3'pad=0 strand=- repeatMasking=none" and the ID (but I agree if you prefer to talk about this in a separate question).
$ curl -s "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&id=25&rettype=fasta&retmode=text"|head -n 3
>gi|25|emb|X53813.1| Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGC
CATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATT
For example, to keep only the gi number: on lines starting with '>', keep the word after '>gi|' and print it with the prefix '>gi_'.
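A minimal sketch of that gi-only rewrite, assuming NCBI-style headers of the form ">gi|<number>|..." as in the Blue Whale record above. The function name `keep_gi` is hypothetical and is not taken from the original answer.

```shell
# Hypothetical helper: rewrite ">gi|25|emb|X53813.1| ..." as ">gi_25".
# Assumes the gi number is the second "|"-separated field of the header.
# Lines that do not start with ">gi|" (sequence lines, other headers)
# are printed unchanged.
keep_gi() {
  awk -F'|' '/^>gi[|]/ { print ">gi_" $2; next } { print }' "$@"
}
```

It reads the files given as arguments, or standard input if there are none, for example `keep_gi myfile.fasta > gi_only.fasta`.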
Just to add to the sed comments here, the command I would use is:
sed 's/ .*//' myfile.fasta
If I were to use awk, I'd do
awk '{print $1}' myfile.fasta
Both of these assume you don't have any spaces in your sequences.
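If that assumption is a concern, one option is to restrict the edit to header lines so that sequence lines pass through untouched. The sketch below does this for both tools; the function names are hypothetical, and it assumes the ID is the first space-separated word directly after '>'.

```shell
# Hypothetical helpers: strip the description from header lines only.

# sed: on lines starting with ">", delete everything from the first space on.
strip_desc_sed() {
  sed '/^>/s/ .*//' "$@"
}

# awk: on lines starting with ">", print only the first field;
# every other line is printed as-is.
strip_desc_awk() {
  awk '/^>/ { print $1; next } { print }' "$@"
}
```

Either can be used as `strip_desc_sed myfile.fasta > stripped.fasta`. One difference to keep in mind: awk splits on tabs as well as spaces, while the sed expression only looks for a space.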
I like to err on the side of simplicity when dealing with regexes - if it's hard for me to read/understand, it's hard for me to make sure it's working correctly. Pierre's solution is certainly easier to adapt to removing/modifying different parts, though.
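For the follow-up request in the question (dropping only the " 5'pad=0 3'pad=0 strand=- repeatMasking=none" part), the field-based awk approach can be adapted as in the sketch below. It assumes headers shaped like the example in the question, with the ID as the first space-separated field and "range=..." as the second. The function names are hypothetical, and the second function is only one reading of "and the ID".

```shell
# Hypothetical helpers for headers like:
#   >mm9knownGeneuc007aet.1 range=chr1:3815714-3825863 5'pad=0 3'pad=0 ...

# Keep the ID and the range= field; drop everything after them.
keep_id_range() {
  awk '/^>/ { print $1, $2; next } { print }' "$@"
}

# Keep only the range= field as the new header (drops the ID as well).
keep_range_only() {
  awk '/^>/ { print ">" $2; next } { print }' "$@"
}
```

Using field positions avoids having to quote the apostrophes in "5'pad" inside a sed expression, at the cost of depending on the field order staying fixed.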
PS: Thanks for the edit, neilfws: in effect it is very hard to deal with a Phocoenidae using awk :D.