Edit FASTA header to add organism name after the accession number using perl or sed.
6.5 years ago
MB ▴ 50

I have multiple FASTA files, each containing more than a thousand sequences, with headers like the following:

>KXL50728 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:153:473:1 gene:FE78DRAFT_27124 transcript:KXL50728 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
>KXL50729 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:642:809:1 gene:FE78DRAFT_126205 transcript:KXL50729 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein

I want to edit these headers as follows:

>KXL50728Acidomycesrichmondensis
>KXL50729Acidomycesrichmondensis

Could anybody please tell me how to do this using Perl or sed (sed preferred)?

Perl FASTA sed • 2.4k views

Thanks to all, it worked!


You're welcome.

Please be so kind as to mark all helpful answers as accepted. That way everyone can see that they solved your problem.

fin swimmer


Please use ADD COMMENT or ADD REPLY to respond to previous posts, so that the thread remains logically structured and easy to follow. I have now moved your post, but as you can see the result is not optimal. Adding an answer should only be used to provide a solution to the question asked.

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

6.5 years ago

Hello MB,

try this:

$ sed 's/^\(>\S*\).*/\1Acidomycesrichmondensis/' your.fasta > new.fasta

What we are asking sed to do is: on every line that starts with >, keep everything up to the first whitespace character and replace the rest of the line with Acidomycesrichmondensis. Lines that do not start with > (the sequence lines) are left unchanged.

In the regex:

  • ^ matches the start of the line
  • \(...\) forms a capture group, so we can reuse it later
  • \S* matches as many non-whitespace characters as possible
  • .* matches the rest of the line

In the substitution:

  • \1 prints the first group defined in the regex
  • Acidomycesrichmondensis is written in place of the rest of the line
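
To see the pieces above at work, here is a small sketch that runs the substitution on one header from the question plus a dummy sequence line. One assumption on my side: \S is a GNU sed extension, so the sketch uses the POSIX class [^[:space:]] instead, which should behave the same for this input and also work with non-GNU sed. The variable names and the ATGC line are made up for the demo.

```shell
# One header copied from the question; the sequence line below is a dummy.
header='>KXL50728 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:153:473:1 gene:FE78DRAFT_27124 transcript:KXL50728 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein'

# Same substitution as above, with [^[:space:]] standing in for \S.
# Only the line starting with ">" matches; the sequence line passes through.
posix_out=$(printf '%s\nATGC\n' "$header" |
    sed 's/^\(>[^[:space:]]*\).*/\1Acidomycesrichmondensis/')

printf '%s\n' "$posix_out"
```

The header comes back as >KXL50728Acidomycesrichmondensis and the sequence line is untouched.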

Another way is to use awk:

$ awk -F " " '{if($0 ~ "^>") {print $1"Acidomycesrichmondensis"} else {print $0}}' your.fasta > new.fasta
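
Since the question mentions multiple FASTA files, the same awk idea can be wrapped in a loop. This is only a sketch under a few assumptions: the file names, the .renamed.fasta suffix and the shortened headers are made up for the demo, and I assume every file gets the same organism name. The name is passed in with -v so it does not have to be pasted into the awk program.

```shell
organism='Acidomycesrichmondensis'   # suffix to append; adjust as needed
workdir=$(mktemp -d)                 # scratch directory for this demo only

# Two tiny stand-in FASTA files (hypothetical names, shortened headers).
printf '>KXL50728 pep description:hypothetical protein\nMKV\n' > "$workdir/a.fasta"
printf '>KXL50729 pep description:hypothetical protein\nMLA\n' > "$workdir/b.fasta"

for f in "$workdir"/*.fasta; do
    # Header lines: print the first field followed by the organism name.
    # All other lines: print unchanged.
    awk -v org="$organism" '/^>/ { print $1 org; next } { print }' "$f" \
        > "${f%.fasta}.renamed.fasta"
done

result=$(cat "$workdir"/*.renamed.fasta)
printf '%s\n' "$result"
rm -rf "$workdir"
```

Writing to a new file per input keeps the originals intact, which makes it easy to check the result before replacing anything.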

fin swimmer

6.5 years ago
$ sed  '/>/ s/\s.*/Acidomycesrichmondensis/' test.fa 
$ awk '/>/ {gsub (" .*","Acidomycesrichmondensis", $0)}1' test.fa

input:

$ cat test.fa 
>KXL50728 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:153:473:1 gene:FE78DRAFT_27124 transcript:KXL50728 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
atgc
>KXL50729 pep supercontig:ASM157207v1:5WFSArich_Contig_00366:642:809:1 gene:FE78DRAFT_126205 transcript:KXL50729 gene_biotype:protein_coding transcript_biotype:protein_coding description:hypothetical protein
atgc

output:

$ awk '/>/ {gsub (" .*","Acidomycesrichmondensis", $0)}1' test.fa 
>KXL50728Acidomycesrichmondensis
atgc
>KXL50729Acidomycesrichmondensis
atgc
$ sed  '/>/ s/\s.*/Acidomycesrichmondensis/' test.fa 
>KXL50728Acidomycesrichmondensis
atgc
>KXL50729Acidomycesrichmondensis
atgc