Remove text flanking .. on fasta-headers
5.7 years ago

Hi guys,

I have a multi-fasta like this

>Citrobacter_freundii_D8_6645..17576
gtgatcgtcaagaaggttaagaacccgcagaaggcagca
>Enterobacter_hormaechei_35012_3830..23574
atggacgatagagaaagaggcttagcatttttatttgcaatt

And I would like to eliminate the numbers flanking .., to have an output like this

>Citrobacter_freundii_D8
gtgatcgtcaagaaggttaagaacccgcagaaggcagca
>Enterobacter_hormaechei_35012
atggacgatagagaaagaggcttagcatttttatttgcaatt

Since the numbers vary in length, simply removing a fixed number of characters from the end of each fasta header won't be enough. Thanks!

genome sequence • 1.5k views
5.7 years ago
ATpoint 85k

If the example is representative, then you basically intend to keep the first three elements that are separated by _. If so, do:

awk ' $1 ~ /^>/ { split($0,a,"_"); print a[1]"_"a[2]"_"a[3];next} {print}'

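The command splits every line that starts with > at each _ and then prints the first three fields joined by _ again; all other lines are printed unchanged. Obviously that only works if every fasta header has exactly three underscore-separated fields before the coordinates, like the ones you showed.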
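If some headers have more or fewer than three fields before the coordinates, a variant that removes only the trailing coordinate block may be safer. The following is a minimal sketch, not part of the original answer: the function name strip_coords is made up for illustration, and it assumes every header ends in _<digits>..<digits> with nothing after it (no trailing spaces or carriage returns).

```shell
# strip_coords: hypothetical helper name, not from the thread.
# On header lines (starting with ">"), delete a trailing
# "_<digits>..<digits>" block; every other line passes through unchanged.
strip_coords() {
    awk '/^>/ { sub(/_[0-9]+\.\.[0-9]+$/, "") } { print }' "$@"
}

# Try it on the two records from the question.
printf '%s\n' \
    '>Citrobacter_freundii_D8_6645..17576' \
    'gtgatcgtcaagaaggttaagaacccgcagaaggcagca' \
    '>Enterobacter_hormaechei_35012_3830..23574' \
    'atggacgatagagaaagaggcttagcatttttatttgcaatt' | strip_coords
```

Called with a file name (for example strip_coords seqs.fasta) it reads that file instead of standard input.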

5.7 years ago
$ sed '/>/ s/_[0-9]\+\.\..*$//g' test.fa
>Citrobacter_freundii_D8
gtgatcgtcaagaaggttaagaacccgcagaaggcagca
>Enterobacter_hormaechei_35012
atggacgatagagaaagaggcttagcatttttatttgcaatt
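
The \+ in the pattern above is a GNU extension to basic regular expressions, so it may not behave the same in other sed implementations. A possible portable spelling, sketched here under the assumption that the coordinates always look like _<digits>.. followed by the rest of the header, replaces \+ with an explicit [0-9][0-9]* (the wrapper name strip_coords_sed is hypothetical):

```shell
# strip_coords_sed: hypothetical wrapper name, not from the thread.
# Uses only POSIX basic regular expressions: "[0-9][0-9]*" stands in for
# the GNU-only "[0-9]\+". On header lines, everything from the first
# "_<digits>.." to the end of the line is removed.
strip_coords_sed() {
    sed '/^>/ s/_[0-9][0-9]*\.\..*$//' "$@"
}

# Same two headers as in the question.
printf '%s\n' \
    '>Citrobacter_freundii_D8_6645..17576' \
    '>Enterobacter_hormaechei_35012_3830..23574' | strip_coords_sed
```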

Thanks, saved my day!


You can accept more than one answer, if they all work. Just so you know.

5.7 years ago
Joe 21k

A bash only solution*, for good measure (because I can't help myself):

$ while IFS= read -r l; do echo "${l%_*}"; done < seqs.fasta

*Assumes the coordinates are the last underscore-separated field of every header (${l%_*} cuts from the last _ onwards), and that no sequence line contains an underscore.
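
The loop applies the same expansion to every line, headers and sequence alike. One way to restrict it to header lines, shown as a rough sketch with a made-up function name (strip_last_field), is a case statement around the expansion:

```shell
# strip_last_field: hypothetical helper name, not from the thread.
# Reads fasta records on standard input. Header lines that contain an
# underscore lose everything from the last "_" onwards; all other lines
# are printed untouched.
strip_last_field() {
    # "|| [ -n "$line" ]" also handles a final line without a newline.
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            '>'*_*) printf '%s\n' "${line%_*}" ;;
            *)      printf '%s\n' "$line" ;;
        esac
    done
}

# Example: the header is trimmed, the sequence line is left alone.
printf '%s\n' \
    '>Citrobacter_freundii_D8_6645..17576' \
    'gtgatcgtcaagaaggttaagaacccgcagaaggcagca' | strip_last_field
```

This still assumes the coordinates are the final underscore-separated field of each header.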
