How to add a suffix to duplicated sequence names in a FASTA file
7.2 years ago
horsedog ▴ 60

I have a bunch of genome sequences in a single file named sequence.fasta, but some of them have exactly the same name, like this:

>Rhodobacter_sphaeroides_2.4.1_chromosome_2
ATGAGCTTTCCCCATTTCGCGGCCCTCTTCCGGCCCTCGCAGTTCTTCGGCATCCGCGGCGGCGTCCACCCCGAGACGCG
>Rhodobacter_sphaeroides_2.4.1_chromosome_2
GTGCAGGTGGTGCCGACCCAGTATCCGATGGGCTCGGAGAAGCATCTGGTGAAGATCCTGACCGGGCGCGAGACGCCGGC

Is there a way to detect sequences that share a name and automatically add a suffix so that I can tell them apart? This is what I want:

>Rhodobacter_sphaeroides_2.4.1_chromosome_2.1
ATGAGCTTTCCCCATTTCGCGGCCCTCTTCCGGCCCTCGCAGTTCTTCGGCATCCGCGGCGGCGTCCACCCCGAGACGCG
>Rhodobacter_sphaeroides_2.4.1_chromosome_2.2
GTGCAGGTGGTGCCGACCCAGTATCCGATGGGCTCGGAGAAGCATCTGGTGAAGATCCTGACCGGGCGCGAGACGCCGGC

Sequences that already have unique names should be left unchanged.

Thanks a lot!

sequencing gene • 1.9k views

Each name is preceded by a ">", so a header looks like this:

>Rhodobacter_sphaeroides_2.4.1_chromosome_2


http://bioinf.shenwei.me/seqkit/usage/#rename

seqkit rename seqs.fa > new.fa
7.2 years ago

Linearize the FASTA, sort by name, and number each occurrence of a name:

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' input.fa |
  sort -t $'\t' -k1,1 |
  awk -F '\t' 'BEGIN{N=0;prev="";}{if(prev==$1) { N++;} else {N=1;} printf("%s.%d\n%s\n",$1,N,$2);prev=$1;}'


>1_anotherUniqueGeneName.1
atgc
>1_duplicateName.1
atgc
>1_duplicateName.2
atgc
>1_uniqueGeneName.1
atgc
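
If the records need to stay in their original order and unique names should stay untouched (as the question asks), a two-pass awk is one option. The pipeline above sorts the records and, as its output shows, adds .1 to every name, including unique ones. This is only a minimal sketch: it assumes each header is compared as the whole line (so ">x" and "> x" count as different names) and that the input is a regular file that can be read twice, not a pipe. The function name rename_duplicates is made up for illustration.

```shell
# Hypothetical helper: rename_duplicates INPUT.fasta
# Prints the FASTA to stdout; only headers that occur more than once
# get a .1, .2, ... suffix, and record order is preserved.
rename_duplicates() {
    awk '
        # Pass 1 (first copy of the file): count how often each header line occurs.
        NR == FNR { if (/^>/) total[$0]++; next }
        # Pass 2: number the occurrences of headers that were seen more than once.
        /^>/ && total[$0] > 1 { name = $0; $0 = name "." (++seen[name]) }
        # Print every line of the second pass (renamed or not).
        { print }
    ' "$1" "$1"
}
```

Usage would look like `rename_duplicates sequence.fasta > renamed.fasta`, where renamed.fasta is just an example output name.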

Thank you very much!

7.2 years ago

With BBMap's reformat.sh:

reformat.sh in=file.fa out=fixed.fa uniquenames

That appends "_2", "_3", etc. to the second, third, and later occurrences of a name. The first occurrence of a name is left unchanged.
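
Whichever tool is used, it may be worth checking afterwards that no duplicated headers remain. A small sketch using only grep, sort and uniq; the function name list_duplicate_headers is made up here, and it compares whole header lines, so it assumes the full line after ">" is what needs to be unique.

```shell
# Hypothetical helper: list_duplicate_headers FILE.fasta
# Prints each header line that occurs more than once (one line per name);
# prints nothing when all headers are unique.
list_duplicate_headers() {
    grep '^>' "$1" | sort | uniq -d
}
```

For example, `list_duplicate_headers fixed.fa` should print nothing if the renaming worked.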
