Remove Fasta Sequences with Duplicate IDs (but with different Descriptions) & Append Different Descriptions
8.2 years ago

Hello,

My first post, so I hope I'm posting this in the correct place!

I have ~100k fasta sequences - some with duplicate fasta IDs (they also have identical sequences), but with unique descriptions. I would like to extract unique fasta sequences based on ID (so, remove duplicates, but keep one representative sequence), but also append the description associated with the duplicates.

For example, my fasta file might contain the following 3 sequences:

>Contig1
ATGCGAGTAG

>Contig1 Description1
ATGCGAGTAG

>Contig1 Description2
ATGCGAGTAG

And I'm looking to obtain the following single sequence:

>Contig1 Description1 Description2
ATGCGAGTAG

Thanks for any help :)

RNA-Seq sequence • 3.3k views

I have been trying fasuniq, but it can only concatenate the IDs of duplicated sequences.


While the deduplication part can be achieved with several programs (dedupe.sh from the BBMap suite is one), appending the descriptions to the deduplicated sequence would require a specific solution.

8.2 years ago
baxy ▴ 170

Quick solution under Linux/Perl

perl -ne 'if (/>(.*?)\s+(.*)/){push(@{$hash{$1}},$2) ;}}{open(I, "<","test.fa");while(<I>){if(/>(.*?)\s+/){ $t = 0; next if $h{$1}; $h{$1} = 1 if $hash{$1}; $t = 1; print ">$1 @{$hash{$1}}\n"}elsif($t==1){print $_} } close I;' test.fa

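Here test.fa is your file (note that the file name appears in two places in the command). The \s+ in both patterns matches any whitespace, so a tab between ID and description should work too; adjust the patterns if your headers use a different separator.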


This perfectly did the trick - thank you baxy! Brilliant! Now, I need to go study the code you wrote :)


Any suggestions on how to run this as a loop over hundreds of files?
