Question

Remove Duplicates In Fasta (Protein Seq.)

0


11.3 years ago

IsmailM ▴ 110

Hi, I am trying to remove duplicate entries from a FASTA file containing protein sequences. I have looked at similar posts and tried some of their fixes, but options like the faidx toolkit only work with nucleotide FASTA files, and so far I haven't been able to find something that actually works.

At the moment, I extract the unique seq. IDs with the following script:

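#!/usr/bin/env ruby

# Print the unique, sorted sequence IDs found in a FASTA file.
filename = ARGV.first
text = File.read(filename)
# A header marker followed by the ID (everything up to the first whitespace).
entryid = />\S+/

text.scan(entryid).uniq.sort.each do |id|
    puts id
end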

This removes all duplicate IDs. I then use samtools to extract the corresponding sequences from the original file, which gives me a much smaller file containing each seq ID and its respective sequence.

However, this is too slow for large files: it has been running for 4 hours and still has not finished.

Is there an alternative method that is a little faster?

EDIT:

I run this on data from 5 species, using a batch file to run the scripts. The largest file contains 332,369 sequences, of which only 72,552 are unique.

Secondly, I would be grateful if you could write any scripts in Ruby, since that is the programming language that I can just about get my head around at the moment. I'm a total beginner in programming and bioinformatics.

Many Thanks
Ismail

fasta duplicates • 6.4k views

ADD COMMENT • link updated 11.3 years ago by Rm 8.3k • written 11.3 years ago by IsmailM ▴ 110

0


How many protein entries are there in the file? How big is the file? Do you want to remove duplicates when two entries have the same header, or when they have the same sequence? Personally, I believe a simple Python or Perl script should get you the result within a minute.

ADD REPLY • link 11.3 years ago by Ashutosh Pandey 12k
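
A single-pass script along the lines suggested above could look like the sketch below. It is written in Ruby, since that is what the question asks for, and it is not code from any poster in this thread. It assumes that duplicates are defined by the sequence ID (the header text up to the first whitespace) and that the first record seen for each ID is the one to keep; `dedup_by_id` is a hypothetical helper name.

```ruby
#!/usr/bin/env ruby
require 'set'

# Streams FASTA records from `input` and writes every record whose ID has
# not been seen before to `output`. Returns the number of distinct IDs.
# Assumption: the ID is the header text between '>' and the first whitespace.
def dedup_by_id(input, output)
  seen = Set.new
  keep = false
  input.each_line do |line|
    if line.start_with?('>')
      id = line[1..-1].split.first
      # Set#add? returns nil when the ID is already present.
      keep = !seen.add?(id).nil?
    end
    # Sequence lines follow the decision made for their header.
    output.write(line) if keep
  end
  seen.size
end

# Run only when a file name is given on the command line.
if __FILE__ == $PROGRAM_NAME && ARGV.first
  File.open(ARGV.first) { |fasta| dedup_by_id(fasta, $stdout) }
end
```

It could be run as `ruby dedup_by_id.rb proteins.fasta > unique.fasta`, where both file names are placeholders. The file is read once, line by line, and the unique records are written directly, so no separate extraction step is needed afterwards.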

0


Replied by editing the main post.

ADD REPLY • link 11.3 years ago by IsmailM ▴ 110

0


If you only care about having unique sequences and don't mind losing the IDs, you could run this: sed -n '2~2p' file.fasta | sort -u. If you want to see how often each sequence appeared, replace sort -u with sort | uniq -c.

ADD REPLY • link 11.3 years ago by Matt LaFave ▴ 310
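
One caveat on the one-liner above: the `2~2` address is a GNU sed form that selects every second line starting at line 2, so it only picks out the sequences when each record is exactly one header line followed by one sequence line. A rough Ruby equivalent of the `sort | uniq -c` variant that also copes with wrapped (multi-line) sequences is sketched below; `count_sequences` is a hypothetical helper name, and like the one-liner it discards the IDs.

```ruby
#!/usr/bin/env ruby

# Reads FASTA records from `input` and returns a Hash that maps each
# distinct sequence to the number of records carrying it. The sequence
# lines of a record are joined, so wrapped records are handled.
def count_sequences(input)
  counts = Hash.new(0)
  seq = nil # nil until the first header has been seen
  input.each_line do |line|
    line = line.chomp
    if line.start_with?('>')
      counts[seq] += 1 if seq # close the previous record
      seq = ''.dup
    elsif seq
      seq << line.strip
    end
  end
  counts[seq] += 1 if seq # close the last record
  counts
end

# Run only when a file name is given on the command line.
if __FILE__ == $PROGRAM_NAME && ARGV.first
  counts = File.open(ARGV.first) { |fasta| count_sequences(fasta) }
  counts.each { |sequence, n| puts "#{n}\t#{sequence}" }
end
```

The keys of the returned Hash are the unique sequences, so `counts.keys` corresponds to the `sort -u` output, apart from ordering.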

score 1 · Answer 1 · 2013-07-31

1


11.3 years ago

Rm 8.3k

If the comparison is at the sequence level, you can use CD-HIT or uclust at a given sequence identity cutoff.

ADD COMMENT • link 11.3 years ago by Rm 8.3k
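
For the narrower case of exactly identical sequences, a plain script may be enough. The sketch below is not CD-HIT or uclust and does no clustering below 100% identity; it simply keeps the first record for each distinct sequence. The helper names `each_record` and `dedup_by_sequence` are hypothetical, the comparison is case-sensitive, and the output writes each sequence on a single line rather than preserving the original wrapping.

```ruby
#!/usr/bin/env ruby
require 'set'

# Yields [header, sequence] for each FASTA record in `input`, joining
# wrapped sequence lines into one string.
def each_record(input)
  header = nil
  seq = ''.dup
  input.each_line do |line|
    line = line.chomp
    if line.start_with?('>')
      yield header, seq if header
      header = line
      seq = ''.dup
    elsif header
      seq << line.strip
    end
  end
  yield header, seq if header
end

# Writes the first record seen for each distinct sequence to `output`.
# Returns the number of distinct sequences.
def dedup_by_sequence(input, output)
  seen = Set.new
  each_record(input) do |header, seq|
    next unless seen.add?(seq) # nil means this sequence was already written
    output.puts header, seq
  end
  seen.size
end

# Run only when a file name is given on the command line.
if __FILE__ == $PROGRAM_NAME && ARGV.first
  File.open(ARGV.first) { |fasta| dedup_by_sequence(fasta, $stdout) }
end
```

Unlike the ID-based approach in the question, this treats two records with different IDs but the same residues as duplicates, which matches the "same sequences" reading raised in the comments.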