Hi, I am trying to remove duplicate entries from a FASTA file containing protein sequences. I have looked for similar posts and tried some of their fixes, but options like the faidx toolkit only work with nucleotide FASTA files, and so far I haven't found anything that actually works.
At the moment, I extract the sequence IDs and run the following script:
#!/usr/bin/env ruby
# Print the unique FASTA header IDs found in the input file, sorted.
filename = ARGV.first
text = File.read(filename)
entry_id = />\S+/   # a header line starts with '>' followed by the ID
text.scan(entry_id).uniq.sort.each do |id|
  puts id
end
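I run it along these lines, redirecting the output to a list of IDs (file names here are just placeholders):
ruby uniq_ids.rb proteins.fasta > unique_ids.txt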
This removes all duplicate entries. I then use samtools to extract those sequences from the original file, which gives me a much smaller file containing the sequence IDs and their respective sequences.
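Roughly, that extraction step looks like this (an untested sketch; file names are placeholders, and it assumes unique_ids.txt has the leading > stripped from each ID, since samtools faidx takes bare sequence names):
xargs samtools faidx proteins.fasta < unique_ids.txt > deduped.fasta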
However, this is too slow when running on large files: one run has taken 4 hours and is still going.
Is there an alternative method that is a little faster?
EDIT:
I am running this on data from 5 species, driving the scripts with a batch-type file. The largest file contains 332,369 sequences, of which only 72,552 are unique.
Secondly, I would be grateful if any scripts were written in Ruby, since that is the programming language I can just about get my head around at the moment; I'm a total beginner in programming/bioinformatics.
Many Thanks
Ismail
How many protein entries are there in the file? How big is the file? Do you want to remove duplicates when two entries have the same header, or the same sequence? Personally, I believe a simple Python or Perl script should get you the result within a minute.
replied by editing main post
If you only care about having unique sequences and don't mind losing the IDs, you could run this (GNU sed; note it assumes each record is exactly two lines, i.e. the sequences are not wrapped):
sed -n '2~2p' file.fasta | sort -u
If you want to see how often each sequence appeared, replace sort -u with sort | uniq -c.
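If you do want to keep an ID for each unique sequence, and your sequences may be wrapped over several lines, a streaming Ruby script along these lines should finish in minutes rather than hours. This is only a sketch: it reads the file once, joins each record's sequence lines, and prints the first record seen for each distinct sequence.
#!/usr/bin/env ruby
# Print the first record seen for each distinct sequence in a FASTA file.
require 'set'

seen = Set.new   # sequences already printed
header = nil
seq = ""

# Print the current record if its sequence has not been seen before.
flush = lambda do
  next if header.nil?
  unless seen.include?(seq)
    seen.add(seq)
    puts header
    puts seq     # the sequence comes out as a single unwrapped line
  end
end

File.foreach(ARGV.first) do |line|
  line = line.chomp
  if line.start_with?(">")
    flush.call       # finish the previous record
    header = line
    seq = ""
  else
    seq << line      # join wrapped sequence lines
  end
end
flush.call           # do not forget the last record
If you would rather deduplicate by header ID instead of by sequence, use header as the Set key rather than seq.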