Dear colleagues,
I need your advice. I have a mixture of about a thousand
protein sequences in FASTA format. Their names are
always different, but sometimes their sequences are the same.
I would like to get rid of any repeats automatically.
Is there a simple way to do it? A "seen" hash may help, but I don't know exactly how many hashes I would need to create,
and building an array of such hashes seems too complicated to me.
Is that the only way to do it?
Many thanks for your help!
Well, you need to use the sequence as the "key" in the hash and the sequence_id as the "value", for example:
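#!/usr/bin/perl
use strict;
use warnings;

my %seqs;                        # sequence => list of IDs that share it
$/ = "\n>";                      # read one FASTA record at a time
while (<>) {
    chomp;                       # strip the trailing "\n>" separator
    s/^>//;                      # strip the leading ">" of the first record
    my ($id, @seq) = split /\n/, $_;
    next unless defined $id;     # skip empty records
    my $seq = join "", @seq;     # rejoin wrapped sequence lines
    push @{ $seqs{$seq} }, $id;
}
# Print each distinct sequence once, with all of its IDs joined by commas.
# Hash order is arbitrary, so the output order differs from the input order.
while ( my ($seq, $ids) = each %seqs ) {
    print ">", join(",", @$ids), "\n$seq\n";
}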
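On the question of how many hashes are needed: one is enough. Here is a rough sketch of a variant of the script above that keeps only the first ID seen for each sequence and preserves the input order. The helper name dedup_fasta and the built-in sample records are made up for illustration, and it assumes "the same sequence" means an exact string match after the wrapped lines are joined.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return [id, sequence] pairs for sequences not seen before, in input order.
# dedup_fasta is a hypothetical helper name, not part of any module.
sub dedup_fasta {
    my ($text) = @_;
    my (%seen, @unique);
    # Split on ">" at the start of a line so each chunk is one record.
    for my $record (split /^>/m, $text) {
        my ($id, @lines) = split /\r?\n/, $record;
        next unless defined $id && length $id;   # skip the empty leading chunk
        my $seq = join "", @lines;               # rejoin wrapped sequence lines
        # A single hash is enough: the sequence itself is the key.
        push @unique, [$id, $seq] unless $seen{$seq}++;
    }
    return @unique;
}

# Made-up sample records, used only when no file is given.
my $sample = <<'FASTA';
>seq1
MKTAYIAK
QRQISFVK
>seq2
MKTAYIAKQRQISFVK
>seq3
MSTNPKPQ
FASTA

# Slurp the files named on the command line, or fall back to the sample.
my $text = @ARGV ? do { local $/; <> } : $sample;
print ">$_->[0]\n$_->[1]\n" for dedup_fasta($text // "");
```

Run with a file name as the argument, it prints the deduplicated records to standard output; with no argument it runs on the sample, where seq2 is dropped because it matches seq1 once the lines are joined. Slurping the whole input is fine for about a thousand proteins, but a much larger file would be better read record by record.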
Check the thread "How To Remove The Same Sequences In The Fasta Files?" for ways to remove duplicate sequences.
Many thanks! I am very bad at Python, but this is a good reason to study it better.
I have to try, I have no choice.
Check pierre's post in the above link, where you can just run a command to remove duplicates without writing any program.
Really? That's great! But I don't quite understand which "pierre's post"
you are talking about. It's my dream: just run a command to remove duplicates without writing any program. Please help me find it! A THOUSAND THANKS!
Natasha
:) Click this
I assume that by "repeats" you mean "duplicate sequences" as opposed to sequence repeats.
Yes, exactly. I mean multiple duplicate sequences.