Question

Array of hashes in Perl

10.9 years ago

natasha.sernova ★ 4.0k

Dear colleagues,
I need your advice. I have a mixture of about a thousand protein
sequences in FASTA format. Their names are always different,
but some of the sequences are identical.
I would like to get rid of any repeats automatically.
Is there a simple way to do it? A "seen" hash may help, but I don't know exactly how many hashes I would need to create.
Creating an array of such hashes seems too complicated to me.
Is that the only way to do it?
Many thanks for your help!

sequence • 2.6k views

ADD COMMENT • link updated 10.9 years ago by JC 13k • written 10.9 years ago by natasha.sernova ★ 4.0k

Check the thread "How To Remove The Same Sequences In The Fasta Files?" for ways to remove duplicate sequences.

ADD REPLY • link 10.9 years ago by Prakki Rama ★ 2.7k

Many thanks! I am very bad at Python, but this is a good reason to learn it better.

I have to try, I have no choice.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by natasha.sernova ★ 4.0k

Check Pierre's post in the thread linked above, where you can just run a command to remove duplicates without running any program.

ADD REPLY • link 10.9 years ago by Prakki Rama ★ 2.7k

Really? That's great! But I didn't quite understand which "Pierre's post" you are talking about. That is my dream: to just run a command to remove duplicates without running any program. Please help me find it! THOUSAND THANKS!

Natasha

ADD REPLY • link 10.9 years ago by natasha.sernova ★ 4.0k

:) Click this

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by Prakki Rama ★ 2.7k

I assume that by "repeats" you mean "duplicate sequences" as opposed to sequence repeats.

ADD REPLY • link 10.9 years ago by Neilfws 49k

Yes, exactly. I mean multiple duplicate sequences.

ADD REPLY • link 10.9 years ago by natasha.sernova ★ 4.0k

Ram · Answer 1 · 2014-06-03

10.9 years ago

JC 13k

Well, you only need one hash: use the sequence as the "key" and the sequence_id as the "value", so identical sequences collapse into a single entry. For example:

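#!/usr/bin/perl
# Collapse identical sequences in a FASTA file.
# IDs that share a sequence are joined with commas in one header.
use strict;
use warnings;

my %seqs;      # sequence => comma-separated IDs
my @order;     # sequences in order of first appearance
$/ = "\n>";    # read one FASTA record per iteration
while (<>) {
    chomp;     # drop the trailing "\n>" record separator
    s/^>//;    # the first record still carries its leading ">"
    my ($id, @seq) = split (/\n/, $_);
    next unless defined $id;
    my $seq = join "", @seq;
    if (exists $seqs{$seq}) {
        $seqs{$seq} .= ",$id";
    } else {
        $seqs{$seq} = $id;
        push @order, $seq;
    }
}

for my $seq (@order) {
    print ">$seqs{$seq}\n$seq\n";
}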

then you can run it as:

perl removeDuplicates.pl < original.fasta > unique.fasta
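
The script above joins the IDs of identical sequences into one header. If you would rather keep only the first ID for each sequence, the "seen" hash mentioned in the question is enough on its own. Here is a minimal sketch of that variant; the function name unique_fasta and the sample records are made up for illustration, and it assumes that sequences count as duplicates only when they match exactly (same case, no stray whitespace):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of the script above): takes FASTA text and
# returns it with only the first record of each distinct sequence kept.
sub unique_fasta {
    my ($fasta) = @_;
    my %seen;        # sequence => number of times it has been met so far
    my $out = "";
    # Split into records at every ">" that starts a line.
    for my $record (split /^>/m, $fasta) {
        my ($id, @lines) = split /\n/, $record;
        next unless defined $id && length $id;   # skip the empty leading chunk
        my $seq = join "", @lines;               # re-join wrapped sequence lines
        next if $seen{$seq}++;                   # duplicate sequence: skip it
        $out .= ">$id\n$seq\n";
    }
    return $out;
}

# Made-up sample input: seq1 and seq2 carry the same sequence.
my $sample = <<'END';
>seq1
MKVL
>seq2
MKVL
>seq3
MAAA
END

print unique_fasta($sample);   # prints the seq1 and seq3 records only
```

To run it on a real file, you could replace the sample with slurped input, for example my $fasta = do { local $/; <> }; and then print unique_fasta($fasta). Like the script above, it writes each sequence on a single line.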

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by JC 13k

THANK YOU! I was always very much afraid of such complicated data structures...

That's great!

Natasha

ADD REPLY • link 10.9 years ago by natasha.sernova ★ 4.0k

You're welcome, I'm glad to help.

ADD REPLY • link updated 5.3 years ago by Ram 45k • written 10.9 years ago by JC 13k