array of hashes in perl
10.5 years ago
natasha.sernova ★ 4.0k

Dear colleagues,
I need your advice. I have a mixture of about a thousand protein sequences
in FASTA format. Their names are always different, but their sequences are
sometimes identical. I would like to remove the duplicates automatically.
Is there a simple way to do it? "Seen" hashes may help, but I don't know exactly how many hashes I would need to create.
Building an array of such hashes seems too complicated to me.
Is that the only way to do it?
Many thanks for your help!

sequence • 2.3k views

Check the thread "How To Remove The Same Sequences In The Fasta Files?" for ways to remove duplicate sequences.


Many thanks! I am not good at Python, but this is a good reason to learn it better.

I have to try; I have no choice.


Check Pierre's post in the thread linked above, where you can just run a command to remove duplicates without writing any program.


Really? That's great! But I didn't quite understand which "Pierre's post" you are talking about. It's my dream: just run a command to remove duplicates without running any program. Please help me find it! THOUSAND THANKS!

Natasha


:) Click this


I assume that by "repeats" you mean "duplicate sequences" as opposed to sequence repeats.


Yes, exactly. I mean multiple duplicate sequences.

10.5 years ago
JC 13k

Well, you need to use the sequence as the "key" in the hash and the sequence_id as the "value", for example:

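#!/usr/bin/perl
use strict;
use warnings;

my %seqs;        # sequence => comma-separated IDs that share it
$/ = "\n>";      # read one FASTA record at a time
while (<>) {
    chomp;       # drop the trailing "\n>" record separator
    s/^>//;      # drop the leading ">" of the very first record
    my ($id, @seq) = split (/\n/, $_);
    next unless defined $id;    # skip empty records
    my $seq = join "", @seq;
    # append the ID without leaving a trailing comma in the header
    $seqs{$seq} = exists $seqs{$seq} ? "$seqs{$seq},$id" : $id;
}

# print each distinct sequence once, headed by all of its IDs
while ( my ($seq, $id) = each %seqs) {
    print ">$id\n$seq\n";
}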

Then you can run it as:

perl removeDuplicates.pl < original.fasta > unique.fasta
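
The script above puts all IDs of identical sequences into one header and prints records in hash order, not input order. If you would rather keep only the first ID seen for each sequence and preserve the input order, a single "seen" hash is enough; no array of hashes is needed. Here is a minimal sketch of that variant. The subroutine name dedup_fasta and the three-record sample are made up for illustration and are not from this thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): takes FASTA text and returns
# FASTA text in which each distinct sequence appears once, under its first ID.
sub dedup_fasta {
    my ($text) = @_;
    my %seen;        # the single "seen" hash: sequence => times encountered
    my $out = '';
    # Split into records at every ">" that starts a line.
    for my $record (split /^>/m, $text) {
        next unless $record =~ /\S/;       # skip the empty chunk before the first ">"
        my ($id, @lines) = split /\n/, $record;
        my $seq = join '', @lines;         # rejoin wrapped sequence lines
        next if $seen{$seq}++;             # sequence already emitted: skip it
        $out .= ">$id\n$seq\n";
    }
    return $out;
}

# Made-up sample: seq1 and seq3 carry the same sequence, wrapped differently.
my $example = <<'FASTA';
>seq1
MKV
LLA
>seq2
MKT
>seq3
MKVLLA
FASTA

print dedup_fasta($example);
```

To use it on a real file, the sample could be replaced by the whole of standard input, for instance with my $text = do { local $/; <STDIN> }; before calling dedup_fasta($text). Note that the comparison is exact, so sequences that differ only in letter case or trailing whitespace are treated as different.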

THANK YOU! I was always very much afraid of such complicated data structures...

That's great!

Natasha


You're welcome, I'm glad to help.
