Printing duplicate data in Perl using hash keys
10.2 years ago
tg ▴ 10

Hi all, I have a tab-separated file that looks like this:

protein_id               goterm          product
AG8P000026-PA    GO:0003824    catalytic activity
AG8P001026-PA    GO:0004181    metallocarboxypeptidase activity
AG8P001026-PA    GO:0008233    peptidase activity
AG8P001039-PA    GO:0016787    hydrolase activity
AG7P001036-PA    GO:0004182    carboxypeptidase A activity
AG7P001036-PA    GO:0004180    carboxypeptidase activity
AG7P001040-PA    GO:0022237    metallopeptidase activity

When I store it in a Perl hash keyed on protein_id and print the hash, I get:

AG8P000026-PA    GO:0008233    peptidase activity
AG8P001039-PA    GO:0016787    hydrolase activity
AG7P001036-PA    GO:0004180    carboxypeptidase activity
AG7P001040-PA    GO:0022237    metallopeptidase activity

But I want it to print:

AG8P000026-PA    GO:0003824    catalytic activity
AG8P001026-PA    GO:0004181    metallocarboxypeptidase activity
AG8P001026-PA    GO:0008233    peptidase activity
AG8P001039-PA    GO:0016787    hydrolase activity
AG7P001036-PA    GO:0004182    carboxypeptidase A activity
AG7P001036-PA    GO:0004180    carboxypeptidase activity
AG7P001040-PA    GO:0022237    metallopeptidase activity

How do I modify this code, please?

Perl code:

my $filename ="data";
open(my $INFILE,$filename)|| die("Error in  reading file $filename");  
my %infodata;
while(my $line= <$INFILE> )
{
    chomp $line;
    my ($id,@info)= split /\t/,$line;
    $infodata{$id} =join("\t",@info);      
}
perl duplicates hash • 5.5k views
ADD COMMENT

Why do you think it prints Ben only once?

ADD REPLY

In Perl, a hash keeps each key only once and discards the rest. I am writing a program that is similar to what is posted, and I get similar output.

ADD REPLY

Hello tg!

We believe that this post does not fit the main topic of this site.

This isn't a perl forum.

For this reason we have closed your question. This allows us to keep the site focused on the topics that the community can help with.

If you disagree please tell us why in a reply below, we'll be happy to talk about it.

Cheers!

ADD REPLY

I disagree. Perl is a programming language I use in bioinformatics. What I posted is a model of a real problem I have. It's not fair to do this.

ADD REPLY

Arguing with an administrator about what's fair is about the least effective thing one can do (hint: I'm familiar with the community standards here). If you update your question to at least include a biologically relevant example then it'll be relevant to the site and I'll reopen the post (add a reply too, since I don't get notified when posts are modified).

ADD REPLY

I have put up a sample of what I am working on. Can you please reopen my thread so I can get help from people who care?

ADD REPLY

Yup, it's been reopened.

ADD REPLY

Hi Devon, this is a bioinformatics forum. I guess one is free to ask a question to solve a problem one has doubts about. You say "we believe", but you are acting alone, because I don't believe I have asked a silly question. My program solves a bioinformatics question; I just put a model up. That is a harsh and insensitive thing to do.

ADD REPLY

Hi tg,

I do not know the first revision of your question, but I guess it didn't show the example data with GO terms. Is this correct? If so, the initial decision to close the question was correct. If you have doubts about what kinds of questions you are free to post, then instead of complaining you should take the advice to improve your question. Your question is still only weakly related to bioinformatics, because it is mainly about basic programming (use of a hash) applied to parsing a tab-separated file which by chance contains GO terms. That's good enough to keep your question open. But you still need to improve it and make it more specific, because at the moment your desired output is identical to the input. So why parse at all?

ADD REPLY

The original version dealt with students in different classes and their scores. Hopefully tg will update again with the actual goal, since it's also completely unclear to me why one would use a hash simply to copy and print a TSV's contents.

ADD REPLY

I want to add annotation to my GenBank file. Several rows share the same protein id but carry different annotation details, so only one annotation per protein id ends up being applied. I want to be able to apply all of the annotations for each protein id, but it seems a hash stores only one value per key. That is what the sample data I posted reflects.

ADD REPLY
10.1 years ago
Felix_Sim ▴ 260

Your problem lies in reusing identical keys and assigning different values to them. Basically, what you're doing is similar to the following.

$x = 10;
$x = 11;
$x = 12;
print $x;

which prints 12, because that is the last value assigned.
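
The same thing happens with a hash key. Here is a minimal sketch, assuming two rows from the question's sample are held in a hypothetical in-memory array `@rows` instead of being read from the file:

```perl
use strict;
use warnings;

# Hypothetical two-row sample taken from the question's data;
# both rows share the same protein id.
my @rows = (
    [ 'AG8P001026-PA', "GO:0004181\tmetallocarboxypeptidase activity" ],
    [ 'AG8P001026-PA', "GO:0008233\tpeptidase activity" ],
);

my %infodata;
for my $row (@rows) {
    my ($id, $info) = @$row;
    $infodata{$id} = $info;    # the second assignment replaces the first
}

# Only one key exists, and it holds the last value assigned to it.
my $key_count = scalar keys %infodata;
print "keys: $key_count\n";
print "$_\t$infodata{$_}\n" for keys %infodata;
```

Only the GO:0008233 row survives, which matches the kind of output described in the question.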

To solve your problem you need to consider the following lines:

my ($id,@info)= split /\t/,$line;
$infodata{$id} =join("\t",@info);

With every run of your while loop you will assign a new value to a previously assigned key! In order to solve this you may want to consider using a different key (remember, they have to be unique to avoid the problem you're encountering). Consider this maybe:

my @info = split /\t/, $line;
$infodata{$info[1]} = join("\t", @info);

Instead of using the protein_id as the key, I've used the GO term. It is unique in the data you've provided, though in a real annotation file the same GO term will usually be assigned to several proteins, so check that before relying on it. You can still print the entire line, since that is the value stored under each goterm key.
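
If GO terms turn out not to be unique in the full file, one variation on the same idea (not what the two lines above do) is a composite key built from the protein id and the GO term together. This is only a sketch: the in-memory `@lines` array is a hypothetical stand-in for the file, and it assumes no (protein id, GO term) pair occurs twice.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the input file: three rows from the question.
my @lines = (
    "AG8P001026-PA\tGO:0004181\tmetallocarboxypeptidase activity",
    "AG8P001026-PA\tGO:0008233\tpeptidase activity",
    "AG8P001039-PA\tGO:0016787\thydrolase activity",
);

my %infodata;
for my $line (@lines) {
    my ($id, $goterm, $product) = split /\t/, $line;
    # Composite key: unique as long as no (protein, GO term) pair repeats.
    $infodata{"$id\t$goterm"} = $product;
}

# Hash order is not the input order, so sort the keys for stable output.
for my $key (sort keys %infodata) {
    print "$key\t$infodata{$key}\n";
}
```

Both rows for AG8P001026-PA are kept because their keys differ in the GO term part.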

ADD COMMENT
10.1 years ago
JC 13k

As others have already mentioned, a Perl hash overwrites the stored value each time a duplicate key is assigned. Instead of assigning the content, you can append it to whatever is already stored under that key:

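my $filename = "data";
open(my $INFILE, '<', $filename) || die("Error in reading file $filename");
my %infodata;
while (my $line = <$INFILE>)
{
        # No chomp here: the trailing newline keeps the appended lines separated
        my ($id, @info) = split /\t/, $line;
        # Append the whole line; $_ is not set in this loop, so use $line
        $infodata{$id} .= $line;
}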
ADD COMMENT

Generally speaking, this is often a useful approach but it's not clear what your code is supposed to be doing. Also, I don't think it would make sense in this case because how would you separate the values? It would be easier to just join the values based on some character, or better yet, use an array to hold values for a key.
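
For illustration, here is a rough sketch of the array-per-key idea. The in-memory `@lines` array stands in for the file, and `@order` and `@output` are hypothetical helper names; `@order` is only there because a hash does not remember insertion order.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the input file: rows from the question's sample.
my @lines = (
    "AG8P000026-PA\tGO:0003824\tcatalytic activity",
    "AG8P001026-PA\tGO:0004181\tmetallocarboxypeptidase activity",
    "AG8P001026-PA\tGO:0008233\tpeptidase activity",
    "AG7P001036-PA\tGO:0004182\tcarboxypeptidase A activity",
    "AG7P001036-PA\tGO:0004180\tcarboxypeptidase activity",
);

my %infodata;    # protein_id => array ref of annotation strings
my @order;       # protein ids in first-seen order

for my $line (@lines) {
    my ($id, @info) = split /\t/, $line;
    push @order, $id unless exists $infodata{$id};
    # Autovivifies the array on first use, then appends to it.
    push @{ $infodata{$id} }, join("\t", @info);
}

# Rebuild one output row per stored annotation.
my @output;
for my $id (@order) {
    push @output, "$id\t$_" for @{ $infodata{$id} };
}
print "$_\n" for @output;
```

Each protein id now keeps every annotation it was seen with, so an id with two GO terms produces two rows, in the same order as the input.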

ADD REPLY
