Entering edit mode
6.6 years ago
ArusjakGevorgyan
▴
30
Hi everyone. I trying to measure the frequency of, how often in my data I have synonyms and not synonymous mutations of a certain frequency.
So here is my data(input):
EOG090X0003;96;2;70;G:0.985714;A:0.0142857;G;SYN EOG090X0003;186;2;60;G:0.983333;A:0.0166667;G;SYN EOG090X0003;342;2;86;C:0.976744;T:0.0232558;C;NON
I.m trying to get this output:
Frequency Syn Non 0.01 1 0.015 1 0.02 1 0.025
Problem with my code is that I only get information about the last gene and not for all tre genes. I will be really greatful for all you help.
#!/usr/bin/perl -w
use strict;
use warnings;
use 5.010;
my $file = $ARGV[0];
if ( ! open ( FILE_HANDLE , "<" , $file ) ) {
die "Error can't find the file: $file because $!";
}
if ( ! open ( OUT_HANDLE , ">" , $file . ".csv" ) ) {
die "Error can't make new file file: ${file}.csv because $!";
}
print OUT_HANDLE "FREQ\tSYN\tNON_SYN\n";
my $syn_cnt = 0;
my $non_cnt = 0;
my $stop_cnt = 0;
my $freq;
my $annotation;
my %hashSYN = ();
my %hashNON = ();
my @freqLst = (0.01,0.015,0.02,0.025);
my $key;
while ( my $line = <FILE_HANDLE> ) {
chomp $line;
my @words = split(/\t/, $line);
for my $row (@words){
if ($row =~ m /\?/){
next;
#EOG09371BTU;212;2;94;G:0.989362;A:0.0106383;G;NON
}elsif($row =~ m /^\w+\;\d+\;\d+\;\d+\;\w+\:\d.+\;\w+\:(\d.+)\;\w+\;(\w+)$/){
$freq= $1;
$annotation = $2;
for my $pos (@freqLst){
if($freq > $pos){
$key = $pos;
}else{
if($annotation eq 'SYN'){
$syn_cnt = $hashSYN{$key};
$syn_cnt++;
$hashSYN{$key} = $syn_cnt;
}elsif($annotation eq 'NON'){
$non_cnt = $hashNON{$key};
$non_cnt++;
$hashNON{$key} = $non_cnt;
}
}
}
}
}
}
Just a little re-write of your for loop
You only need a single hash to store your values, if you use a 2-level one you can based on the $annotation count it in the correct category. Also note that I go over the @FreqList in reverse order, so assign the counts to the correct bin (in your case also 0.025 > 0.01 so it will not be counted in the correct bin (== will actually be counted in each bin lower than $freq )
Thank you so much for all your help.
what are the column (field) names? which one is frequency column and there are no genes in your example data. Assuming that 6th column has frequency data, OP output frequency is truncated unequally (0.01 - 2 digits after point vs 0.015 - 3 digits after point). It is also not clear if you want frequency per gene or overall. Instead of frequency, it seems you are simply counting (1) frequency for each category. It is clear that you would like solution in perl, but there are easier CLI programs to count.
Hi, thank you for your time, so here is one of my genes:
Column 1 Col2 Col3 Col4 Col5 . Col6 Col7 Col8 EOG090X0003 ; 96 ; 2 ; 70 ; G:0.985714 ; A:0.0142857 ; G ; SYN
I want to get the frequency in column 6 (0.0142857 ) see if it is between frequency values 0.01-0.015 or 0.015-0.02, so in this case, it is between 0.01-0.015 so I want to add with one (+1) in that specific row with frequency 0.01 under the synonymous column.
Your ranges over lap (0.01-0.015, 0.015-0.02). In your code, there are four frequencies: 0.01,0.015,0.02,0.025. Synonymous classification has two ranges (0.01 and 0.015) and same is the case with non-syn and same is in your expected output in OP. This is confusing.
Example code below in awk (syn range=0.01-<0.015, non-syn range= >=0.015 and <=0.02):
input:
Nice approach but this is OK for a single input gene but I assume she wants to process a whole list of genes (and count the sum over all of them)? So one will need to extend the awk to account for this.
that is true. If OP furnishes sample data that would be helpful.
Thank you for your time and your help.
Is this your complete code/script? I only see a single print statement (at the top of your script).
Yes it is because I'm not done yet since I don't know how to save my counts of nr synonymous and nonsynonymous for a specific frequency row. I'm thinking of saving it in a list and print the list maby?
Are you restricted in using perl or would be other language ok as well?
I would prefer Perl since I'm trying to learn it and I'm new in this field but I can test other languages.
I'm all for PERL since I learned that few decades ago. however as you're 'new' to the field I might suggest (consider) starting with python rather than perl . This as a side not though ;)
Then how do you conclude you only have the data for the last row? Perl-debugger ?
I'm running my script on the terminal and I get the output on the screen, I print the counts and frequency but I get only for the last row instead of all 3 rows.
That I'm puzzled about ... to screen or to file you should do some print statement at the end of your script (after processing the lines) to see anything, no?