Hello all,
This is my first post here, but I will try to explain the programming problem as best as I can.
I have a data set which looks like the following:
NR_046018 DDX11L1 , 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92
NR_047520 LOC643837 , 3 2.2 0.2 0 0 0.28 1 1 1 1 2.2 4.8 5 5.32 5 5 5 5 3
NM_001005484 OR4F5 , 2 2 2 1.68 1 0.48 0 0.92 1 1.8 2 2 2 2.04 3.88 3
NR_028327 LOC100133331 , 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
What is needed
- Shuffle the array 10 times. After _each_ shuffle, divide the array into 2 new arrays, say set1 and set2.
- From each new array, compute the maximum average of each row of numbers.
- Get 10 maximum averages for each of set1 and set2. Compute the average of the 10 maximum averages obtained for each set; let's call these 10avg1 and 10avg2.
- Get a list of 1000 10avg1 and 1000 10avg2.
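To make a single shuffle-and-split step concrete, here is a minimal sketch. The sub names `average_max` and `shuffle_split_average` and the toy rows are made up for illustration, and skipping rows without numbers is an assumption, not something stated above.

```perl
use strict;
use warnings;
use List::Util qw(shuffle max sum);

# Average of the per-row maxima; each row is an array ref of numbers.
# Rows without any numbers are skipped (an assumption for this sketch).
sub average_max {
    my @maxima = map { max(@$_) } grep { @$_ } @_;
    return @maxima ? sum(@maxima) / scalar(@maxima) : undef;
}

# Shuffle the rows once, put the first $n1 rows into set1 and the rest
# into set2, and return the average maximum of each set.
sub shuffle_split_average {
    my ($rows, $n1) = @_;
    my @shuffled = shuffle(@$rows);
    my @set1 = @shuffled[0 .. $n1 - 1];
    my @set2 = @shuffled[$n1 .. $#shuffled];
    return (average_max(@set1), average_max(@set2));
}

# Toy rows standing in for the parsed file
my @rows = ([0, 1, 4.92], [3, 5.32, 5], [2, 3.88, 3], [0, 0, 0]);
my ($avg1, $avg2) = shuffle_split_average(\@rows, 2);
print "set1: $avg1\nset2: $avg2\n";
```

With the real data, `$n1` would presumably be the size of set1 and the remaining rows would form set2.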
Code
use strict;
use warnings;
use List::Util qw(shuffle max);

my $file = 'mergesmall.txt';
open my $fh, '<', $file or die "Unable to open $file: $!";
open my $out, '>', 'Shuffle.out' or die "Unable to open Shuffle.out: $!";
my @arr = <$fh>;
close $fh;

my $i = 10;
while ($i) {
    my @shuffled = shuffle(@arr);
    my @arr1 = @shuffled[0..1];   # First two rows go into the 1st set
    my @arr2 = @shuffled[2..3];   # Next two rows go into the 2nd set

    # Reset the totals on every shuffle so each average stands on its own
    my ($total1, $num1) = (0, 0);
    foreach (@arr1) {
        my @val1 = split;
        my $max1 = max(@val1[3..$#val1]);   # Fields 0-2 are ID, gene name and the comma
        $total1 += $max1;
        $num1++;
    }
    my $average_max1 = $total1 / $num1;
    #print "\n\n","Average max 1st set is : ",$average_max1;
    print $out "Average max 1st set is : ", $average_max1;

    my ($total2, $num2) = (0, 0);
    foreach (@arr2) {
        my @val2 = split;
        my $max2 = max(@val2[3..$#val2]);
        $total2 += $max2;
        $num2++;
    }
    my $average_max2 = $total2 / $num2;
    #print "\n\n","Average max 2nd set is : ",$average_max2;
    print $out "\n", "Average max 2nd set is : ", $average_max2, "\n\n";
    $i--;
}
The Problem
The code I have been able to write so far can get 10 maximum averages for each of set1 and set2. I am not able to figure out how to compute the average of these 10 maximum averages. If I can figure this out, I can easily put a for loop around it to run 1000 times and obtain 1000 10avgset1 and 1000 10avgset2.
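One possible way to get the average of the 10 maximum averages is to push each shuffle's value onto an array and average that array once the 10 shuffles are done. A rough sketch follows. The subs `mean`, `average_max` and `mean_of_shuffles` and the toy rows are hypothetical, and it assumes every row has at least one number.

```perl
use strict;
use warnings;
use List::Util qw(shuffle max sum);

sub mean { return sum(@_) / scalar(@_) }

# Average of the row maxima for a list of rows (array refs of numbers)
sub average_max { return mean(map { max(@$_) } @_) }

# One pair of "10avg" values: shuffle $n_shuffles times, keep each shuffle's
# average maximum per set, then average the kept values.
sub mean_of_shuffles {
    my ($rows, $n1, $n_shuffles) = @_;
    my (@avgs1, @avgs2);
    for (1 .. $n_shuffles) {
        my @shuffled = shuffle(@$rows);
        push @avgs1, average_max(@shuffled[0 .. $n1 - 1]);
        push @avgs2, average_max(@shuffled[$n1 .. $#shuffled]);
    }
    return (mean(@avgs1), mean(@avgs2));
}

# Toy rows standing in for the parsed file
my @rows = ([0, 1, 4.92], [3, 5.32, 5], [2, 3.88, 3], [0, 0, 0]);

# Outer loop: repeat the 10 shuffles 1000 times and keep every result
my (@all1, @all2);
for (1 .. 1000) {
    my ($m1, $m2) = mean_of_shuffles(\@rows, 2, 10);
    push @all1, $m1;
    push @all2, $m2;
}
printf "%d values per set, e.g. %.3f and %.3f\n", scalar(@all1), $all1[0], $all2[0];
```

Keeping the per-shuffle values in arrays, rather than in running totals, also means the individual averages are still available afterwards if they are needed.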
Points to Note
- In the actual data set each row has at most 400 numbers; some rows have fewer and some have none at all, but never more than 400.
- The actual dataset has 41,382 rows. Set1 will comprise 23,558 rows and set2 will comprise 17,824 rows.
- File is a .txt file and all the numbers in each row are tab delimited.
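Since some rows have no numbers at all, taking the maximum of such a row gives an undefined value. A small sketch of parsing that accounts for this is below. The subs `parse_row` and `row_max` and the `NR_000000` row are made up for illustration, and the layout (ID, gene name, a lone comma, then numbers) is assumed from the sample rows.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Split one line into its numeric values. Assumes, as in the sample rows,
# that the first three whitespace-separated fields are the ID, the gene
# name and a lone comma, and that everything after them is a number.
sub parse_row {
    my ($line) = @_;
    my @fields = split ' ', $line;   # splits on tabs and spaces alike
    return [] if @fields <= 3;
    return [ @fields[3 .. $#fields] ];
}

# Maximum of a row, or undef for a row with no numbers
sub row_max {
    my ($values) = @_;
    return @$values ? max(@$values) : undef;
}

my @lines = (
    "NR_046018\tDDX11L1\t,\t0\t1\t4.92\n",
    "NR_000000\tEMPTY\t,\n",    # hypothetical row without any numbers
);
for my $line (@lines) {
    my $max = row_max(parse_row($line));
    print defined $max ? "max = $max\n" : "no numbers, row skipped\n";
}
```

Whether empty rows should be skipped or counted as 0 changes the averages, so that choice would need to be made deliberately for the real data.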
Could you please explain what the application to bioinformatics is? It looks like you are doing some resampling here? Did I get it right that you want to compute the maximum of the averages, not the maximum and the averages?
@Michael Hello Michael! Thank you for your comment. This data is a small part of the ChIP-Seq data for the K562 cell line which I've been given to analyze. Yes, we are doing some resampling here; we are actually trying to generate a control set. And thank you for asking for a clarification, I think I should reframe the question. I need to compute the average maximum over all rows. So, for example, I find the maximum value in NR_046018, which is 4.92 here, similarly for NR_047520 (5.32), and so on for all the rows (23,558 in set1 and 17,824 in set2). Once these maximum values are found, I need to find the average maximum.
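As a worked example of the "average maximum", the four sample rows from the question have maxima 4.92, 5.32, 3.88 and 0, so their average maximum is (4.92 + 5.32 + 3.88 + 0) / 4 = 3.53. A short sketch that reproduces this, assuming the first three fields of each row are the ID, the gene name and the comma:

```perl
use strict;
use warnings;
use List::Util qw(max sum);

my @lines = (
    'NR_046018 DDX11L1 , 0 0 1 1 1 1 1 1 1 1 0 0 0 0 1.44 2.72 3.84 4.92',
    'NR_047520 LOC643837 , 3 2.2 0.2 0 0 0.28 1 1 1 1 2.2 4.8 5 5.32 5 5 5 5 3',
    'NM_001005484 OR4F5 , 2 2 2 1.68 1 0.48 0 0.92 1 1.8 2 2 2 2.04 3.88 3',
    'NR_028327 LOC100133331 , 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0',
);

# Maximum of each row: fields 0-2 are the ID, gene name and comma
my @row_maxima = map {
    my @f = split ' ', $_;
    max(@f[3 .. $#f]);
} @lines;

# Average of those maxima
my $average_max = sum(@row_maxima) / scalar(@row_maxima);
printf "average maximum: %.2f\n", $average_max;
```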
And since we are trying to generate a control set, I have to shuffle the main data (the one which has 41,382 rows; this main dataset was generated by combining two pre-existing datasets, 1 and 2). So for each shuffle, we divide the newly shuffled array into 2 new arrays and compute the average maximum for each of those new sets. We shuffle 10 times, obtaining 10 average maximums for set 1 and similarly for set 2. (I have been able to do it this far.) From these 10 average maximums, I need to find the mean average. This process of 10 shufflings then needs to be repeated 1000 times, so that I have 1000 mean averages. I hope I was able to explain myself a little better...