Hi all,
I am trying to store information for about 2000 individuals from 2 populations, with each individual having about 500,000 markers (SNPs). Unfortunately, Perl runs out of memory after storing about 60 individuals...
First, I store the ~500,000 SNPs for an individual using Bio::PopGen::Individual, doing the following in a for loop of 500,000 iterations:
$ind->add_Genotype(
    Bio::PopGen::Genotype->new(
        -alleles     => [ $A1, $A2 ],
        -marker_name => $rsid{$i},
    )
);
Then I store the individual in a population with Bio::PopGen::Population as follows:
$population{$pop}->add_Individual($ind);
I do this in a "while" loop over each line of my input file (which consists of about 2000 lines, each containing the individual ID, population membership, and two alleles for each of the 500,000 SNPs). Unfortunately, as I said, I get the message "Out of memory!" after having processed only about 60 individuals, despite using "undef" to empty all arrays/hashes that I stop using after the for and while loops.
The reason I am doing this is that I want to calculate Fst using Bio::PopGen::PopStats across all individuals and all markers. I'm sure there is a more memory-efficient way of doing this... Does anyone have any suggestions? Many thanks!
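For reference, the Fst call I had planned once everything is loaded looks roughly like this (an untested sketch based on my reading of Bio::PopGen::PopStats; the population keys "pop1"/"pop2" and @marker_names are just placeholders for my own data):

    use Bio::PopGen::PopStats;

    # Planned Fst calculation over both populations and all markers.
    # @marker_names would hold all ~500,000 rs IDs; $population{pop1} and
    # $population{pop2} are the Bio::PopGen::Population objects built above.
    my $stats = Bio::PopGen::PopStats->new;
    my $fst   = $stats->Fst(
        [ $population{pop1}, $population{pop2} ],   # populations to compare
        \@marker_names,                             # markers to include
    );
    print "Fst = $fst\n";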
Thanks, that's a good suggestion. If I transposed my input file to have one SNP on each row instead of one individual on each row, that might work. Would an overall Fst then simply be the mean of all 500,000 per-SNP Fst values, or is it more complicated than that?
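To sketch what I mean (untested, and the column layout of the transposed file is just an assumption on my part), something like this would only ever hold one SNP's worth of objects in memory at a time:

    use Bio::PopGen::Individual;
    use Bio::PopGen::Genotype;
    use Bio::PopGen::Population;
    use Bio::PopGen::PopStats;

    my $stats = Bio::PopGen::PopStats->new;
    my @per_snp_fst;

    # Assumed transposed layout, one SNP per line:
    #   rsid  pop:ind_id A1 A2  pop:ind_id A1 A2  ...
    # $fh is an already-open handle on the transposed file.
    while (my $line = <$fh>) {
        chomp $line;
        my ($rsid, @fields) = split /\s+/, $line;

        # Small throw-away populations holding only this one marker
        my %pop;
        $pop{$_} = Bio::PopGen::Population->new(-name => $_) for qw(pop1 pop2);

        while (@fields) {
            my ($id, $a1, $a2) = splice @fields, 0, 3;
            my ($popname, $indid) = split /:/, $id;
            my $ind = Bio::PopGen::Individual->new(-unique_id => $indid);
            $ind->add_Genotype(
                Bio::PopGen::Genotype->new(-alleles     => [ $a1, $a2 ],
                                           -marker_name => $rsid)
            );
            $pop{$popname}->add_Individual($ind);
        }

        # Per-marker Fst; the small objects go out of scope on the next iteration
        push @per_snp_fst, $stats->Fst([ @pop{qw(pop1 pop2)} ], [ $rsid ]);
    }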
The mean is the point estimate of the Fst, and you can also calculate the error. You could also use FDIST2, which has some cool functionality.
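Roughly, the point estimate and a naive standard error could be computed like this (a sketch that treats the per-SNP values in @per_snp_fst from the loop above as independent, which is a simplification):

    # Point estimate = mean of per-marker Fst; naive standard error of the mean.
    my $n    = scalar @per_snp_fst;
    my $mean = 0;
    $mean += $_ / $n for @per_snp_fst;

    my $var = 0;
    $var += ($_ - $mean) ** 2 / ($n - 1) for @per_snp_fst;
    my $se  = sqrt($var / $n);

    printf "Fst = %.4f +/- %.4f (SE over %d SNPs)\n", $mean, $se, $n;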
There are a ton of Fst calculations; which one are you using?