Hi there,
I am new to analyzing sequencing data. I have received data from someone; the text file contains one entry for each base in the genome. The file format is
<physical_pos> <genetic_distance>....
For the downstream analysis, I need to load this into memory. However, this makes the program very slow and it runs out of memory. I think I need to use a database solution, but I have never used one before. Would anyone have suggestions on how to set this up, or recommendations for a better way to handle this data instead of using hashes?
Thank you in advance.
Diviya
You might just want to tell us what you're trying to do and show an excerpt of the file so we can advise you. There are probably already solutions for what you want to do.
Yeah I agree with dpryan. You might not have to load it into memory depending on what you want to do.
Hashing is probably not the right solution to begin with. Load the data into a numeric array that is pre-sized to the expected length (e.g. with numpy).
As others have mentioned, you should provide more information to get a reliable answer.
Sorry for the delay. What I'm trying to do is remove some genes plus x cM of flanking region around each gene. I have the physical positions of the start and end of each gene, and based on the genetic positions I want to remove the gene plus the flanking region. I agree that I can load the sites into an array, but that still does not solve the memory problem, and it is still very slow with millions of entries.
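Something like this is what I have in mind (an untested sketch; the file name, the gene list, and the flank size are just placeholders):

    use strict;
    use warnings;

    # Load the map into two parallel arrays: physical position and cM.
    my (@pos, @cm);
    open my $fh, '<', 'map.txt' or die "map.txt: $!";
    while (<$fh>) {
        my ($p, $g) = split;
        push @pos, $p;
        push @cm,  $g;
    }
    close $fh;

    my $flank = 0.1;                  # x cM of flanking region (placeholder)
    my @genes = ( [1000, 2000] );     # [phys_start, phys_end] (placeholder)

    # Collect the physical positions falling inside gene +/- flank (in cM).
    my %remove;
    for my $gene (@genes) {
        my ($s, $e) = @$gene;
        # one line per base, so line index = physical position - 1
        my ($cm_s, $cm_e) = ($cm[$s - 1], $cm[$e - 1]);
        for my $i (0 .. $#pos) {      # a binary search on @cm would be faster
            $remove{ $pos[$i] } = 1
                if $cm[$i] >= $cm_s - $flank && $cm[$i] <= $cm_e + $flank;
        }
    }

Even simplified like this, the two arrays alone blow up the memory once there are hundreds of millions of lines.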
It shouldn't be a problem to fit that and the genome into the memory of any standard computer these days (unless you're dealing with a really large plant genome, I suppose). I guess the question becomes how you're storing it and how much memory you have (though really, my 6 year old laptop could handle that).
I am giving the program 10GB of memory and it still runs out. I am currently doing something very simple:

    while (my $line = <$fh>) {
        push @data, [ split ' ', $line ];   # read each line, store it in an array
    }

However, I think the DB solution would be much better. I am looking into Berkeley DB instead. Not sure if that's the best solution, but from reading some other posts it appears better than just increasing the memory.
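From the docs, I think the Berkeley DB route would look roughly like this with DB_File (untested; file names are placeholders):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie a hash to an on-disk Berkeley DB B-tree keyed by physical position,
    # so lookups don't require holding the whole map in RAM.
    my %map;
    tie %map, 'DB_File', 'map.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "cannot open map.db: $!";

    open my $fh, '<', 'map.txt' or die "map.txt: $!";
    while (<$fh>) {
        my ($p, $g) = split;
        $map{$p} = $g;
    }
    close $fh;

    print "cM at position 1000: $map{1000}\n" if exists $map{1000};
    untie %map;

Building the database once would be slow, but after that the lookups happen on disk instead of in memory.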
Perl? Arrays of scalars in Perl take up much more space than the equivalent structures in other languages (edit: a quick bit of googling suggests 56 bytes per number, which is just crazy). If you really want to use Perl (sorry, I loathe the language), then have a look at PDL, which is sort of like numpy for Perl.
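Something along these lines, for instance (untested sketch; the file name and the cM cutoffs are made up):

    use strict;
    use warnings;
    use PDL;
    use PDL::IO::Misc;   # rcols

    # Read both columns straight into packed numeric piddles:
    # ~8 bytes per value instead of ~56 bytes per Perl scalar.
    my ($pos, $cm) = rcols('map.txt');

    # Vectorised filtering, e.g. keep everything more than 0.1 cM away
    # from a gene whose boundaries map to 12.3-12.9 cM (made-up numbers):
    my $keep     = which( ($cm < 12.3 - 0.1) | ($cm > 12.9 + 0.1) );
    my $kept_pos = $pos->index($keep);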
If you store your data with pack instead of an array of scalars, then only 4 bytes per int will be used (on a 32-bit machine). That aside, I'm certain that reading all the data into an array is not the best approach. OP noted above that they are only interested in "some genes and x cM regions around the genes", so it's likely a standard toolkit for extracting sequence regions (bedtools, for example) could be used, maybe with a little scripting.
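e.g. (a rough sketch; I'm assuming the genetic distances are stored as 8-byte doubles here, use 'l' for 4-byte ints):

    use strict;
    use warnings;

    # Append each genetic distance to one packed byte string:
    # 8 bytes per value ('d' = double) instead of ~56 bytes per scalar.
    my $buf = '';
    open my $fh, '<', 'map.txt' or die "map.txt: $!";
    while (<$fh>) {
        my (undef, $g) = split;
        $buf .= pack('d', $g);
    }
    close $fh;

    # Random access to the i-th value (line i of the file, 0-based):
    my $i  = 12345;
    my $cm = unpack('d', substr($buf, $i * 8, 8));

And since there is one line per base, the line index is the physical position, so the first column doesn't need to be stored at all.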
That would seem reasonable.
the word "slow" is not all that informative - reading 250 million lines and not even doing anything with them will take some time of course - but then simply putting them into an array is not going to make that any slower, nor is storing 250 million numbers such a large amount of memory. taking 4 bytes per number it adds up to something around 800Mb. That's not that much.