Hi there,
I am new to analyzing sequencing data. I have received data from someone; the text file contains one entry for each base in the genome. The file format is
<physical_pos> <genetic_distance>....
For the downstream analysis, I need to load this into memory. However, this makes the program very slow and it runs out of memory. I think I need to use a database solution, but I have never used one before. Would anyone have suggestions on how to set this up, or recommendations for a better way to handle this data instead of using hashes?
Thank you in advance.
Diviya
You might just want to tell us what you're trying to do and show an excerpt of the file so we can advise you. There are probably already solutions for what you want to do.
Yeah I agree with dpryan. You might not have to load it into memory depending on what you want to do.
Hashing is probably not the right solution to begin with. Load the data into a numeric array that is pre-sized to the expected length (e.g. with numpy).
As others have mentioned, you should provide more information to get a reliable answer.
Sorry for the delay. What I'm trying to do is remove some genes plus x cM of flanking region around each gene. I have the physical positions of the start and end of each gene, and based on the genetic positions I want to remove the gene plus the flanking region. I agree that I can load the sites into an array, but that still does not solve the memory problem, and it is still very slow with millions of entries.
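Something like this is what I have in mind (an untested sketch; the file name, the gene list, and the flank size are just placeholders):

    use strict;
    use warnings;

    # Load the map into two parallel arrays: physical position and cM.
    my (@pos, @cm);
    open my $fh, '<', 'map.txt' or die "map.txt: $!";
    while (<$fh>) {
        my ($p, $g) = split;
        push @pos, $p;
        push @cm,  $g;
    }
    close $fh;

    my $flank = 0.1;                  # x cM of flanking region (placeholder)
    my @genes = ( [1000, 2000] );     # [phys_start, phys_end] (placeholder)

    # Collect the physical positions falling inside gene +/- flank (in cM).
    my %remove;
    for my $gene (@genes) {
        my ($s, $e) = @$gene;
        # one line per base, so line index = physical position - 1
        my ($cm_s, $cm_e) = ($cm[$s - 1], $cm[$e - 1]);
        for my $i (0 .. $#pos) {      # a binary search on @cm would be faster
            $remove{ $pos[$i] } = 1
                if $cm[$i] >= $cm_s - $flank && $cm[$i] <= $cm_e + $flank;
        }
    }

Even simplified like this, the two arrays alone blow up the memory once there are hundreds of millions of lines.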
It shouldn't be a problem to fit that and the genome into the memory of any standard computer these days (unless you're dealing with a really large plant genome, I suppose). I guess the question becomes how you're storing it and how much memory you have (though really, my 6 year old laptop could handle that).
I am giving the program 10GB of memory and it still runs out. I am currently doing something very simple:

    while (my $line = <$fh>) {
        push @data, [ split ' ', $line ];   # read each line, store it in an array
    }

However, I think the DB solution would be much better. I am looking into Berkeley DB instead. Not sure if that's the best solution, but from reading some other posts it appears better than just increasing the memory.
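From the docs, I think the Berkeley DB route would look roughly like this with DB_File (untested; file names are placeholders):

    use strict;
    use warnings;
    use DB_File;
    use Fcntl;

    # Tie a hash to an on-disk Berkeley DB B-tree keyed by physical position,
    # so lookups don't require holding the whole map in RAM.
    my %map;
    tie %map, 'DB_File', 'map.db', O_RDWR | O_CREAT, 0666, $DB_BTREE
        or die "cannot open map.db: $!";

    open my $fh, '<', 'map.txt' or die "map.txt: $!";
    while (<$fh>) {
        my ($p, $g) = split;
        $map{$p} = $g;
    }
    close $fh;

    print "cM at position 1000: $map{1000}\n" if exists $map{1000};
    untie %map;

Building the database once would be slow, but after that the lookups happen on disk instead of in memory.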
Perl? Arrays of scalars in Perl take up much more space than the equivalent structures in other languages (edit: a quick bit of googling suggests 56 bytes per number, which is just crazy). If you really want to use Perl (sorry, I loathe the language), then have a look at PDL, which is sort of like numpy for Perl.
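Something along these lines, for instance (untested sketch; the file name and the cM cutoffs are made up):

    use strict;
    use warnings;
    use PDL;
    use PDL::IO::Misc;   # rcols

    # Read both columns straight into packed numeric piddles:
    # ~8 bytes per value instead of ~56 bytes per Perl scalar.
    my ($pos, $cm) = rcols('map.txt');

    # Vectorised filtering, e.g. keep everything more than 0.1 cM away
    # from a gene whose boundaries map to 12.3-12.9 cM (made-up numbers):
    my $keep     = which( ($cm < 12.3 - 0.1) | ($cm > 12.9 + 0.1) );
    my $kept_pos = $pos->index($keep);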
If you store your data with pack instead of an array of scalars, then only 4 bytes per int will be used (on a 32-bit machine). That aside, I'm certain that reading all the data into an array is not the best approach. OP noted above that they are only interested in "some genes and x cM regions around the genes", so it's likely a standard toolkit for extracting sequence regions (bedtools, for example) could be used, maybe with a little scripting.
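e.g. (a rough sketch; I'm assuming the genetic distances are stored as 8-byte doubles here, use 'l' for 4-byte ints):

    use strict;
    use warnings;

    # Append each genetic distance to one packed byte string:
    # 8 bytes per value ('d' = double) instead of ~56 bytes per scalar.
    my $buf = '';
    open my $fh, '<', 'map.txt' or die "map.txt: $!";
    while (<$fh>) {
        my (undef, $g) = split;
        $buf .= pack('d', $g);
    }
    close $fh;

    # Random access to the i-th value (line i of the file, 0-based):
    my $i  = 12345;
    my $cm = unpack('d', substr($buf, $i * 8, 8));

And since there is one line per base, the line index is the physical position, so the first column doesn't need to be stored at all.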
That would seem reasonable.
the word "slow" is not all that informative - reading 250 million lines and not even doing anything with them will take some time of course - but then simply putting them into an array is not going to make that any slower, nor is storing 250 million numbers such a large amount of memory. taking 4 bytes per number it adds up to something around 800Mb. That's not that much.