Fastest way to process fasta file?
6.2 years ago
gewa ▴ 20

Hi, I have a model that takes in a one-hot encoded sequence for a location on the human genome. However, I'm having trouble finding a way to read in the FASTA file with the sequences for the model without being limited by either time or memory. The file covers the whole human genome, so it's quite large. So far, I've used bedtools getfasta to make a FASTA file that has the sequences grouped how I need them (in 100 bp bins). After that, I tried to load that file and pre-generate all the one-hot encodings my model needs, but this makes me run out of memory. Conversely, when I access the FASTA file and generate the one-hot encoding on the fly (as needed for input to the model), performance is quite slow (I expect due to all the file I/O).

Does anyone have any suggestions for how to organize this sequence data or parse this FASTA file quickly (i.e., avoiding both constant file I/O AND loading the entire file into memory)? Any help is much appreciated. Thanks!

fasta sequence • 2.5k views

There is literally no way to avoid both in-memory storage and streaming if you wish to read a large file: you have to do one or the other. You can break it up into smaller files, but that will only help if you can process them in parallel independently, that is, if processing one chunk does not depend on processing a different chunk.

Thanks. I will try to come up with a chunking strategy compatible with my use case.

Can you do it on a chromosome-by-chromosome basis? I.e., load a chromosome into memory, encode as necessary, dump it, and move on to the next chromosome? This would also have the benefit of letting you process them in parallel if you had the resources, as Ram mentioned.
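
A minimal sketch of that loop in Python, using only the standard library. The names `read_fasta`, `one_hot_bins`, `ONE_HOT` and `UNKNOWN` are hypothetical, and mapping N (or any other symbol) to an all-zero row is an assumption rather than something stated in this thread. Because `read_fasta` is a generator, only one record is held in memory at a time.

```python
from typing import Iterator, List, TextIO, Tuple

# Hypothetical lookup: one-hot rows in A, C, G, T order.
ONE_HOT = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}
# Assumption: N and any other symbol become an all-zero row.
UNKNOWN = (0, 0, 0, 0)


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (name, sequence) one record at a time from an open FASTA file."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def one_hot_bins(seq: str, bin_size: int = 100) -> Iterator[Tuple[int, List[tuple]]]:
    """Yield (start, rows) for every full bin of `bin_size` bases."""
    for start in range(0, len(seq) - bin_size + 1, bin_size):
        window = seq[start:start + bin_size].upper()
        yield start, [ONE_HOT.get(base, UNKNOWN) for base in window]
```

A caller would open the file once, loop over `read_fasta(handle)`, feed each bin from `one_hot_bins(seq)` to the model, and let the sequence go out of scope before the next record is read. In practice the inner list comprehension would probably be replaced by a vectorized array lookup.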

Maybe if you encode the sequence as 2bit and load it into memory?
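
To illustrate the idea, here is a rough sketch of packing four bases into each byte. The functions `pack_2bit` and `unpack_2bit` and the code assignment (A=0, C=1, G=2, T=3) are hypothetical choices for this example; this is not the UCSC .2bit file format, and it assumes the sequence contains only A, C, G and T, since two bits leave no room for N.

```python
# Hypothetical code assignment; any fixed mapping of the four bases works.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack_2bit(seq: str) -> bytes:
    """Pack a sequence into 2 bits per base, 4 bases per byte."""
    packed = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq.upper()):
        code = CODES[base]  # raises KeyError for N or any other symbol
        # Base i lives in byte i // 4, at bit offset 2 * (i % 4).
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed)


def unpack_2bit(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from a packed buffer."""
    return "".join(
        BASES[(packed[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(length)
    )
```

At 2 bits per base, roughly 3 billion bases come to well under 1 GB, and a one-hot row can be produced directly from each 2-bit code. The pure-Python loop here only shows the layout; it would be too slow for a whole genome.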

6.2 years ago
sacha ★ 2.4k

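Try a memory map (mmap). It maps your file into virtual memory, and the operating system loads only the pages you actually touch. You can then read data from the mapping as if it were in memory, without exceeding your available RAM.

For example: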
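
A minimal sketch with Python's standard `mmap` module, which is not necessarily what the answer had in mind. It assumes every record is exactly one header line followed by one sequence line; the helpers `build_index`, `fetch` and `demo` are hypothetical names. The file is scanned once to record where each sequence starts, and after that any bin can be sliced out of the mapping without reading the rest of the file.

```python
import mmap
import os
import tempfile


def build_index(path):
    """Scan the file once and return (name, seq_offset, seq_length) per record.

    Assumes strict two-line records: a '>' header line, then one sequence line.
    """
    index = []
    with open(path, "rb") as fh:
        while True:
            header = fh.readline()
            if not header:
                break
            seq_offset = fh.tell()  # byte position where the sequence starts
            seq_line = fh.readline()
            name = header[1:].strip().decode("ascii")
            index.append((name, seq_offset, len(seq_line.rstrip(b"\r\n"))))
    return index


def fetch(mm, entry):
    """Slice one sequence out of the memory map."""
    _, offset, length = entry
    return mm[offset:offset + length].decode("ascii")


def demo():
    """Write a tiny two-record file, map it, and fetch every sequence."""
    with tempfile.NamedTemporaryFile("wb", suffix=".fa", delete=False) as tmp:
        tmp.write(b">chr1:0-4\nACGT\n>chr1:4-8\nTTGA\n")
        path = tmp.name
    try:
        index = build_index(path)
        with open(path, "rb") as fh, \
                mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            return [fetch(mm, entry) for entry in index]
    finally:
        os.remove(path)
```

The index holds only three small values per record, so it is far smaller than the sequences themselves, and repeated access to the same region is served from the operating system's page cache rather than from disk.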
