I want to keep track of how many reads are overlapping each nucleotide in my genome, so that I can quickly get the number of overlapping reads with dict[chromosome][strand][nucleotide].
This is possible with intervaltrees from bx-python, but afair, that module did not work with multiprocessing, and it might not be the fastest for this specific task.
Are there other modules/packages that do this?
You already have a read count for each chromosome/strand/position (a wig/bedgraph file)? How much memory do you have on your machine and how big is your genome? You don't really need to do anything too clever if you have enough hardware.
Thanks. I'm trying to revive and old peak caller algorithm that performed better than macs2 in many benchmarks. Problem is (apart from it being brittly coded) that it uses a lot of comparisons between #reads in nucleotide positions, so I better have to come up with something clever :) I think bx-python/cython and some thinking will save me though.
I don't think you'll get lookups significantly faster than a dict, unless you go with another language. Dicts are pretty well optimized in python since so much depends on them.
I don't see how interval trees will be faster than dict lookups either unless you are looking for range overlaps?
Looking at your post history, you are trying to write a more performant version of SICER? DeepTools (http://deeptools.readthedocs.io/en/latest/index.html) does a lot of these types of genome wide coverage operations. And it is written in python. Maybe you can get some ideas from them.
Nah, it is a 2012 method called triform, the new sicer is already out at https://github.com/endrebak/epic