Question

Efficient Python data structure to store and process annotations

1

Entering edit mode

10.5 years ago

lilla.davim ▴ 160

Hello,

For a set of patients, I need to store in Python a set of annotations corresponding to a (potentially very large) set of positions (quite similar to a VCF file for each patient), and I am looking for a Python data structure that takes minimal space and can efficiently be queried (by position) later on.

I can think of different solutions. For each patient:

A dict where the key is the position and the value is a list containing the annotations in a given order (or another dict);
A list of sublists (or dicts), where each sublist stores the annotations in a given order. An additional lookup table would give the mapping between the genomic position and the index in the list;
A wormtable;
A Pygr structure optimzed for annotations querying, as described here.

And maybe pickling the resulting object?

I am quite new in the field, so could you tell me what would be the preferred solution in terms of (space + time) performance in this case? Or maybe there are better options?

Thanks for your help.

annotations Python • 6.1k views

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by lilla.davim ▴ 160

0

Entering edit mode

Is it possible to have your data, at least initially, in VCF format? Are you using just custom annotations or also annotating with either VEP or snpEff? If you are working with VEP or snpEff annotated VCF files (or can work with them) than GEMINI is a pretty great sqlite3 based database for further annotating and managing variants from a patient cohort.

Of course if you can't do that then are many great answers below for custom solutions.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by DG 7.3k

Ram · Answer 1 · 2014-06-23

How big are the files in terms of disk space? If it is considerably more than your memory size, maybe you should consider a persistent store rather than optimizing data structures. You can spend all your time optimizing data structure, but that is probably not your goal.

If the data is pretty simple, you can go with a popular nosql solution (mongodb) that doesn't take much code to load data. Querying by position won't be as fast as in memory querying, but it will get the job done.

If the data will fit well within your memory, and you are just looking for a data structure for fast querying by position, interval trees are pretty good. bx-python is a mature library for constructing interval trees. I also wrote a very basic interval tree class in python here. Note that the interval tree structure will take up significantly more space than just the raw data size.

Ram · Answer 2 · 2014-06-23

3

Entering edit mode

10.5 years ago

jurgnjn ▴ 70

Have a look at GenomicInterval/GenomicArray/GenomicArrayOfSets in HTSeq. The default storage implementation uses red-black trees for fast lookup (by position/interval), and is pickle-compatible.

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by jurgnjn ▴ 70

Ram · Answer 3 · 2014-06-23

3

Entering edit mode

10.5 years ago

lh3 33k

Two questions:

Is each annotation always associated with one position or with an interval?
Can you put the whole dataset in memory?

Solutions:

Position, in memory: a python dict
Position, on disk: database. Berkeley DB or sqlite. Wormtable is fine, which is built on a Berkeley DB. The "interval, on disk" solutions below also work.
Interval, in memory: pygr may do the work if it keeps the data in memory. Pygr uses a special but proper interval index. bx-python seems good, too. You can also use list plus an auxiliary index as you described in your question - that will look like the linear index in tabix. The implementation is quite simple and probably more efficient than a proper interval tree in most cases.
Interval, on disk: tabix or bigbed, both of which have python bindings. They both work with >100GB data sets in the compressed form. Sqlite with spatial index should also work, but it does not use efficient compression if I am right.

Re HTSeq: It seems that it only works with non-overlapping intervals. I don't think we can implement generic interval/spatial index with one standard red-black tree.

Re relative size of an interval tree: it depends on the implementation and the size of values. The index can be very small wrt data. Tabix essentially uses an interval tree. For a ~100GB compressed multi-sample VCF, the index size is typically <10MB uncompressed.

ADD COMMENT • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by lh3 33k

0

Entering edit mode

The GenomicArrayOfSets class in HTSeq is specifically written for handling overlapping features...

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by jurgnjn ▴ 70

0

Entering edit mode

I was trying to find out how to use a red-black tree to achieve interval queries because I am not aware of any such algorithms. However, in its preprint, red-black is only mentioned once and it is used for indexing wiggle-like numbers. It is not used for interval queries. GenomicArrayOfSets may be able to achieve proper interval queries, but I doubt it is using a red-black tree. If it is, I'd love to learn how that is achieved.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by lh3 33k

0

Entering edit mode

I reread the preprint. I see how HTSeq achieves interval queries now. It is converting overlapping features to non-overlapping intervals. The algorithm is good for thousands of gene annotations and for the purpose of HTSeq, but is not a proper one for generic interval queries. The index will be huge given millions to billions of records.

ADD REPLY • link updated 3.1 years ago by Ram 44k • written 10.5 years ago by lh3 33k

Ram · Answer 4 · 2014-06-23

Assuming that your target set doesn't change often, and that you're not tied to Python (or you can use subprocess or similar Python modules to run command-line tasks) you could pre-process it once and then use some command-line tools to search it quickly:

Convert VCF data to BED.
Compress the BED data with Starch to get better compression performance than bzip2 (which is itself more efficient than gzip).
Use the bedextract utility or bedmap --faster to very quickly query the Starch-compressed target variant dataset, looking for variants in regions of interest.

Starch indexes by chromosome, so the index is a very small part of the compressed archive (generally a few kB even for whole-genome-scale BED inputs); fast queries are accomplished with bedextract or bedmap --faster by a binary search through pre-sorted data. With this approach, you can meet low disk space and system memory requirements, while obtaining fast retrieval times.

The cost is in the up-front conversion and compression time, but if you do a lot of queries, you make up for it pretty quickly.

Ram · Answer 5 · 2014-06-23

I had this problem with the Exome Variant Server annotations:

I've downloaded the EVS data with https://github.com/lindenb/jvarkit/wiki/EVS2Bed it generates a BED+XML file:

1   69427   69428   <snpList><positionString>1:69428</positionString><chrPosition>69428</chrPosition><alleles>G/T</alleles><ua..
1   69475   69476   <snpList><positionString>1:69476</positionString><chrPosition>69476</chrPosition><alleles>C/T</alleles><ua..
1   69495   69496   <snpList><positionString>1:69496</positionString><chrPosition>69496</chrPosition><alleles>A/G</alleles><ua..
1   69510   69511   <snpList><positionString>1:69511</positionString><chrPosition>69511</chrPosition><alleles>G/A</alleles><ua...
1   69589   69590   <snpList><positionString>1:69590</positionString><chrPosition>69590</chrPosition><alleles>A/T</alleles><ua...
1   69593   69594   <snpList><positionString>1:69594</positionString><chrPosition>69594</chrPosition><alleles>C/T</alleles><ua..
1   69619   69620   <snpList><positionString>1:69620</positionString><chrPosition>69620</chrPosition><alleles>T/TA</alleles><u...
1   69744   69745   <snpList><positionString>1:69745</positionString><chrPosition>69745</chrPosition><alleles>CA/C</alleles><u..
1   69760   69761   <snpList><positionString>1:69761</positionString><chrPosition>69761</chrPosition><alleles>T/A</alleles><ua...
1   801942  801943  <snpList><positionString>1:801943</positionString><chrPosition>801943</chrPosition><alleles>T/C</alleles...

I can then access this file with tabix or with custom programs like: https://github.com/lindenb/jvarkit/wiki/VCFTabixml