Hello,
For a set of patients, I need to store in Python the annotations for a (potentially very large) set of positions (quite similar to one VCF file per patient). I am looking for a Python data structure that takes minimal space and can be queried efficiently by position later on.
I can think of different solutions. For each patient:
- A dict where the key is the position and the value is a list containing the annotations in a given order (or another dict);
- A list of sublists (or dicts), where each sublist stores the annotations in a given order. An additional lookup table would give the mapping between the genomic position and the index in the list;
- A wormtable;
- A Pygr structure optimized for annotation queries, as described here.
And maybe pickling the resulting object?
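To make the first two options concrete, here is a minimal sketch of what I have in mind. The toy records, the annotation fields, and the helper names `lookup` and `lookup_range` are made up for illustration, and I assume positions are unique within a patient (or a chromosome):

```python
import bisect
import pickle
from array import array

# Hypothetical toy data for one patient: (position, annotations) pairs.
records = [
    (1050, ("A", "G", 0.12)),
    (204, ("C", "T", 0.50)),
    (98765, ("G", "A", 0.01)),
]

# Option 1: dict keyed by position (fast exact lookup, more memory per entry).
by_pos = {pos: ann for pos, ann in records}

# Option 2: positions in a compact sorted array, annotations in a parallel list.
records_sorted = sorted(records, key=lambda r: r[0])
positions = array("L", (pos for pos, _ in records_sorted))
annotations = [ann for _, ann in records_sorted]


def lookup(pos):
    """Return the annotations at an exact position, or None if absent."""
    i = bisect.bisect_left(positions, pos)
    if i < len(positions) and positions[i] == pos:
        return annotations[i]
    return None


def lookup_range(start, end):
    """Return (position, annotations) pairs with start <= position <= end."""
    lo = bisect.bisect_left(positions, start)
    hi = bisect.bisect_right(positions, end)
    return list(zip(positions[lo:hi], annotations[lo:hi]))


# Pickling the resulting object, as mentioned above.
blob = pickle.dumps((positions, annotations), protocol=pickle.HIGHEST_PROTOCOL)
```

The dict only answers exact-position queries, while the sorted array also allows range queries through binary search, which is why I list it as a separate option.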
I am quite new to the field, so could you tell me which solution would be preferred in terms of (space + time) performance in this case? Or maybe there are better options?
Thanks for your help.
Is it possible to have your data, at least initially, in VCF format? Are you using just custom annotations or also annotating with either VEP or snpEff? If you are working with VEP- or snpEff-annotated VCF files (or can work with them), then GEMINI is a pretty great sqlite3-based database for further annotating and managing variants from a patient cohort.
Of course, if you can't do that, then there are many great answers below for custom solutions.
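If you go the custom route, the sqlite3 idea can still be borrowed directly. The sketch below uses only Python's standard `sqlite3` module; it is not GEMINI's schema or API, and the table layout, column names, sample rows, and the `query_region` helper are all assumptions made up for illustration:

```python
import sqlite3

# In-memory database for the example; pass a file path to persist it on disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE variants (
        patient    TEXT,
        chrom      TEXT,
        pos        INTEGER,
        ref        TEXT,
        alt        TEXT,
        annotation TEXT
    )
    """
)
# The index is what makes lookups by position fast.
conn.execute("CREATE INDEX idx_variants_pos ON variants (patient, chrom, pos)")

# Hypothetical rows standing in for parsed VCF records.
rows = [
    ("P1", "chr1", 204, "C", "T", "missense"),
    ("P1", "chr1", 1050, "A", "G", "synonymous"),
    ("P2", "chr1", 204, "C", "T", "missense"),
]
conn.executemany("INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?)", rows)
conn.commit()


def query_region(conn, patient, chrom, start, end):
    """Return variants for one patient with start <= pos <= end, sorted by pos."""
    cur = conn.execute(
        "SELECT pos, ref, alt, annotation FROM variants "
        "WHERE patient = ? AND chrom = ? AND pos BETWEEN ? AND ? "
        "ORDER BY pos",
        (patient, chrom, start, end),
    )
    return cur.fetchall()
```

Unlike a pickled dict, this keeps the data on disk and only loads the rows a query asks for, which matters once the set of positions gets very large.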