We have developed a new tool called GEMINI (GEnome MINIng) to facilitate the exploration of genetic variation in the context of a wide range of genome annotations that are crucial to interpretation and prioritization. Unlike existing tools, GEMINI integrates genetic variation with a diverse and flexible set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration.
By loading both genetic variants in VCF format and genome annotations into a unified SQLite database, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. GEMINI is well-suited to exploring variation in personal genomes and family-based genetic studies, and it scales to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility, and our goal is to provide researchers with a standard framework for medical genomics.
Source code || Documentation || Manuscript || Overview Presentation || Installation || Mailing list
The GEMINI project was conceived in the Quinlan lab, but it has also benefited from fantastic collaborations with Brad Chapman, Rory Kirchner, and Oliver Hofmann at the Harvard School of Public Health.
To get started with GEMINI, one needs a valid VCF file with coordinates from Build 37 (hg19) of the human genome. We expect that you have annotated your VCF with either snpEff (instructions here) or VEP (instructions here). You then simply load the VCF into GEMINI with the load command. This populates a GEMINI database with the variants and automatically annotates them with all built-in annotations.
# assumes VCF has been annotated by snpEff
$ gemini load -v my.vcf -t snpEff my.gemini.db
One can also provide a PED file to define relationships among samples (useful for finding variants that meet expected inheritance patterns) and to define the sex and disease status of the samples.
$ gemini load -v my.vcf -t snpEff -p my.ped my.gemini.db
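For readers who have not written a PED file before, here is a minimal sketch in Python that renders one for a single trio. It assumes the conventional six PED columns (family, sample, father, mother, sex, phenotype) with the usual coding of 1 = male, 2 = female, 1 = unaffected, 2 = affected, and 0 for an unknown parent. The family and sample names are hypothetical, and the sample IDs would need to match the sample names in your VCF. Please check the GEMINI documentation for the exact delimiter and header conventions it expects.

```python
import csv
import io

# Conventional six PED columns: family, sample, father, mother, sex, phenotype.
# Assumed coding: sex 1=male, 2=female; phenotype 1=unaffected, 2=affected;
# "0" marks an unknown (founder) parent.
# All family and sample names below are hypothetical.
TRIO = [
    ("fam1", "dad", "0", "0", "1", "1"),
    ("fam1", "mom", "0", "0", "2", "1"),
    ("fam1", "kid", "dad", "mom", "1", "2"),
]


def ped_text(rows):
    """Render the rows as tab-delimited PED text, one sample per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Redirect this output to a file such as my.ped before loading.
    print(ped_text(TRIO), end="")
```

The output could be saved as my.ped and passed to the load command with -p, as in the example above.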
Loading is computationally expensive, so the work can be distributed either across multiple CPUs on a single machine:
$ gemini load --cores 8 -v my.vcf -t snpEff my.gemini.db
or across a computing cluster managed by SGE, LSF, or Torque:
# LSF
$ gemini load --cores 128 --lsf-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db
# SGE
$ gemini load --cores 128 --sge-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db
# Torque
$ gemini load --cores 128 --torque-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db
Once loaded, one can begin exploring genetic variation using the "query" interface (see here for more details):
$ gemini query -q "select chrom, start, end, ref, alt from variants \
where is_lof = 1 \
and aaf >= 0.01" my.gemini.db
In particular, see the section on accessing and filtering upon sample genotype information.
For example, to select genotypes for a specific sample (sample1):
$ gemini query -q "select chrom, start, end, ref, alt, gts.sample1 from variants \
where is_lof = 1 \
and aaf >= 0.01" my.gemini.db
One can also apply genotype filters with the --gt-filter option. This will return only those variants that meet the specific genotype criteria you enforce. Here is an example of a filter that enforces an autosomal recessive inheritance pattern. Note that these filters follow Python syntax.
$ gemini query -q "select chrom, start, end, ref, alt, gts.mom, gts.dad, gts.kid from variants \
where is_lof = 1 and aaf >= 0.01" \
--gt-filter "gt_types.dad == HET and gt_types.mom == HET and gt_types.kid == HOM_ALT" \
my.gemini.db
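To make the logic of that filter concrete, here is a small stand-alone Python sketch of the same autosomal recessive test: both parents heterozygous, child homozygous for the alternate allele. This is only an illustration of the condition, not GEMINI's implementation. The genotype labels, the variant records, and the function name are all hypothetical.

```python
# Hypothetical genotype labels for illustration; these are not GEMINI's
# internal genotype codes.
HOM_REF, HET, HOM_ALT = "HOM_REF", "HET", "HOM_ALT"


def is_autosomal_recessive(genotypes):
    """True when both parents are carriers and the child is homozygous alternate."""
    return (
        genotypes["dad"] == HET
        and genotypes["mom"] == HET
        and genotypes["kid"] == HOM_ALT
    )


# Invented example variants, each with one genotype call per family member.
variants = [
    {"id": "var1", "genotypes": {"dad": HET, "mom": HET, "kid": HOM_ALT}},
    {"id": "var2", "genotypes": {"dad": HET, "mom": HOM_REF, "kid": HET}},
    {"id": "var3", "genotypes": {"dad": HOM_REF, "mom": HET, "kid": HOM_ALT}},
]

# Keep only the variants whose trio genotypes fit the recessive pattern.
passing = [v["id"] for v in variants if is_autosomal_recessive(v["genotypes"])]
print(passing)
```

Only the first invented variant passes, because it is the only one where both parents carry a single copy of the alternate allele.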
In addition, there are many built-in tools for conducting common analyses and finding variants that meet inheritance patterns that make sense for the phenotype you are studying. Please see here for more details.
Find de novo variants
$ gemini de_novo my.gemini.db
Find variants meeting an autosomal recessive inheritance pattern
$ gemini autosomal_recessive my.gemini.db
Find variants meeting an autosomal dominant inheritance pattern
$ gemini autosomal_dominant my.gemini.db
Lastly, we see GEMINI as a framework for researchers to develop their own new tools and methods. We see the GEMINI database as the "API"; because SQLite databases are portable, the code you develop with the Python API will work on any GEMINI database.
from gemini import GeminiQuery

# open the database and run a query against the variants table
gq = GeminiQuery("my.db")
gq.run("select chrom, start, end from variants")
# iterate over the result rows
for row in gq:
    print(row)
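Because the database is plain SQLite, variant-level columns can in principle also be read with Python's standard sqlite3 module. The following is a minimal sketch of that idea and not part of GEMINI's own API: it builds a toy in-memory variants table with invented rows and a handful of the column names used in the queries above, then runs the loss-of-function query against it. The helper name find_common_lof is hypothetical, and a real GEMINI database has many more columns than this toy table.

```python
import sqlite3


def find_common_lof(conn, min_aaf=0.01):
    """Return (chrom, start, end, ref, alt) for LoF variants with aaf >= min_aaf."""
    # "end" is quoted because END is also an SQL keyword.
    sql = (
        'SELECT chrom, start, "end", ref, alt FROM variants '
        "WHERE is_lof = 1 AND aaf >= ? ORDER BY chrom, start"
    )
    return conn.execute(sql, (min_aaf,)).fetchall()


# Toy stand-in for a GEMINI database; the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE variants (chrom TEXT, start INTEGER, "end" INTEGER, '
    "ref TEXT, alt TEXT, is_lof INTEGER, aaf REAL)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("chr1", 100, 101, "A", "G", 1, 0.05),   # LoF and common: kept
        ("chr1", 200, 201, "C", "T", 0, 0.20),   # not LoF: dropped
        ("chr2", 300, 301, "G", "A", 1, 0.001),  # LoF but rare: dropped
    ],
)

rows = find_common_lof(conn)
print(rows)
```

To try this against a real database, one would replace the in-memory connection with sqlite3.connect("my.gemini.db"). For genotype-aware queries, the GeminiQuery interface shown above is likely the better route.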
We are constantly adding features, but if there is something you would like to see added, please let us know (preferably using the mailing list).
@Aaronquinlan: Thanks for sharing: looks like an interesting tool; need to try...
I am adding this to our home grown LIMS!
Please let us know if you have any troubles or suggestions.
I am interested in using GEMINI to store and annotate CNVs. However, am I reading the documentation right in that only the most highly affected transcript would be stored? Perhaps I could still store the CNVs and some annotation in the SQLite DB and annotate for gene overlap on the fly...
The impact on other transcripts is stored in the variant_impacts table.
How well do you think GEMINI can handle population-level (1000s of gVCFs) human WES or WGS data?
The input to GEMINI is a single VCF, which can be created by combining your 1000s of gVCFs. Currently, it will perform fairly well for exome studies of 1000s of samples, but not too well for whole-genome studies. That said, we are working on a new version that will easily scale to 1000s of samples for WGS.
Thanks @Aaronquinlan: I will test it with ~8000 combined gVCFs from WES and will update how it goes with the current version.