Tool: GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations
11.6 years ago

We have developed a new tool called GEMINI (GEnome MINIng) to facilitate the exploration of genetic variation in the context of a wide range of genome annotations that are crucial to interpretation and prioritization. Unlike existing tools, GEMINI integrates genetic variation with a diverse and flexible set of genome annotations (e.g., dbSNP, ENCODE, UCSC, ClinVar, KEGG) into a unified database to facilitate interpretation and data exploration.

By loading both genetic variants in VCF format and genome annotations into a unified SQLite database, GEMINI allows researchers to compose complex queries based on sample genotypes, inheritance patterns, and both pre-installed and custom genome annotations. GEMINI also provides methods for ad hoc queries and data exploration, a simple programming interface for custom analyses that leverage the underlying database, and both command line and graphical tools for common analyses. GEMINI is well-suited to exploring variation in personal genomes and family-based genetic studies, and it scales to studies involving thousands of human samples. GEMINI is designed for reproducibility and flexibility, and our goal is to provide researchers with a standard framework for medical genomics.

Source code || Documentation || Manuscript || Overview Presentation || Installation || Mailing list

The GEMINI project was conceived in the Quinlan lab, but it has also benefited from fantastic collaborations with Brad Chapman, Rory Kirchner, and Oliver Hofmann at the Harvard School of Public Health.

To get started with GEMINI, one needs a valid VCF file with coordinates based on Build 37 (hg19) of the human genome. We expect that you have annotated the VCF with either snpEff (instructions here) or VEP (instructions here). You then simply load the VCF into GEMINI with the load command. This populates a GEMINI database with the variants and automatically annotates them with all built-in annotations.

# assumes VCF has been annotated by snpEff
$ gemini load -v my.vcf -t snpEff my.gemini.db

One can also provide a PED file to define relationships among samples (useful for finding variants that meet expected inheritance patterns) and to define the sex and disease status of the samples.

$ gemini load -v my.vcf -t snpEff -p my.ped my.gemini.db
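For illustration, a PED file for a trio could be generated with a few lines of Python. This is only a minimal sketch: it assumes the common six-column PED layout (family ID, sample ID, paternal ID, maternal ID, sex, phenotype) with sex coded as 1 = male, 2 = female and phenotype coded as 1 = unaffected, 2 = affected. The family and sample names are hypothetical and would need to match the sample names in your VCF.

```python
# Sketch: build the text of a six-column PED file for one trio.
# The column order and codes follow the common PED convention (an assumption
# here); the family and sample names are made up for illustration.

def ped_line(family, sample, father, mother, sex, phenotype):
    """Return one tab-separated PED record."""
    return "\t".join([family, sample, father, mother, str(sex), str(phenotype)])


def trio_ped(family="fam1", dad="dad", mom="mom", kid="kid"):
    """Return PED text for two unaffected parents and an affected male child."""
    lines = [
        ped_line(family, dad, "0", "0", 1, 1),  # founder: parents coded as 0
        ped_line(family, mom, "0", "0", 2, 1),  # founder: parents coded as 0
        ped_line(family, kid, dad, mom, 1, 2),  # affected child
    ]
    return "\n".join(lines) + "\n"


ped_text = trio_ped()
print(ped_text, end="")
```

Writing ped_text to my.ped would give a file that can be passed to the load command with -p, as in the command above.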

Loading is computationally expensive, so the work can be distributed either among multiple CPUs on a single machine:

$ gemini load --cores 8 -v my.vcf -t snpEff my.gemini.db

or across a computing cluster that uses SGE, LSF, or Torque:

# LSF
$ gemini load --cores 128 --lsf-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db

# SGE
$ gemini load --cores 128 --sge-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db

# Torque
$ gemini load --cores 128 --torque-queue my_bigbad_queue -v my.vcf -t snpEff my.gemini.db

Once loaded, one can begin exploring genetic variation using the "query" interface (see here for more details):

$ gemini query -q "select chrom, start, end, ref, alt from variants \
                  where is_lof = 1 \
                  and aaf >= 0.01" my.gemini.db

In particular, see the section on accessing and filtering upon sample genotype information.

For example, to select genotypes for a specific sample (sample1):

$ gemini query -q "select chrom, start, end, ref, alt, gts.sample1 from variants \
                  where is_lof = 1 \
                  and aaf >= 0.01" my.gemini.db
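To make the semantics of the variant-level filter (is_lof = 1 and aaf >= 0.01) concrete, the same SQL can be run against a toy table with Python's standard sqlite3 module. This is a sketch under stated assumptions: the in-memory table is a stand-in that holds only the columns named in the queries above, the rows are made up, and the column types are guesses rather than GEMINI's actual schema.

```python
import sqlite3

# Toy stand-in for a GEMINI database: an in-memory SQLite table with only the
# columns used in the example query. Rows and column types are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE variants (chrom TEXT, start INTEGER, "end" INTEGER, '
    "ref TEXT, alt TEXT, is_lof INTEGER, aaf REAL)"
)
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("chr1", 100, 101, "A", "G", 1, 0.05),   # LoF and aaf >= 0.01: kept
        ("chr1", 200, 201, "C", "T", 1, 0.001),  # LoF but too rare: dropped
        ("chr2", 300, 301, "G", "A", 0, 0.20),   # not LoF: dropped
    ],
)

# Same filter as the query above; "end" is quoted because END is an SQL keyword.
rows = conn.execute(
    'SELECT chrom, start, "end", ref, alt FROM variants '
    "WHERE is_lof = 1 AND aaf >= 0.01"
).fetchall()
conn.close()

for row in rows:
    print("\t".join(str(field) for field in row))
```

Only the first toy variant satisfies both conditions, so it is the only row returned.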

One can also apply genotype filters with the --gt-filter option. This returns only those variants that meet the genotype criteria you specify. Here is an example of a filter that enforces an autosomal recessive inheritance pattern. Note that these filters follow Python syntax.

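# gts.* columns hold the genotype alleles; gt_types.* columns hold the genotype type (HOM_REF, HET, HOM_ALT)
$ gemini query -q "select chrom, start, end, ref, alt, gts.mom, gts.dad, gts.kid from variants \
                            where is_lof = 1 and aaf >= 0.01" \
               --gt-filter "gt_types.dad == HET and gt_types.mom == HET and gt_types.kid == HOM_ALT" \
               my.gemini.db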
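Because the genotype filter is written in Python syntax, its logic can be tried out in plain Python before running it on a real database. The following is a rough sketch, not GEMINI's implementation: the integer codes for HOM_REF, HET, and HOM_ALT are placeholders chosen for this example, and the variants and sample names are made up.

```python
# Placeholder genotype-type codes for this sketch (not taken from GEMINI).
HOM_REF, HET, HOM_ALT = 0, 1, 2

# Made-up variants with one genotype type per sample.
variants = [
    {"chrom": "chr1", "start": 100, "dad": HET, "mom": HET, "kid": HOM_ALT},
    {"chrom": "chr1", "start": 200, "dad": HET, "mom": HOM_REF, "kid": HET},
    {"chrom": "chr2", "start": 300, "dad": HOM_REF, "mom": HOM_REF, "kid": HET},
]


def is_recessive(v):
    """Both parents are carriers and the child is homozygous for the alt allele."""
    return v["dad"] == HET and v["mom"] == HET and v["kid"] == HOM_ALT


# Keep only the variants that satisfy the recessive pattern.
recessive = [v for v in variants if is_recessive(v)]
for v in recessive:
    print(v["chrom"], v["start"])
```

Only the first toy variant passes, since it is the only one where both parents are heterozygous and the child is homozygous for the alternate allele.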

In addition, there are many built-in tools for conducting common analyses and finding variants that meet inheritance patterns that make sense for the phenotype you are studying. Please see here for more details.

Find de novo variants

$ gemini de_novo my.gemini.db

Find variants meeting an autosomal recessive inheritance pattern

$ gemini autosomal_recessive my.gemini.db

Find variants meeting an autosomal dominant inheritance pattern

$ gemini autosomal_dominant my.gemini.db

Lastly, we see GEMINI as a framework for researchers to develop their own new tools and methods. We see the GEMINI database as the "API", and given that SQLite databases are portable, the code you develop based upon the Python API will work on any GEMINI database.

from gemini import GeminiQuery

# Open an existing GEMINI database
gq = GeminiQuery("my.db")

# Run a query, then iterate over the returned rows
gq.run("select chrom, start, end from variants")
for row in gq:
    print(row)
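Since the database is an ordinary SQLite file, it should also be readable with Python's standard sqlite3 module, which can be handy on machines where GEMINI itself is not installed. The sketch below rests on that assumption: iter_rows is a hypothetical helper (not part of GEMINI), and the small temporary file with made-up rows stands in for a real GEMINI database.

```python
import os
import sqlite3
import tempfile


def iter_rows(db_path, sql):
    """Yield each result row of `sql` as a dict keyed by column name."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        for row in conn.execute(sql):
            yield dict(row)
    finally:
        conn.close()


# Build a tiny stand-in database file (made-up rows, three columns only).
tmp_dir = tempfile.mkdtemp()
db_path = os.path.join(tmp_dir, "toy.db")
conn = sqlite3.connect(db_path)
conn.execute('CREATE TABLE variants (chrom TEXT, start INTEGER, "end" INTEGER)')
conn.executemany(
    "INSERT INTO variants VALUES (?, ?, ?)",
    [("chr1", 100, 101), ("chr2", 300, 301)],
)
conn.commit()
conn.close()

# Reopen the file by path, as one would with a database copied from elsewhere.
rows = list(iter_rows(db_path, 'SELECT chrom, start, "end" FROM variants ORDER BY start'))
for row in rows:
    print(row)

# Remove the temporary stand-in database.
os.remove(db_path)
os.rmdir(tmp_dir)
```

For real analyses the GeminiQuery interface shown above remains the intended route; this only illustrates why a portable SQLite file makes the database itself a usable interface.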

We are constantly adding features. If there is something you would like to see added, please let us know (preferably using the mailing list).

vcf database genome variation • 9.3k views

@Aaronquinlan: Thanks for sharing. It looks like an interesting tool; I need to try it.


I am adding this to our home grown LIMS!


Please let us know if you have any troubles or suggestions.


I am interested in using GEMINI to store and annotate CNVs. However, am I reading the documentation right in that only the most highly affected transcript would be stored? Perhaps I could still store the CNVs and some annotation in the SQLite DB and annotate for gene overlap on the fly...


The impact on other transcripts is stored in the variant_impacts table.


How well do you think GEMINI can handle population-level (1000s of gVCFs) human WES or WGS data?


The input to GEMINI is a single VCF, which can be created by combining your 1000s of gVCFs. Currently, it will perform fairly well for exome studies of 1000s of samples, but not as well for whole genomes. That said, we are working on a new version that will easily scale to 1000s of samples for WGS.


Thanks @Aaronquinlan: I will test it with ~8000 combined gVCFs from WES and will update how it goes with the current version.

11.4 years ago

The manuscript for GEMINI is available at: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003153

There is also a high-level description of it in this video: http://www.youtube.com/watch?feature=player_embedded&v=p-UWmDG6yj4


Hi Sir,

Can you please tell me how I can call the GEMINI API from a Windows machine, starting from scratch, e.g. the API for the gemini load command and the API for querying the database?

Thanks,
Nilesh


Please don't ask questions in the comment section of a post. Post a new question if you have one.

11.6 years ago

Just an idea: there is a user-friendly query language called HTSQL that might be well suited to opening up GEMINI to less technically inclined people.

HTSQL is a comprehensive navigational query language for relational databases.


Looks interesting, though it seems geared towards making SQL less "hard". I see SQL as one of the most intuitive languages around...most biologists that I know who lack programming skills find SQL easy to understand. Have you seen otherwise?


Simple selects have an easy syntax that would be hard to improve on. But once you have joins and grouping, it gets very unforgiving and mistakes are hard to spot. I have quite a hard time building these myself if I haven't used SQL in a while. Usually I feel that I need to retrain myself after a few months of not doing SQL.

Compare the two, the HTSQL:

/department{name, max(course.credits)}

versus direct SQL:

SELECT "department"."name",
       "course"."max"
FROM "ad"."department"
     LEFT OUTER JOIN (SELECT MAX("course"."credits") AS "max",
                             "course"."department_code"
                      FROM "ad"."course"
                      GROUP BY 2) AS "course"
                     ON ("department"."code" = "course"."department_code")
ORDER BY "department"."code" ASC
LIMIT 10000
11.6 years ago

Are there any plans to abstract the job queueing interface with something like DRMAA? The *-queue flags seem a bit redundant.

http://www.drmaa.org/


Thanks. I was unaware of this. We are currently using IPython-parallel to handle the distributed computing. DRMAA may be an option but I need to spend some time reading up on it.

11.2 years ago

I wonder if there are plans to support coverage (if it is not already supported somehow), like Chanjo does:

https://chanjo.readthedocs.org/en/latest/


Chanjo looks very interesting...we will look into it.

7.2 years ago

Is there any way to make GEMINI work with hg38?
