Question

Which reference genome for the annotation of the 1KG variant data by snpEff?

6.6 years ago

Nam Le Quang ▴ 70

Hello,

I downloaded the 1KG data from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/. Then I used snpEff v4.3t to annotate the variants with two different reference databases, GRCh37.75 and hg19. I got two full statistics files:

hg19 reference genome

GRCh37.75 reference genome

As you can see, the "Number of effects by type and region" statistics are very different for the two reference genomes. I am unsure which reference genome to use. Any help would be appreciated.

Thank you very much!

Edited: specified the "Number of effects by type and region" statistics.
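For readers who want to reproduce the comparison, the two annotation runs described above could be scripted roughly as follows. This is only a sketch: the jar location, the memory setting, and the input and output file names are placeholders I made up, not the paths used in the original runs. The `-v` and `-stats` options and the positional database name follow the usual snpEff command line.

```python
from typing import Dict, List


def snpeff_command(database: str, vcf_path: str, stats_path: str,
                   jar_path: str = "snpEff.jar") -> List[str]:
    """Build the argument list for one snpEff annotation run.

    jar_path, vcf_path and stats_path are hypothetical placeholders.
    """
    return [
        "java", "-Xmx8g", "-jar", jar_path,
        "-v",                  # verbose progress messages
        "-stats", stats_path,  # HTML summary file for this run
        database,              # e.g. "GRCh37.75" or "hg19"
        vcf_path,
    ]


# One command per database, each with its own statistics file.
commands: Dict[str, List[str]] = {
    db: snpeff_command(db, "input.vcf", f"stats_{db}.html")
    for db in ("GRCh37.75", "hg19")
}

for db, argv in commands.items():
    print(db, "->", " ".join(argv))
```

snpEff writes the annotated VCF to standard output, so each argument list would normally be passed to `subprocess.run` with `stdout` redirected to a separate output file per database.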

reference genome snpEff variant annotation • 2.3k views

My bad!

I did not read the manual carefully. Using UCSC's "hg19" genomes can create consistency problems. The author suggests using ENSEMBL's GRCh versions instead:

UCSC genomes provide only major release version, but NOT sub-versions. E.g. UCSC's "hg19" has major version 19 but there is no "sub-version", whereas ENSEMBL's GRCh37.70 clearly has major version 37 and minor version 70. Not providing a minor version means that they might change the database and two "hg19" genomes might actually be different. This creates all sorts of consistency problems (e.g. the annotations may not be the same that you see in the UCSC genome browser, even though both of them are 'hg19' version). Using UCSC genome tables is highly discouraged, we recommend you use ENSEMBL versions instead.

ADD REPLY • link 6.6 years ago by Nam Le Quang ▴ 70

score 1 · Answer 1 · 2019-05-13

6.6 years ago

finswimmer 16k

Hello Nam Le Quang,

Which part of the statistics do you find "very different"? There are some differences in how the variants are assigned to regions (UTR, intron, etc.), but overall I find the two quite similar.

The difference between the "hg19" and "GRCh37" databases in snpEff is the set of transcripts used for annotation. "hg19" uses transcripts provided by UCSC and "GRCh37" uses transcripts provided by Ensembl. Which source you use is your decision. My personal favorite would be Ensembl.

fin swimmer
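To judge how different the two statistics really are, it may help to compare the share of each effect type rather than the raw counts, since the two databases contain different transcript sets and the totals will differ. The following is a minimal sketch under that assumption. The function names are my own, and the numbers in the usage lines are made up for illustration; they are not taken from either run.

```python
from typing import Dict


def effect_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Convert raw effect counts into fractions of the total."""
    total = sum(counts.values())
    if total == 0:
        return {effect: 0.0 for effect in counts}
    return {effect: n / total for effect, n in counts.items()}


def compare_effects(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, float]:
    """Per-effect difference in share (a minus b), in percentage points.

    Effects missing from one run are treated as a share of zero.
    """
    shares_a, shares_b = effect_shares(a), effect_shares(b)
    effects = sorted(set(a) | set(b))
    return {
        e: 100.0 * (shares_a.get(e, 0.0) - shares_b.get(e, 0.0))
        for e in effects
    }


# Made-up toy counts, only to show the call; replace with the counts
# from the two snpEff statistics files.
run_a = {"intron_variant": 600, "missense_variant": 300, "stop_gained": 100}
run_b = {"intron_variant": 1500, "missense_variant": 500}

for effect, diff in compare_effects(run_a, run_b).items():
    print(f"{effect}: {diff:+.1f} percentage points")
```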

Thanks, finswimmer!

The "Number of effects by type and region" parts of the statistics are very different. I want to keep variants that affect protein function or the regulation of gene expression, so I considered these effects (using GRCh37.75 as the reference database):

5_prime_UTR_premature_start_codon_gain_variant (109,216 variants)
TFBS_ablation (240 variants)
TF_binding_site_variant (110,349 variants)
bidirectional_gene_fusion (650 variants)
conservative_inframe_deletion (2,266 variants)
conservative_inframe_insertion (1,158 variants)
disruptive_inframe_deletion (4,110 variants)
disruptive_inframe_insertion (1,479 variants)
exon_loss_variant (20 variants)
frameshift_variant (8,174 variants)
gene_fusion (290 variants)
initiator_codon_variant (483 variants)
inversion (593 variants)
missense_variant (1,843,093 variants)
non_canonical_start_codon (8 variants)
protein_protein_contact (5,311 variants)
rare_amino_acid_variant (1 variant)
splice_acceptor_variant (18,100 variants)
splice_donor_variant (24,102 variants)
start_lost (4,187 variants)
stop_gained (35,532 variants)
stop_lost (2,169 variants)
structural_interaction_variant (234,979 variants)
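Keeping only variants that carry one of the effects listed above could be sketched as below. This assumes the annotated VCF uses snpEff's standard `ANN` INFO field, where each comma-separated annotation is `|`-delimited, the second sub-field is the effect, and combined effects are joined with `&`. `KEEP_EFFECTS` shows only a few terms from the list for brevity, and the function names are hypothetical.

```python
from typing import Iterable, Iterator, Set

# A subset of the effect terms listed above; extend as needed.
KEEP_EFFECTS: Set[str] = {
    "missense_variant",
    "frameshift_variant",
    "stop_gained",
    "stop_lost",
    "start_lost",
    "splice_acceptor_variant",
    "splice_donor_variant",
}


def effects_in_record(vcf_line: str) -> Set[str]:
    """Collect all effect terms from the ANN field of one VCF data line."""
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return set()
    effects: Set[str] = set()
    for entry in fields[7].split(";"):        # column 8 is INFO
        if not entry.startswith("ANN="):
            continue
        for annotation in entry[4:].split(","):  # one per allele/transcript
            parts = annotation.split("|")
            if len(parts) > 1:
                effects.update(parts[1].split("&"))  # "a&b" -> {"a", "b"}
    return effects


def filter_vcf(lines: Iterable[str],
               keep: Set[str] = KEEP_EFFECTS) -> Iterator[str]:
    """Yield header lines plus data lines with at least one wanted effect."""
    for line in lines:
        if line.startswith("#") or effects_in_record(line) & keep:
            yield line
```

A typical use would be to open the annotated VCF, pass the file object to `filter_vcf`, and write the yielded lines to a new file. Because a variant is kept as soon as any one transcript gives a wanted effect, the result depends on the transcript set of the chosen database.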

ADD REPLY • link 6.6 years ago by Nam Le Quang ▴ 70