Question

How does one compare allele frequencies of sequenced cases to database controls (e.g. 1000 Genomes)?

9.2 years ago

morgen ▴ 10

Hi,

I understand that comparing my own sequence data with unselected controls from an ancestry-matched population, such as those examined by the 1000 Genomes and HapMap projects, can indicate whether allele frequencies differ, provided that the disease is rare and these populations can be treated as unselected controls. Does anyone have a recommendation on how to carry out this comparison? At the moment I have allele frequencies and functional predictions (e.g. synonymous or nonsynonymous). When should one decide to sequence a set of controls? What considerations should one keep in mind when comparing case data against publicly available control data, as opposed to one's own genotyped controls?

Many thanks,

Silvia

complex-disorders case-control genetic-disorders • 4.3k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.2 years ago by morgen ▴ 10

Hi Silvia,

Can you elaborate on your question? Do your data come from SNP arrays or exome sequencing?

ADD REPLY • link 9.2 years ago by reza.jabal ▴ 580

Hi Reza,

Thank you for your reply. I genotype my cases by Sanger sequencing, so I can identify both rare and common variants in the coding regions of a gene of interest.

ADD REPLY • link 9.2 years ago by morgen ▴ 10

Ram · Answer 1 · 2015-10-15

Silvia, as you noticed, we cannot use database frequencies as our control measure because doing so gives rise to the "population stratification" problem. Individuals genotyped in those large databases are not truly representative of your study population. In addition, you can never be sure that those individuals were disease free, so you cannot establish a true association with your disease/phenotype of interest. That said, you can still take advantage of large databases to narrow your region of interest down to a few polymorphisms/SNVs in the gene and type only those regions in your controls, which saves a great deal of sequencing cost and effort (especially if your gene is big!). To this end, you may want to:

Compile a list of pathogenic variants in the gene that underlie your disease of interest. For this, you can query ClinVar or LOVD.
Next, you may want to add rare variants (MAF < 1%) to your list and shortlist the deleterious ones according to their functional impact (e.g. a mutation in a conserved domain that impairs an essential residue of a key TF/signaling element). You can do this by browsing rare variants in the Exome Variant Server (EVS) or the Exome Aggregation Consortium (ExAC) browser. In EVS you can judge the deleteriousness of a variant by its degree of conservation (GERP and PhyloP scores) or by the severity of the amino acid change (Grantham score).
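As a rough illustration of the shortlisting step, here is a minimal Python sketch. It assumes you have already exported the variants into simple records yourself; the `Variant` fields, the variant names, and the GERP cutoff of 2.0 are all hypothetical choices for illustration and do not reflect the export format of EVS or ExAC. Only the 1% MAF cutoff comes from the text above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Variant:
    # Hypothetical record layout; adapt to whatever your export looks like.
    name: str
    maf: float              # minor allele frequency in the reference database (0..1)
    gerp: Optional[float]   # GERP conservation score, None if not available
    consequence: str        # e.g. "missense", "synonymous"


def shortlist(variants: List[Variant],
              max_maf: float = 0.01,
              min_gerp: float = 2.0) -> List[Variant]:
    """Keep rare, non-synonymous variants at conserved positions."""
    kept = []
    for v in variants:
        if v.maf >= max_maf:
            continue  # not rare (MAF < 1% by default)
        if v.consequence == "synonymous":
            continue  # no amino acid change
        if v.gerp is None or v.gerp < min_gerp:
            continue  # no evidence of conservation (illustrative cutoff)
        kept.append(v)
    return kept


# Made-up example records, not real variants.
variants = [
    Variant("var_A", 0.002, 4.8, "missense"),
    Variant("var_B", 0.150, 5.1, "missense"),
    Variant("var_C", 0.004, 0.3, "missense"),
    Variant("var_D", 0.001, 3.9, "synonymous"),
]

print([v.name for v in shortlist(variants)])
```

The cutoffs are parameters on purpose, since the right thresholds depend on the gene and the disease model.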

Bear in mind that these details only give you an idea about the locus and do not necessarily point you to the causative variant. You might also want to do some intra-population analysis, for which I believe you will find the Arlequin software handy.

I am also enclosing some useful papers that I hope will guide you thoroughly in the field:

Ram · Answer 2 · 2015-10-14

OK Silvia, as your data come from Sanger sequencing, I assume you're only considering a small portion of the genome (as opposed to exome/whole-genome sequencing). In that case I think a simple genetic association study would be a practical approach (although there are numerous alternatives):

Select "tag SNPs" in your region of interest and identify representative haplotypes in your population for that region; this can easily be done in PLINK.
Next, if you can work out haplotype frequencies in a control group (a disease-free cohort from a similar population), you will be able to perform a statistical analysis.
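The answer does not say which test to use. One common choice for a single allele or haplotype is Fisher's exact test on a 2x2 table of chromosome counts (carrying vs. not carrying, cases vs. controls). The sketch below implements the two-sided test with the Python standard library only; the function names and the counts in the example are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows: cases / controls. Columns: chromosomes carrying / not carrying
    the allele (or haplotype) of interest.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x carriers in row 1, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are at most as likely as the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(p, 1.0)


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio; infinite if the denominator is zero."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


# Made-up counts: 100 cases (200 chromosomes, 12 carrying the allele)
# and 500 controls (1000 chromosomes, 20 carrying the allele).
cases_alt, cases_ref = 12, 188
ctrl_alt, ctrl_ref = 20, 980

print("odds ratio:", odds_ratio(cases_alt, cases_ref, ctrl_alt, ctrl_ref))
print("p-value:   ", fisher_exact_two_sided(cases_alt, cases_ref, ctrl_alt, ctrl_ref))
```

Counting chromosomes rather than individuals assumes the two alleles of each person are independent, and testing several variants calls for multiple-testing correction.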

The problem with using large databases such as 1000 Genomes, ExAC, ESP or HapMap as your control reference is that you sometimes encounter population- or ethnicity-specific polymorphisms that are quite common in your study population but are flagged as rare in those databases. It is therefore always essential to have a control group.
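If you do have even a small set of local controls, one simple sanity check is to look for variants that the database calls rare but that are common locally. The sketch below assumes two plain dictionaries of allele frequencies keyed by variant name. The variant names, the frequencies, and the 5% "common" cutoff are hypothetical; the 1% "rare" cutoff follows the MAF < 1% convention used in this thread.

```python
from typing import Dict, List


def flag_population_specific(local_af: Dict[str, float],
                             database_af: Dict[str, float],
                             rare: float = 0.01,
                             common: float = 0.05) -> List[str]:
    """Return variants that are rare in the database but common locally."""
    flagged = []
    for variant, db_freq in database_af.items():
        local_freq = local_af.get(variant)
        if local_freq is None:
            continue  # not observed or not typed in the local controls
        if db_freq < rare and local_freq >= common:
            flagged.append(variant)
    return flagged


# Made-up allele frequencies for illustration only.
database_af = {"var_1": 0.004, "var_2": 0.006, "var_3": 0.200}
local_control_af = {"var_1": 0.003, "var_2": 0.110, "var_3": 0.180}

print(flag_population_specific(local_control_af, database_af))
```

A flagged variant would look enriched in cases if compared against the database alone, so local control frequencies are the safer reference for it.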

This paper from the 1000 Genomes Project Consortium provides good insight into the inter- and intra-population variability of the human genome: http://www.nature.com/nature/journal/v526/n7571/full/nature15393.html