Question

Compare genotypes between WGS and Targeted panel

7.0 years ago

mpinsach • 0

Hi all,

I am trying to compare the genotypes between two human cohorts: one sequenced by whole genome sequencing (50X) and another one sequenced using a custom panel (250X). I performed a t-distributed stochastic neighbor embedding (t-SNE) analysis and the two populations look perfectly clustered in two different groups.

I suspect the separation might be a technical artefact of the two different sequencing approaches (WGS and targeted sequencing).

The DNA was sequenced on an Illumina platform, the SNVs were called with GATK HaplotypeCaller, and the calls were recalibrated for both populations. However, the mean number of variants per sample is higher in the targeted-sequencing cohort.

I created a 0/1 matrix for absence/presence of a variant at each genomic position reported in the VCF files from the WGS and targeted cohorts, as shown in the example below:

                Sample1  Sample2  Sample3
chr3:37428076   0        1        0

I created the final SNV list by adding the coordinates specific to each cohort to the other cohort, so that both cohorts have the same set of coordinates.
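
For illustration, the matrix described above could be built along these lines in Python. This is a minimal sketch of the idea, not the exact procedure used: it assumes each sample's calls have already been reduced to a set of `chrom:pos` labels, and the names `presence_matrix` and `_pos_key` as well as the second position `chr3:37428100` are hypothetical.

```python
def _pos_key(label):
    # Sort "chrom:pos" labels by chromosome name, then numeric position.
    chrom, pos = label.split(":")
    return chrom, int(pos)


def presence_matrix(calls):
    """calls: dict mapping sample name -> set of 'chrom:pos' labels with a variant.

    Returns the sorted sample names and a dict mapping each position in the
    union of all samples to a list of 0/1 flags, one per sample.
    Note: 0 only means "no variant reported in the VCF for this sample".
    """
    samples = sorted(calls)
    positions = sorted(set().union(*calls.values()), key=_pos_key)
    rows = {
        pos: [1 if pos in calls[sample] else 0 for sample in samples]
        for pos in positions
    }
    return samples, rows


# Hypothetical input: chr3:37428100 is made up for the example.
calls = {
    "Sample1": {"chr3:37428100"},
    "Sample2": {"chr3:37428076"},
    "Sample3": {"chr3:37428100"},
}
samples, matrix = presence_matrix(calls)
```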

Does anyone know how to perform this kind of comparison?

Thank you.

genome SNP • 1.4k views

updated 7.0 years ago by Kevin Blighe 88k • written 7.0 years ago by mpinsach • 0

score 0 · Answer 1 · 2018-01-09

7.0 years ago

Kevin Blighe 88k

You should:

1. Filter the datasets so that only common variants are included.
2. Normalise the VCFs / BCFs (bcftools norm -m-any).
3. Merge everything together.
4. Read the data into PLINK and check the samples against 1000 Genomes (see "Produce PCA bi-plot for 1000 Genomes Phase III in VCF format").
5. Run the comparisons in PLINK (e.g. logistic regression).

I do not know anything about your sample numbers, disease state, or ethnicity, so I cannot recommend specific tests.
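
The first two steps of the list above can be sketched in Python roughly as follows. This assumes that "common variants" means sites present in both datasets, and it works on sites only: `bcftools norm -m-any` also rewrites the per-sample genotype fields when it splits a multiallelic record, which this sketch ignores. The function name `biallelic_keys` and all coordinates and alleles are made up for illustration.

```python
def biallelic_keys(records):
    """Turn (chrom, pos, ref, alt) records into one key per ALT allele.

    A record whose ALT field lists several comma-separated alleles is split
    into several biallelic keys, so that sites can be compared across files.
    """
    keys = set()
    for chrom, pos, ref, alt in records:
        for allele in alt.split(","):
            keys.add((chrom, pos, ref, allele))
    return keys


# Hypothetical site lists for the two cohorts.
wgs_sites = [("chr1", 1000, "A", "G,T"), ("chr1", 2000, "C", "T")]
panel_sites = [("chr1", 1000, "A", "G"), ("chr1", 3000, "G", "A")]

# Keep only the sites (and alleles) seen in both cohorts.
shared_sites = biallelic_keys(wgs_sites) & biallelic_keys(panel_sites)
```

Without the split, the multiallelic WGS record `A > G,T` would not match the panel record `A > G`, even though both describe the same allele.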

Dear Kevin,

Thank you so much for your reply. I waited to write back until I had tried your suggestions myself.

I followed all your suggestions, as well as your post "Produce PCA bi-plot for 1000 Genomes Phase III in VCF format", but I got stuck after pruning the variants from each 1000 Genomes chromosome. I also don't know how to merge my cohorts file with the 1000 Genomes data so that they can be compared in PLINK.

Regarding the sample specifics, the WGS cohort is composed of 200 healthy individuals, while the targeted sequencing cohort is composed of 91 individuals with cardiac disease. Both cohorts are Caucasian, although one comes from America and the other from Spain.

Thank you.

7.0 years ago by mpinsach • 0

Would Spanish be considered Caucasian or Hispanic? The idea of merging with 1000 Genomes is specifically to gauge the influence of ethnicity in your cohort. Without correcting for ethnicity, you may find false associations.

In that case, you should first merge your 2 datasets together, and then merge the result with 1000 Genomes.
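
The merge order can be sketched in Python as below. It assumes the genotypes have already been loaded into dictionaries keyed by site, and it keeps only the sites present in both inputs at each step; that restriction is an assumption of this sketch, not a statement about PLINK's default merge behaviour. The function `merge_on_shared_sites`, the sample IDs, the sites, and the genotypes are all hypothetical.

```python
def merge_on_shared_sites(left, right):
    """Merge two genotype tables of the form site -> {sample: genotype}.

    Only sites present in both tables are kept, and the sample columns of
    the two tables are combined. Duplicate sample IDs raise an error.
    """
    merged = {}
    for site in left.keys() & right.keys():
        clash = left[site].keys() & right[site].keys()
        if clash:
            raise ValueError(f"sample IDs appear in both inputs: {sorted(clash)}")
        merged[site] = {**left[site], **right[site]}
    return merged


# Hypothetical sites and genotypes.
site_a = ("chr1", 1000, "A", "G")
site_b = ("chr1", 2000, "C", "T")
wgs = {site_a: {"WGS_001": "0/1"}, site_b: {"WGS_001": "0/0"}}
panel = {site_a: {"PANEL_001": "1/1"}}
reference = {site_a: {"REF_001": "0/0"}, site_b: {"REF_001": "0/1"}}

# Step 1: merge the two cohorts. Step 2: merge the result with the reference panel.
cohorts = merge_on_shared_sites(wgs, panel)
combined = merge_on_shared_sites(cohorts, reference)
```

Here `site_b` drops out at the first step because the panel has no call there, so only `site_a` reaches the final table.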

Are you receiving any error message?

7.0 years ago by Kevin Blighe 88k

The clinical information I received for the Spanish individuals lists their ethnicity as Caucasian.

For my two cohorts I did the following:

1. Filter the datasets so that only common variants are included. I did this with GATK, but I first had to remove multiallelic sites.
2. Merge everything together with vcf-merge.

Then I followed the instructions from your post "Produce PCA bi-plot for 1000 Genomes Phase III in VCF format", but I don't know at which step I should merge the 1000 Genomes data with my merged cohorts, or how I should do it.

7.0 years ago by mpinsach • 0