Hi all,
I am trying to compare the genotypes between two human cohorts: one sequenced by whole genome sequencing (50X) and another one sequenced using a custom panel (250X). I performed a t-distributed stochastic neighbor embedding (t-SNE) analysis and the two populations look perfectly clustered in two different groups.
I suspect the separation might be due to the use of different technologies (WGS and targeted sequencing).
The DNA was sequenced on an Illumina platform, and the SNVs were called with GATK HaplotypeCaller and recalibrated for both populations. However, the mean number of variants per sample is higher in the targeted sequencing cohort.
I created a matrix of 0/1 for absence/presence of variants in each genomic position reported on the VCF file from the WGS and Target cohorts, as shown in the example below:
              Sample1  Sample2  Sample3
chr3:37428076       0        1        0
I created the final SNV list by adding the coordinates specific to each cohort to the other cohort, so that both cohorts have the same set of coordinates.
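The matrix construction described above can be sketched roughly as follows. This is only an illustration in plain Python, not the exact code used: the function name `presence_absence_matrix`, the input layout (a dict of sample name to a set of called positions), and the second coordinate are assumptions made for the example.

```python
def presence_absence_matrix(calls_by_sample):
    """Build a 0/1 matrix over the union of positions seen in any sample.

    calls_by_sample: dict mapping sample name -> set of (chrom, pos) tuples.
    Returns (positions, samples, rows) with one row of 0/1 values per position.
    """
    samples = sorted(calls_by_sample)
    # Union of all coordinates, so every sample is scored on the same positions.
    positions = sorted(set().union(*calls_by_sample.values()))
    rows = [
        [1 if pos in calls_by_sample[sample] else 0 for sample in samples]
        for pos in positions
    ]
    return positions, samples, rows


# Toy input: chr3:37428076 is the position from the example above,
# chr3:37500000 is a made-up coordinate added for illustration.
calls = {
    "Sample1": {("chr3", 37500000)},
    "Sample2": {("chr3", 37428076)},
    "Sample3": {("chr3", 37500000)},
}

positions, samples, rows = presence_absence_matrix(calls)
for (chrom, pos), row in zip(positions, rows):
    print(f"{chrom}:{pos}", *row)
```

Note that in a matrix built this way a 0 only means "no variant reported in the VCF", so a position outside the panel's target regions is coded the same as a covered position with the reference genotype.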
Does anyone know how to perform this kind of comparison?
Thank you.
Dear Kevin,
thank you so much for your reply. I waited to write you back until I tried your suggestions myself.
I followed all your suggestions as well as your post Produce PCA bi-plot for 1000 Genomes Phase III in VCF format [1], but I got stuck after pruning the variants from each 1000 Genomes chromosome. I also don't know how to merge my cohort file with the 1000 Genomes data so that they can be compared in PLINK.
Regarding the sample specifics, the WGS cohort is composed of 200 healthy individuals, while the targeted sequencing cohort is composed of 91 individuals with cardiac disease. Both cohorts are Caucasian, although one comes from America and the other from Spain.
Thank you.
Would Spanish be considered Caucasian or Hispanic? The idea of merging with 1000 Genomes is specifically to gauge the influence of ethnicity in your cohort. Without correcting for ethnicity, you may find false associations.
You should, in that case, merge your 2 datasets together, and then merge with 1000 Genomes.
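Before merging, it can help to check how many positions the two datasets actually share, since a targeted panel covers far fewer sites than WGS. A minimal sketch, assuming both datasets have been converted to PLINK binary format and that the `.bim` files follow the standard six-column layout (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2); the function names and the toy records below are made up for illustration:

```python
def read_bim_positions(lines):
    """Collect (chromosome, base-pair position) keys from PLINK .bim lines."""
    keys = set()
    for line in lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        chrom, _variant_id, _cm, bp = fields[:4]
        keys.add((chrom, int(bp)))
    return keys


def shared_positions(bim_a_lines, bim_b_lines):
    """Return the sorted positions present in both .bim files."""
    return sorted(read_bim_positions(bim_a_lines) & read_bim_positions(bim_b_lines))


# Toy .bim records (hypothetical IDs and alleles); with real data, pass
# open("cohortA.bim") and open("cohortB.bim") instead of these lists.
bim_a = ["3 varA 0 37428076 A G", "3 varB 0 37500000 C T"]
bim_b = ["3 . 0 37428076 A G", "7 varC 0 100 C T"]

shared = shared_positions(bim_a, bim_b)
print(len(shared), "shared positions")
```

The comparison only works if both files use the same chromosome naming (for example "3" in both, rather than "3" in one and "chr3" in the other) and the same reference genome build.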
Are you receiving any error message?
The clinical information I received for the Spanish individuals lists their ethnicity as Caucasian.
For my two cohorts I did the following:
Then I followed the instructions from your post "Produce PCA bi-plot for 1000 Genomes Phase III in VCF format", but I don't know at which step I should merge the 1000 Genomes data with my merged cohorts, or how to do it.