Dear All,
I have performed variant calling for 24 samples using the GATK pipeline and generated a single VCF containing all 24 samples. I need some clarification on the following points.
1) If I generate a separate VCF file for each of the 24 samples individually, and also generate a single VCF file containing all 24 samples:
- Are there any differences between them in the output VCF?
- If yes, what are the differences?
The reason I am asking is that I have both family-level and symptom-level information for these 24 samples.
Family-level information for the 24 samples:
FamilyA : Sample1, Sample2, Sample3
FamilyB : Sample4, Sample5, Sample6
….
FamilyH : Sample22, Sample23, Sample24
Symptom-level information for the 24 samples:
Joint pain : Sample1, Sample4, Sample14, Sample15, Sample16, Sample17
Bleeding : Sample2, Sample5, Sample6
Symptom X : …..
For instance:
- I would like to know whether the samples grouped together above share any genetic variants. In other words, are there 'secondary' variants elsewhere in the exome (other than in the X gene) that are common among patients who suffer from the same symptoms?
- I want to find the common variants for the bleeding symptom. Do the common variants differ between case 1 and case 2?
case 1: I compare the individual VCF files (sample2.vcf, sample5.vcf and sample6.vcf) and filter for the common variants.
case 2: I extract just Sample2, Sample5 and Sample6 from the single VCF file with all 24 samples in it.
- As in the above example, I would like to find common variants at the family level as well.
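To make the second case concrete, here is a minimal Python sketch of picking out the variants shared by one symptom group from a multi-sample VCF. It is only an illustration under some assumptions: the VCF is tab-delimited with a GT field, "shared" is taken to mean that every sample in the group carries at least one non-reference allele, and the function name `shared_variants` and the demo records are made up for the example.

```python
def shared_variants(vcf_lines, group):
    """Return (chrom, pos, ref, alt) for sites where every sample in
    `group` carries at least one non-reference allele."""
    shared = []
    columns = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Map sample name -> column index (sample columns start at index 9).
            columns = {name: i for i, name in enumerate(fields) if i >= 9}
            continue
        gt_index = fields[8].split(":").index("GT")
        carriers = 0
        for sample in group:
            gt = fields[columns[sample]].split(":")[gt_index]
            alleles = gt.replace("|", "/").split("/")
            # A carrier has at least one called allele other than the reference (0).
            if any(a not in ("0", ".") for a in alleles):
                carriers += 1
        if carriers == len(group):
            shared.append((fields[0], int(fields[1]), fields[3], fields[4]))
    return shared


# Made-up demo data: a four-sample VCF with three sites.
header = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT",
          "Sample1", "Sample2", "Sample5", "Sample6"]
records = [
    ["chr1", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0/0", "0/1", "1/1", "0/1"],
    ["chr1", "200", ".", "C", "T", ".", "PASS", ".", "GT", "0/1", "0/1", "0/0", "./."],
    ["chr2", "300", ".", "G", "A", ".", "PASS", ".", "GT:DP", "1/1:10", "1|1:12", "0|1:8", "0/1:9"],
]
demo_vcf = ["##fileformat=VCFv4.2"] + ["\t".join(row) for row in [header] + records]

bleeding = shared_variants(demo_vcf, ["Sample2", "Sample5", "Sample6"])
print(bleeding)
```

The same function could be called with a family's sample names instead of a symptom group. Multi-allelic sites are treated loosely here (any non-reference allele counts), so a stricter comparison would need to match on the specific ALT allele.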
The differences will be in the INFO column, especially in tags such as AC and AN. In the combined VCF, these tags hold statistics aggregated across all samples. Other than that, I don't think there would be any differences.
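As a rough illustration of why AC and AN differ, the sketch below recomputes them from genotypes at a single biallelic site, following the VCF definitions (AC is the count of ALT alleles in called genotypes, AN is the total number of called alleles). The helper name `allele_counts` and the genotypes are hypothetical, chosen only to show the effect.

```python
def allele_counts(genotypes):
    """Return (AC, AN) for a biallelic site from a list of GT strings."""
    ac = 0
    an = 0
    for gt in genotypes:
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # missing calls count toward neither AC nor AN
            an += 1
            if allele != "0":
                ac += 1
    return ac, an


# Hypothetical genotypes at one site (not taken from the real data).
site = {
    "Sample2": "0/1",
    "Sample5": "1/1",
    "Sample6": "0/1",
    "Sample7": "0/0",
    "Sample8": "./.",
}

combined = allele_counts(list(site.values()))
subset = allele_counts([site[s] for s in ("Sample2", "Sample5", "Sample6")])
single = allele_counts([site["Sample2"]])

print("all samples:", combined)
print("bleeding subset:", subset)
print("Sample2 only:", single)
```

The genotypes are identical in every case, yet the counts depend on which samples are included. Whether a subsetting tool recalculates these tags depends on the tool and its options, so it is worth checking the INFO column after extracting samples.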
Currently, I am generating the individual VCF files. Once that is complete, I will update you.