Hi,
I have performed a multisample variant calling on 81 non-human mammal samples using GATK. I wanted to do allele-specific expression analysis of 4 samples out of the 81 for which I have RNAseq data. For doing that, I selected only biallelic heterozygous variants from those 4 samples from the multisample VCF file to make separate VCFs for each sample and filtered variants whose GQ (Genotype quality) Phred score < 40.
When I compared the statistics of the remaining variants, the number of variants remaining in chromosome 1 of one sample was in the range of 300000 to 400000, but the other samples had >500000 variants remaining in chromosome 1.
Can there be a biological explanation to this? What all additional analyses can be done to explain this difference? Kindly let me know if any more information is needed for this question.
Yes, it may help to show your sequence of commands.
Ok. So, after obtaining the multisample VCF file after adapting the GATK multisample variant calling pipeline, only biallelic SNPs were obtained using this command:
Then I performed hard-filtering on the multisample VCF file to remove potential false positives (cannot perform soft-filtering due to the absence of truth sets)
Variant annotation was done after this:
Sample-specific VCFs obtained using the following command for 4 samples:
Heterozygous variants from the 4 samples with GQ >=40 was obtained using this command:
I hope this is ok. Kindly let me know if any more information is needed.
P.S. bcftools version: 1.6, GATK version: 4.0.4.0, the 4 samples are from the same species and breed
Just a hint for the
bcftools
command: There is no need to pipe tobgzip
. Just use-o
to define the output name.bcftools
recognize the file extension and will compress on its own. Also you can use-O
to define explicit how the output should looke like.fin swimmer
Thanks a lot for the hint! :-)
Hello,
have you checked if there are more low depth regions for the sample with less remaining variants? Thsi could cause that there were less variants called.
fin swimmer
Just checked that after running 'samtools depth' with -a option on only chromosome 1 for all 4 samples. There are more low depth regions for the sample with less remaining variants. There were 39118 more loci with 0 depth than the sample having the second highest number of loci with 0 depth. It was difficult for me to make a distribution graph for each depth file as the files are huge. So, I just counted the number for '0' for each depth file using 'grep'.