Question

Filtering results difference amongst VCF files

0

Entering edit mode

6.1 years ago

prasundutta87 ▴ 670

Hi,

I have performed a multisample variant calling on 81 non-human mammal samples using GATK. I wanted to do allele-specific expression analysis of 4 samples out of the 81 for which I have RNAseq data. For doing that, I selected only biallelic heterozygous variants from those 4 samples from the multisample VCF file to make separate VCFs for each sample and filtered variants whose GQ (Genotype quality) Phred score < 40.

When I compared the statistics of the remaining variants, the number of variants remaining in chromosome 1 of one sample was in the range of 300000 to 400000, but the other samples had >500000 variants remaining in chromosome 1.

Can there be a biological explanation to this? What all additional analyses can be done to explain this difference? Kindly let me know if any more information is needed for this question.

SNP filtering GATK variants • 1.9k views

ADD COMMENT • link 6.1 years ago by prasundutta87 ▴ 670

0

Entering edit mode

Yes, it may help to show your sequence of commands.

ADD REPLY • link 6.1 years ago by Kevin Blighe 88k

0

Entering edit mode

Ok. So, after obtaining the multisample VCF file after adapting the GATK multisample variant calling pipeline, only biallelic SNPs were obtained using this command:

bcftools view -v snps -m2 -M2 --threads 8 <multisample_VCF_file>|bgzip --threads 8 ><biallelic_gzipped_VCF>

Then I performed hard-filtering on the multisample VCF file to remove potential false positives (cannot perform soft-filtering due to the absence of truth sets)

bcftools filter -i '<GATK recommended hard-filtering parameters>' --threads 8 <biallelic_gzipped_VCF>|bgzip --threads 8 > <biallelic_filtered_gzipped_VCF>

Variant annotation was done after this:

java -jar snpEff.jar -v <reference_genome> <biallelic_filtered_gzipped_VCF>|bgzip ><biallelic_filtered_gzipped_VCF_snpeff>

Sample-specific VCFs obtained using the following command for 4 samples:

java -jar gatk-package-4.0.4.0-local.jar SelectVariants -R <reference_genome> -V <biallelic_filtered_gzipped_VCF_snpeff> -O <Sample_1_biallelic_filtered_gzipped_VCF_snpeff> -sn <Sample_1>

Heterozygous variants from the 4 samples with GQ >=40 was obtained using this command:

bcftools view -g het -i 'GQ >= 40' <Sample_1_biallelic_filtered_gzipped_VCF_snpeff> > <Sample_1_biallelic_het_GQ40.vcf>

I hope this is ok. Kindly let me know if any more information is needed.

P.S. bcftools version: 1.6, GATK version: 4.0.4.0, the 4 samples are from the same species and breed

ADD REPLY • link 6.1 years ago by prasundutta87 ▴ 670

0

Entering edit mode

Just a hint for the bcftools command: There is no need to pipe to bgzip. Just use -o to define the output name. bcftoolsrecognize the file extension and will compress on its own. Also you can use -O to define explicit how the output should looke like.

fin swimmer

ADD REPLY • link 6.1 years ago by finswimmer 16k

0

Entering edit mode

Thanks a lot for the hint! :-)

ADD REPLY • link 6.1 years ago by prasundutta87 ▴ 670

0

Entering edit mode

Hello,

have you checked if there are more low depth regions for the sample with less remaining variants? Thsi could cause that there were less variants called.

fin swimmer

ADD REPLY • link 6.1 years ago by finswimmer 16k

0

Entering edit mode

Just checked that after running 'samtools depth' with -a option on only chromosome 1 for all 4 samples. There are more low depth regions for the sample with less remaining variants. There were 39118 more loci with 0 depth than the sample having the second highest number of loci with 0 depth. It was difficult for me to make a distribution graph for each depth file as the files are huge. So, I just counted the number for '0' for each depth file using 'grep'.

ADD REPLY • link 6.1 years ago by prasundutta87 ▴ 670