Question

Low mean of read depth - filtering SNP in VCF file


5.5 years ago

Trevorsky • 0

I have two WGS datasets for cattle and am looking for a filtering protocol that retains only reliable SNPs. I found several publications that removed SNPs lying in the immediate neighbourhood of one another (e.g. at three consecutive positions) and rejected SNPs whose read depth (DP) fell more than 3 standard deviations from the mean or median. My data have a very low average depth, with a mean DP of about 12 and an SD of about 6, which rules out this method.
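To make the depth problem concrete: with a mean of 12 and an SD of 6, the lower cutoff is 12 - 3 × 6 = -6, which no site can fall below, so only the upper cutoff of 12 + 3 × 6 = 30 would filter anything. Below is a minimal sketch for computing these numbers from a VCF with plain awk. The function name `dp_stats` is hypothetical, and the sketch assumes every record carries a site-level `DP=` entry in the INFO column.

```bash
# dp_stats: read a VCF on stdin and summarise the site-level INFO/DP values.
# Prints n, mean, sample SD, and the mean -/+ 3 SD bounds (tab-separated).
dp_stats() {
  awk -F'\t' '
    /^#/ { next }                               # skip header lines
    {
      n_fields = split($8, info, ";")           # INFO is column 8
      for (i = 1; i <= n_fields; i++) {
        if (info[i] ~ /^DP=/) {
          dp = substr(info[i], 4) + 0           # text after "DP=" as a number
          n++; sum += dp; sumsq += dp * dp
        }
      }
    }
    END {
      if (n < 2) {
        print "need at least two sites with DP" > "/dev/stderr"
        exit 1
      }
      mean = sum / n
      var = (sumsq - n * mean * mean) / (n - 1) # sample variance
      if (var < 0) var = 0                      # guard against rounding error
      sd = sqrt(var)
      printf "n=%d\tmean=%.2f\tsd=%.2f\tlower=%.2f\tupper=%.2f\n", n, mean, sd, mean - 3 * sd, mean + 3 * sd
    }'
}
```

Usage would look like `dp_stats < calls.vcf`, where `calls.vcf` is a placeholder file name. A negative `lower` value in the output is the sign that the symmetric 3 SD rule cannot remove low-depth sites in this dataset.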

I have been searching the literature for two weeks now, but the information is scarce and does not add up to any meaningful protocol.

I am especially looking for publications or protocols that use BASH or vcftools/bcftools.

snp vcf filtering • 2.9k views

updated 5.5 years ago by joneill4x ▴ 160 • written 5.5 years ago by Trevorsky • 0

score 1 · Answer 1 · 2019-11-07

I would recommend processing your data with GATK and following their best practices.

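Since your species is cattle, you will 'hard-filter' the variant calls rather than use variant quality score recalibration, which needs large sets of known variants for training: https://gatkforums.broadinstitute.org/gatk/discussion/2806/howto-apply-hard-filters-to-a-call-set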

Once GATK has output the final, hard-filtered VCF file, you can eliminate any genotype call with a Genotype Quality (GQ) score below a specified threshold. GQ is on the Phred scale, and we use a threshold of 20, which corresponds to 99% accuracy (a 1% chance that the genotype call is wrong).
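On the Phred scale the error probability is 10^(-GQ/10), so GQ = 20 gives 10^(-2) = 0.01. As a rough illustration of the masking step for anyone without the Perl script used further down, here is a minimal awk sketch. The function name `mask_low_gq` is hypothetical, and the sketch assumes the FORMAT column lists both `GT` and `GQ`. Genotypes with a missing GQ are left untouched, which is a design choice of this sketch rather than anything the answer specifies.

```bash
# mask_low_gq MIN_GQ: read a VCF on stdin, set GT to ./. for every sample
# whose GQ is below MIN_GQ (default 20), and write the VCF to stdout.
mask_low_gq() {
  awk -F'\t' -v OFS='\t' -v min_gq="${1:-20}" '
    /^#/ { print; next }                        # pass header lines through
    {
      n_keys = split($9, key, ":")              # FORMAT is column 9
      gt_idx = 0; gq_idx = 0
      for (i = 1; i <= n_keys; i++) {
        if (key[i] == "GT") gt_idx = i
        if (key[i] == "GQ") gq_idx = i
      }
      if (gt_idx && gq_idx) {
        for (s = 10; s <= NF; s++) {            # samples start at column 10
          n_vals = split($s, val, ":")
          if (n_vals < gq_idx || val[gq_idx] == ".") continue
          if (val[gq_idx] + 0 < min_gq + 0) {
            val[gt_idx] = "./."                 # mask the genotype call
            out = val[1]
            for (j = 2; j <= n_vals; j++) out = out ":" val[j]
            $s = out
          }
        }
      }
      print
    }'
}
```

Usage would look like `mask_low_gq 20 < F3bR3b_filtSNPs.vcf > masked.vcf`, with `masked.vcf` as a placeholder output name. With bcftools installed, `bcftools filter -S . -e 'FMT/GQ<20'` is intended to do the same job; check the manual for your version.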

Once we started to use the GATK pipeline, the quality of our variant calls increased dramatically. Here are some example commands:

// prepare reference genome

java -jar picard.jar CreateSequenceDictionary R=F3bR3b.fasta O=F3bR3b.dict

bwa index F3bR3b.fasta

samtools faidx F3bR3b.fasta

// align reads from one sample to the ref genome

bwa mem -M -R '@RG\tID:HA93071\tSM:POOL2' -t 16 HA93071_gen.fasta POOL2_S2_L001_R1_001.fastq.gz POOL2_S2_L001_R2_001.fastq.gz | samtools sort -@ 8 -T temp -o HA93071_sorted.bam

samtools index HA93071_sorted.bam

// skipped the MarkDuplicates step

// skipped the Base Quality Score Recalibration step

// HaplotypeCaller step

java -jar /home/joneill/bin/GenomeAnalysisTK.jar -T HaplotypeCaller -R F3bR3b.fasta -I F3BR3b_sorted.bam --emitRefConfidence GVCF -o F3bR3b.g.vcf -variant_index_type LINEAR -variant_index_parameter 128000 -nct 16

// GenotypeGVCFs step

java -jar /home/joneill/bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -R F3bR3b.fasta --variant F3bR3b.g.vcf -o F3bR3b.vcf -nt 1

// Variant Filtration step

java -jar /home/joneill/bin/GenomeAnalysisTK.jar -T VariantFiltration -R F3bR3b.fasta -V F3bR3b.vcf --filterExpression "QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "deleteThis" -o badsMarked.vcf
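For readers who want to see what this filter expression does to individual records, the same five thresholds can be sketched in plain awk. This is an illustration under stated assumptions, not a substitute for VariantFiltration: the function name `mark_hard_filters` is hypothetical, annotations absent from a record are simply not tested, and records with no failing annotation are labelled `PASS`.

```bash
# mark_hard_filters NAME: read a VCF on stdin and write NAME (default
# deleteThis) into the FILTER column of every record that breaks a threshold.
mark_hard_filters() {
  awk -F'\t' -v OFS='\t' -v name="${1:-deleteThis}" '
    /^#/ { print; next }                        # pass header lines through
    {
      split("", ann)                            # clear values from the previous record
      n = split($8, kv, ";")                    # INFO is column 8
      for (i = 1; i <= n; i++) {
        eq = index(kv[i], "=")
        if (eq > 0) ann[substr(kv[i], 1, eq - 1)] = substr(kv[i], eq + 1)
      }
      fail = 0
      if (("QD" in ann) && ann["QD"] + 0 < 2.0) fail = 1
      if (("FS" in ann) && ann["FS"] + 0 > 60.0) fail = 1
      if (("MQ" in ann) && ann["MQ"] + 0 < 40.0) fail = 1
      if (("MQRankSum" in ann) && ann["MQRankSum"] + 0 < -12.5) fail = 1
      if (("ReadPosRankSum" in ann) && ann["ReadPosRankSum"] + 0 < -8.0) fail = 1
      if (fail) $7 = ($7 == "." || $7 == "PASS") ? name : ($7 ";" name)
      else if ($7 == ".") $7 = "PASS"
      print
    }'
}
```

Usage would look like `mark_hard_filters deleteThis < F3bR3b.vcf > badsMarked.vcf`. Marking rather than deleting keeps the failing records in the file, so the thresholds can be revisited before anything is discarded.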

// Select variants step, only keep SNPs that pass the filtering criteria

java -jar /home/joneill/bin/GenomeAnalysisTK.jar -T SelectVariants -R F3bR3b.fasta --variant badsMarked.vcf -o F3bR3b_filtSNPs.vcf -select 'vc.isNotFiltered()' --selectTypeToInclude SNP
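The selection step above can be approximated without GATK. The sketch below keeps records whose FILTER column is `PASS` or `.` and whose REF and ALT alleles are all single bases. The function name `keep_pass_snps` is hypothetical, and "single-base REF and ALT" is this sketch's own definition of a SNP, which may not match GATK's classification in every edge case.

```bash
# keep_pass_snps: read a VCF on stdin and write only unfiltered SNP records
# (plus all header lines) to stdout.
keep_pass_snps() {
  awk -F'\t' '
    /^#/ { print; next }                        # pass header lines through
    $7 != "PASS" && $7 != "." { next }          # drop records carrying a filter name
    length($4) != 1 { next }                    # REF must be a single base
    {
      n = split($5, alt, ",")                   # ALT may list several alleles
      for (i = 1; i <= n; i++)
        if (alt[i] !~ /^[ACGTNacgtn]$/) next    # every ALT allele must be a single base
      print
    }'
}
```

Usage would look like `keep_pass_snps < badsMarked.vcf > F3bR3b_filtSNPs.vcf`. With bcftools installed, `bcftools view -f .,PASS -v snps` is the usual way to express a similar selection; check the manual for your version.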

// convert final VCF to more readable format, mask any genotype calls with a genotype quality score < 20

perl /mnt/nas/JoesBackup/gatkVCFtoHapMap.pl F3bR3b_filtSNPs.vcf 20 3 F3bR3b.hmp.txt F3bR3b.hmc.txt