DE novo call from trio
1
3
Entering edit mode
5.4 years ago

Hi!

I had done

  • Alignment
  • Sorting sam file
  • Mark duplicates
  • Realignment call variants
  • BQSR
  • Call raw snps and Apply hard filtering

Then I merge my three vcf files (dad, mum, child)

 java -jar GenomeAnalysisTK.jar -T CombineVariants -R hg38.fa --variant sample02.vcf --variant sample06.vcf --variant sample08.vcf -o 268merged.vcf -genotypeMergeOptions UNIQUIFY

and apply next genotype refinement workflow

Im wondering if its common to dont get any results as hiConfDeNovo and lowConfDeNovo in the results (raw snps or after hard filtering)

 INFO=<ID=hiConfDeNovo,Number=1,Type=String,Description="High confidence possible de novo mutation (GQ >= 20 for all trio members)=[comma-delimited list of child samples]">
 INFO=<ID=loConfDeNovo,Number=1,Type=String,Description="Low confidence possible de novo mutation (GQ >= 10 for child, GQ > 0 for parents)=[comma-delimited list of child samples]">
CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample02.variant    sample06.variant2   sample08.variant3
1   917495  .   C   T   30.74   PASS    AC=2;AF=1.00;AN=2;DP=2;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;PG=0,0,0;QD=15.37;SOR=0.693;set=variant3   GT:AD:DP:FT:GQ:PL:PP    ./.:.:.:PASS    ./.:.:.:PASS    1/1:0,2:2:lowGQ:6:58,6,0:58,6,0

Thanks :)

exome • 3.8k views
ADD COMMENT
1
Entering edit mode

Its possible. But I think following change would help.

  1. After BQSR, create g.vcf files for each sample using HaplotypeCaller -ERC GVCF option.
  2. Create a combined genotype using GenotypeGVCFs
  3. Separate SNPs and Indels using SelectVariants and perform hard filtration
  4. Combine SNPs and Indels using CombineVariants

Can you confirm that following steps were followed after hard filtration?

  1. Perform genotype refinement using CalculateGenotypePosteriors. A ped file will have to be provided.
  2. Filter low quality genotypes using VariantFiltration
  3. Annotate de novos using VariantAnnotator using -A PossibleDeNovo option. Again ped file will have to be provided.
ADD REPLY
0
Entering edit mode
5.4 years ago

Thanks for the help Khare :)

Its does really matter if you use gvcf format instead vcf? I thought was the same. Will try with -ERC GVCF option. and see if I get any different result. Thanks for the advice.

If I use Varscan using mpileup at the end I get similar results. I am just a bit surprised didnt appear any low or high De novo variants using Gatk.

"Can you confirm that following steps were followed after hard filtration?"

I do. I perform this pipeline after doing hard filtration (I only select snps variants atm Im not interested on indels)

CombineVariants

java -jar GenomeAnalysisTK.jar -T CombineVariants -R hg38.fa --variant 04PASS.vcf --variant 06PASS.vcf --variant 08PASS.vcf -o 468mergedPASS.vcf -genotypeMergeOptions UNIQUIFY

Convert to biallelic, If I dont perform those steps gatk gives me error.

java -jar GenomeAnalysisTK.jar -T SelectVariants -R hg38.fa --variant 468mergedPASS.vcf -restrictAllelesTo BIALLELIC -o bi468merged.vcf

java -jar GenomeAnalysisTK.jar -T SelectVariants -R hg38.fa --variant 1000G_phase1.snps.high_confidence.b37.vcf.gz -restrictAllelesTo BIALLELIC -o 1000G_phase1.snps.high_confidence.b37_BIALLELIC_ONLY.vcf.gz

Step 1: Derive posterior probabilities of genotypes

java -jar GenomeAnalysisTK.jar -R hg38.fa -T CalculateGenotypePosteriors --supporting 1000G_phase1.snps.high_confidence.b37_BIALLELIC_ONLY.vcf.gz -ped fam10.ped -V bi468merged.vcf -o recalibratedVariants.postCGP.vcf

Step 2: Filter low quality genotypes

java -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg38.fa -V recalibratedVariants.postCGP.vcf -G_filter "GQ < 20.0" -G_filterName "lowGQ" -o recalibratedVariants.postCGP.Gfilteredcall.vcf

Step 3: Annotate possible de novo mutations

java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R hg38.fa -V recalibratedVariants.postCGP.Gfilteredcall.vcf -A PossibleDeNovo -ped fam10.ped -o recalibratedVariants.postCGP.Gfiltered.deNovos.vcf
ADD COMMENT
1
Entering edit mode

No, gvcf wont change the results. Just that adding samples later will be easier. I am not sure why "convert to biallelic" step is necessary. I also worked only on SNPs but did not run this command. Everything else looks fine. What were the hard filtration parameters?

Important: You may want to use the same genome version for all files. I mean the genome fasta file should also be from the b37 bundle.

Best,

ADD REPLY
0
Entering edit mode

Ok, thanks

"I am not sure why "convert to biallelic" step is necessary."

I followed the tips at gatk forum

CalculateGenotypePosteriors error

"What were the hard filtration parameters?"

I used this pipeline:

https://gencore.bio.nyu.edu/variant-calling-pipeline/

 java -jar GenomeAnalysisTK.jar -T VariantFiltration -R ref -V raw_snps.vcf --filterExpression 'QD < 2.0 || FS > 60.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0 || SOR > 4.0' --filterName "basic_snp_filter" -o filtered_snps.vcf

.

"Important: You may want to use the same genome version for all files. I mean the genome fasta file should also be from the b37 bundle."

Im bit confuse at that point, when I did my alignment I used

wget ftp://ftp.ensembl.org/pub/release-97/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.chromosome.*.fa.gz

I should use instead --supporting 1000G_phase1.snps.high_confidence.b37_BIALLELIC_ONLY.vcf.gz that one?

ADD REPLY
1
Entering edit mode

Oh, its select biallelic. Anyway, as I mentioned, I never had to do this step.

Regarding the last point (and this may get rid of this biallelic issue), you need to use b37 package for all files. Assuming you downloaded the vcf file 1000G_phase1.snps.high_confidence.b37.vcf.gz from this link either using Google cloud or FTP server access, you should download the genome fasta file also from the same folder. It is better to use all files from the same bundle to avoid compatibility issues.

ADD REPLY
0
Entering edit mode

Ok, but If I use the reference genome from ftp.broadinstitute.org/bundle then I asume I should do the alignment over again from the start with my fastq files right? ( I used from start the reference genome from ensembl.org)

Should really affect that point on my results to get high or low denovo variants?

Thanks

ADD REPLY
1
Entering edit mode

Yes, you need to perform the alignment again. I think it will affect the outcome because the coordinates of the vcf file won't match the alignment files in current scenario.

ADD REPLY
0
Entering edit mode

thanks for your time Khare, I really appreciate it :)

I have a last question

I could just use my reference genome from ensembl.org and use as --supporting file ftp://ftp.1000genomes.ebi.ac.uk/../../vol1/ftp/release/20130502/phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.....files?

then my files should have the same compatibility right?

The reason bc I use the reference genome from ensembl is the compatibility for do the annotations with snpEff. In my pc I take a hole day to do alignment so if I can avoid to do over again would save me lot time.

ADD REPLY
0
Entering edit mode

The link did not open, but assuming that you have a vcf file from the same source and corresponding to the same genome version you used for alignment, it should work. But I would still recommend using GATK bundle since you may need some files in future which are exclusively present in the GATK bundle and then there will be compatibility issues.

ADD REPLY
0
Entering edit mode

Hello Khare

sorry, I didnt want to write you back until I perfom again the alignment. this time I used the reference genome from GATK and supporting file from GATK, and run the process from the start. My results are almost the same and dint get any high or lowdenovo variants.

ADD REPLY
0
Entering edit mode

Sorry about the late reply. Did you get an error without "convert to biallelic" step this time too?

ADD REPLY

Login before adding your answer.

Traffic: 1700 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6