I want to know is it a true method ?
0
0
Entering edit mode
6.7 years ago

Hi I have 50 DNA whole genome seq(bam file) that these samples produced separately (until re calibration step) and Then for variant calling using Unified Genotyper ,I produced one vcf file for all of these samples. I want to know is it a true method ? or I should merge all samples together in a bam file from realignment step until end of variant calling step?

vcf SNP realignment • 2.1k views
ADD COMMENT
0
Entering edit mode

Go through the specific commands that you used, please. Thank you!

ADD REPLY
0
Entering edit mode

Thank you so much. so my method is true.

ADD REPLY
0
Entering edit mode

No. Can you, please, paste (here, in a text box, via the ADD REPLY button) the commands that you executed when you were processing this data? :)

You have not provided enough information for anyone to give an answer.

ADD REPLY
0
Entering edit mode

indel realignment

java –Xmx16g -jar /GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -R Equus_caballus.EquCab2.dna.toplevel.fa -T RealignerTargetCreator -o bam1_Realigner.intervals -I bam1  -known equus_caballus.vcf

realignment

java -Xmx16g -jar /GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -R Equus_caballus.EquCab2.dna.toplevel.fa -T IndelRealigner -targetIntervals bam1_Realigner.intervals -I bam1   -o bam1.indel.re.bam

make grp

java -Xmx16g -jar  /GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -R Equus_caballus.EquCab2.dna.toplevel.fa -T BaseRecalibrator   -I bam1.indel.re.bam -knownSites  equus_caballus.vcf -o bam1.indel.re.grp

recalibration

java -Xmx16g -jar  /GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -R Equus_caballus.EquCab2.dna.toplevel.fa -T PrintReads  -BQSR bam1.indel.re.grp -I bam1.indel.re.bam   -o bam1.inde.re.recal.bam
  • I do this method separately per each sample until end of recalibration and then variant calling for all samples together:

variant calling

java -Xmx16g -jar  /GenomeAnalysisTK-3.8-0-ge9d806836/GenomeAnalysisTK.jar -R /Equus_caballus.EquCab2.dna.toplevel.fa   -T UnifiedGenotyper    -glm BOTH -o total_variants.vcf -I bam1.inde.re.recal.bam -I bam2.inde.re.recal.bam -I bam3.inde.re.recal.bam  .......... -I bam50.inde.re.recal.bam
ADD REPLY
0
Entering edit mode

Hello,

there is no one and only "true" way for processing your data. The way can differ due to the sequencing platform, the source of your material, the goal of you experiment and so on.

What you have shown is ok I think, but there is a lot space for improvements. So for example using GATKs HaplotypeCaller or freebayes instead of UnifiedGenotyper. Doing this you can avoid realignment completly.

fin swimmer

ADD REPLY
0
Entering edit mode

thank you so much. you mean when I using HaplotypeCaller or freebayes, I can avoid realignment ?

ADD REPLY
0
Entering edit mode

Yes,

both discard the alignment information within a window around a suspected ins/del and doing a de novo assembly with the reads within this window. This is much more powerful.

Even BQR is questionable. If the vast majority of your sequenced bases have (very) good Q Scores the benefit of it will be negligible.

fin swimmer

ADD REPLY
0
Entering edit mode

Sorry, I got caught up in London rush hour.

As per finswimmer, definitely switch to using HaplotypeCaller.

I also agree with finswimmer's mild skepticism about the BaseQualityScoreRecalibration - I don't believe that it affects much, possibly just a few bases that are on the fringe of poor/good quality. As you have already included it, you do not have to exclude it.

The re-alignment for indels, I believe that you should keep this. I have noticed that it can definitely improve calling of indels when compared to, for example, samtools mpileup.

The other question is surrounding your use of UnifiedGenotyper / HaplotypeCaller and in supplying multiple BAMs. The other option is to call variants separately on each BAM and to then merge the VCFs together by performing joint genotyping.

The GATK's variant calling pipelines are, in my opinion, overly complex, and make something difficult when it should be easy. You should learn 'backup' methods that are different from the GATK's. Obviously, however, you should be generally aware of the 'best practices': https://gatkforums.broadinstitute.org/gatk/discussion/3238/best-practices-for-variant-discovery-in-dnaseq

In any case, here is a quick summary of what to do:

  1. Alignment (non-GATK) - if reads >70bp, use bwa mem; if shorter, use the original bwa algorithm or bowtie
  2. Mark duplicates (Picard)
  3. Local re-alignment around potential indels (GATK)
  4. BQSR (GATK)
  5. Variant calling per sample (GATK HaplotypeCaller)
  6. Sample aggregation and joint genotyping for all samples (GATK GenotypeGVCFs)
  7. Variant recalibration (GATK VariantRecalibrator and ApplyRecalibration)

The final step is definitely not necessary and may even result in you throwing out good calls.

ADD REPLY
0
Entering edit mode

Thank you but my main question is: 1.can I use bam files separately in Local re-alignment and BQSR steps or I should merge these bam files after Mark duplicates? 2.when I do Variant calling separately then I do not have any 0/0 genotype but when I do Variant calling all samples or together (even two samples) then I have 0/0.

ADD REPLY
0
Entering edit mode

Do not merge them for the purposes of making duplicates or re-aligning indels. Treat them as separate samples.

You can merge them during variant calling, in accordance with how the GATK's tools are developed.

ADD REPLY
1
Entering edit mode

thank you so much for useful guidance.

ADD REPLY
0
Entering edit mode

The re-alignment for indels, I believe that you should keep this. I have noticed that it can definitely improve calling of indels when compared to, for example, samtools mpileup.

I know gatk's best practice guides still recommend it, but I cannot understand why this should improve my variant calling, as the HaplotypeCaller doesn't rely on the alignment. Can you explain it?

Variant calling per sample (GATK HaplotypeCaller)

Before starting with the Variant calling or doing anything else with the bam file I would recommend to sort the file by coordinates, as a lot of tools expect it that way. It is annoying to realize that first if you want to test a tool you didn't think of before.

fin swimmer

ADD REPLY
0
Entering edit mode

I would recommend to sort the file by coordinates: which software can do it?

ADD REPLY
1
Entering edit mode

I would recommend to sort the file by coordinates: which software can do it?

samtools sort can do this.

ADD REPLY
0
Entering edit mode

I know gatk's best practice guides still recommend it, but I cannot understand why this should improve my variant calling, as the HaplotypeCaller doesn't rely on the alignment. Can you explain it?

I have seen how the use / non-use of the IndelRealigner does modify calls on indels, and this is when using HaplotypeCaller.

ADD REPLY
0
Entering edit mode

it is my method. is it true?

ADD REPLY

Login before adding your answer.

Traffic: 2452 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6