I want to simulate variant calling on an ultra-low-coverage (~0.005x) BAM file, so I subsampled reads from sample HG02024 of the 1KG phase 3 dataset.
My R code to do this is the following (bam and reference are just path prefixes, file is the initial BAM file):
library(stringr)  #provides str_glue(), str_c() and str_replace()
#fraction of reads to keep, passed to samtools view -s
cov_rate <- 11282/28533515
#subsampling with samtools (-b keeps the output in BAM format)
system(str_glue("samtools view -b -s {cov_rate} {file} -o {bam}HG02024_cov_{cov_rate}x.bam"))
#sort the bam file
system(str_glue("samtools sort {bam}HG02024_cov_{cov_rate}x.bam -o {bam}HG02024_cov_{cov_rate}x_sort.bam"))
#index the bam file
system(str_glue("samtools index {bam}HG02024_cov_{cov_rate}x_sort.bam"))
#extract chromosome 1-22 from bam file, everything else is not relevant in the further analysis
system(str_c('samtools view -b ',bam,'HG02024_cov_',cov_rate,'x_sort.bam 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 > ',bam,'HG02024_cov_',cov_rate,'x_sort_chr1_22.bam'))
#build and run the HaplotypeCaller call (i comes from the surrounding workflow: the .bam file name the output VCF name is derived from)
call <- str_glue(
  'python3 $HOME/gatk/gatk --java-options "-Xmx28g" HaplotypeCaller ',
  '--disable-tool-default-read-filters --native-pair-hmm-threads 24 ',
  '-R {reference}/GRCh37/references_hs37d5_hs37d5_chr1_22.fa ',
  '-I HG02024_cov_{cov_rate}x_sort_chr1_22.bam ',
  '-O {str_replace(i, ".bam", ".vcf.gz")} -ERC GVCF')
system(call)
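Since cov_rate is a fraction of reads rather than a target depth, it is worth double-checking the coverage that actually results from the subsampling. The following is only a sanity-check sketch, assuming samtools >= 1.10 so that samtools coverage (with its meandepth column) is available:
#sanity check (sketch): read counts and per-chromosome mean depth of the subsampled BAM
system(str_glue("samtools flagstat {bam}HG02024_cov_{cov_rate}x_sort_chr1_22.bam"))
system(str_glue("samtools coverage {bam}HG02024_cov_{cov_rate}x_sort_chr1_22.bam"))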
But HaplotypeCaller then fails with the error below as soon as I add --disable-tool-default-read-filters, which is necessary for the project:
java.lang.NullPointerException
at org.broadinstitute.hellbender.utils.locusiterator.ReadStateManager.readStartsAtCurrentPosition(ReadStateManager.java:117)
at org.broadinstitute.hellbender.utils.locusiterator.ReadStateManager.collectPendingReads(ReadStateManager.java:144)
at org.broadinstitute.hellbender.utils.locusiterator.LocusIteratorByState.lazyLoadNextAlignmentContext(LocusIteratorByState.java:287)
at org.broadinstitute.hellbender.utils.locusiterator.LocusIteratorByState.hasNext(LocusIteratorByState.java:224)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.advanceAlignmentContext(IntervalAlignmentContextIterator.java:104)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.advanceAlignmentContextToCurrentInterval(IntervalAlignmentContextIterator.java:99)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.next(IntervalAlignmentContextIterator.java:69)
at org.broadinstitute.hellbender.utils.locusiterator.IntervalAlignmentContextIterator.next(IntervalAlignmentContextIterator.java:21)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.loadNextAssemblyRegion(AssemblyRegionIterator.java:123)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:115)
at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:35)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:192)
at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:173)
at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1095)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:140)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
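The crash only appears once all default read filters are disabled, so one debugging idea would be to keep the defaults and switch off only selected filters with --disable-read-filter, to narrow down which filter the NullPointerException depends on. This is just a sketch: the filter names are taken from the GATK4 HaplotypeCaller documentation, and re-enabling filters may of course conflict with the project's requirement to keep every read.
#debugging sketch: disable individual default filters instead of all of them
call_df <- str_glue(
  'python3 $HOME/gatk/gatk --java-options "-Xmx28g" HaplotypeCaller ',
  '--disable-read-filter NotDuplicateReadFilter ',
  '--disable-read-filter MappingQualityReadFilter ',
  '--native-pair-hmm-threads 24 ',
  '-R {reference}/GRCh37/references_hs37d5_hs37d5_chr1_22.fa ',
  '-I HG02024_cov_{cov_rate}x_sort_chr1_22.bam ',
  '-O HG02024_cov_{cov_rate}x_sort_chr1_22_df.vcf.gz -ERC GVCF')
system(call_df)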
Apart from the filter question, the NullPointerException first led me to suspect that, because I removed some irrelevant contigs from the BAM file, I might also need to edit its header. I then learned from other Biostars posts that editing the header of a SAM/BAM file is generally a bad idea, and indeed my attempt produced a corrupted BAM file. After that I tried editing the reference FASTA instead, but the HaplotypeCaller error persists.
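An alternative that avoids touching the BAM header or the reference at all would be to keep the full sorted and indexed BAM together with the full hs37d5 reference and restrict HaplotypeCaller to chromosomes 1-22 with -L intervals. A sketch of that route (the path to the full reference FASTA is an assumption, and I have not verified that this sidesteps the NullPointerException):
#sketch: call only chromosomes 1-22 via -L instead of physically removing contigs
#GATK accepts a plain-text interval list with one interval (here a chromosome name) per line
writeLines(as.character(1:22), str_c(bam, "chr1_22.intervals"))
#-R must point at the full hs37d5 FASTA (path below is a guess, adjust as needed)
call_L <- str_glue(
  'python3 $HOME/gatk/gatk --java-options "-Xmx28g" HaplotypeCaller ',
  '--disable-tool-default-read-filters --native-pair-hmm-threads 24 ',
  '-R {reference}/GRCh37/hs37d5.fa ',
  '-I {bam}HG02024_cov_{cov_rate}x_sort.bam ',
  '-L {bam}chr1_22.intervals ',
  '-O {bam}HG02024_cov_{cov_rate}x_sort_chr1_22.vcf.gz -ERC GVCF')
system(call_L)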
Does anyone know where this error comes from and how to solve it?
Are you using R as a workflow manager?
https://github.com/broadinstitute/gatk/issues/4665#issuecomment-382367333
@Pierre Lindenbaum: Thanks for your input! Yes, R is indeed my workflow manager. I have worked with this low-coverage calling pipeline before and followed a guide for ultra-low-coverage calling, though I can't find it anymore. That guide recommended this kind of calling strategy to preserve as many reads as possible. Could you expand on why this is really such a "big hammer"?
Makefile, Nextflow, Snakemake, etc.: you're reinventing the wheel without the good parts (parallelism, caching, ...).