Merge VCF files by chromosome and Across samples
0
0
Entering edit mode
1 day ago
Maverick ▴ 10

Hello,

I had a gatk pipeline setup successfully to call SNPs and Indels. however it was running whole samples and to reduce time I split them by chromosomes (after mapping, sorting and marking duplicates) and went all the way upto calling variants (GVCF mode). now when I try to concat the chromosomes of the same sample I get an error saying it's not sorted (no matter what tool I run vcftools, Picard) and it's particularly when trying to concat chr9 and chr10. it says that 1st position mentioned in chr 10 is before last position of chr9. I manually checked it and it's not the case. however I sorted them (using vcftools) and concatenated (using vcftools). now the concatenation is successful.

Next step was to combing the gvcf file across samples into one for joint calling. I ran combinegvcf and I got this error:

> 20:37:57.292 INFO  ProgressMeter -       chr9:137752186           
> 250.6             858634000        3426981.3 20:38:02.220 INFO  CombineGVCFs - Shutting down engine [October 13, 2024 at 8:38:02 PM
> GMT] org.broadinstitute.hellbender.tools.walkers.CombineGVCFs done.
> Elapsed time: 250.69 minutes. Runtime.totalMemory()=1224736768
> java.lang.IllegalStateException: The elements of the input Iterators
> are not sorted according to the comparator
> htsjdk.variant.variantcontext.VariantContextComparator    at
> htsjdk.samtools.util.MergingIterator.next(MergingIterator.java:107)
>   at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
>   at
> java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1845)
>   at
> java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:509)
>   at
> java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:499)
>   at
> java.base/java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:150)
>   at
> java.base/java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:173)
>   at
> java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at
> java.base/java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:596)
>   at
> org.broadinstitute.hellbender.engine.MultiVariantWalker.traverse(MultiVariantWalker.java:136)
>   at
> org.broadinstitute.hellbender.engine.MultiVariantWalkerGroupedOnStart.traverse(MultiVariantWalkerGroupedOnStart.java:165)
>   at
> org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:1098)
>   at
> org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:149)
>   at
> org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:198)
>   at
> org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:217)
>   at
> org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:166)
>   at org.broadinstitute.hellbender.Main.mainEntry(Main.java:209)  at
> org.broadinstitute.hellbender.Main.main(Main.java:306)\

Again this is only after chr9 in all the samples. Why do I have to sort the combined files again?

I have seen in the logs that the files are usually processed in the order 1,10,11-19,2,20,21-22,3-9,X,Y,M. could the issue be because I am trying to concatenated in the natural order as in 1-22,X,Y M?

Thank you for your time and guidance.

GATK PICARD MergeVcf vcftools • 347 views
ADD COMMENT
0
Entering edit mode

(no matter what tool I run vcftools, Picard)

show us the picard command tool

ADD REPLY
0
Entering edit mode

why calling CombineGVCFs with merged vcfs when you could combine+genotype each chromosome ?

ADD REPLY
0
Entering edit mode

Hi,

My assumption was combine and genotype had to be done for the whole vcf file and so I merged them first. So the joint calling isn't affected if the samples have only specific chromosomes and not all of them?

My pipeline follows this:

Phase 1:

(For each sample in a batch)

  1. Map to reference using BWA-MEM

  2. Mark Duplicates and Sort - SortSam

  3. Mark Duplicates and Sort - MarkDuplicates

  4. Mark Duplicates and Sort - samtools_index

  5. Split BAM by Chromosome & index (using samtools)

    For each chromosome in a sample,

  6. Base Quality Recalibration - BaseRecalibrator

  7. Base Quality Recalibration - ApplyBQSR

  8. Collect Alignment & Insert Size Metrics - CollectAlignmentSummaryMetrics

  9. Collect Alignment & Insert Size Metrics - CollectInsertSizeMetrics

  10. Variant Calling and Filtering - HaplotypeCaller (GVCF mode)


  1. CombineGVCFs per Sample (previously used combinegvcf , gather vcf and all failed with the error like First record in file chr10 is not after first record in previous file chr9)

    so I used vcf tools to sort

>  zcat "$vcf_file" | $vcftools/perl/vcf-sort -c -t "$tmp" | bgzip -c >
> "$sorted_vcf_file"

   and then vcftools to concat

 > $vcftools/perl/vcf-concat $gvcf_sorted_inputs >
 > "$results/${sample_name}_combined.g.vcf"

Phase 2:

  1. CombineGVCFs - combine all sample gvcfs in the batch into one:
  > $gatk CombineGVCFs -R ${ref} $VARIANT_ARGS -O
  > ${results_comb}/$batch_name"_GATK_combined.g.vcf"
  1. GenotypeGVCFs
  2. VariantRecalibrator (SNPs)
  3. ApplyVQSR (SNPs)
  4. VariantRecalibrator (INDELs)
  5. ApplyVQSR (INDELs)
  6. SelectVariants (extract SNPs and INDELS)
  7. VariantFiltration (SNPs)
  8. VariantFiltration (INDELs)
  9. MergeVcfs (SNPs and INDELs)
  10. bcftools (split vcf files per sample)

Regarding the error,

I believe the contig order is wrong in one of my samples. For example when I run gatk sort sam (same command for all my samples),

  $gatk SortVcf \
        -I "$file" \
        -O "$sorted_file" \
        --CREATE_INDEX true

I can see in the std output that its reading is in the order 1,10-19,2,20-22,X,Y as usual but for only one sample it's reading 1,2,3.. and so on. why would this happen?

The only difference I noticed between the samples was when generating fastq files from the original cram file I changed my command for this particular sample in question where this specific fastq file was generated having reads without the read direction indicator such as /1 or /2 . this should not be an issue at all since there is no question of order before mapping right?

ADD REPLY
1
Entering edit mode

and then vcftools to concat

vcftools is deprecated for this task. Use bcftools.

ADD REPLY
1
Entering edit mode

I would remove the sample with a probkem and re-rerun everything for this sample.

Furthemore, you should use a workflow manager like snakemake or nextflow.

ADD REPLY
0
Entering edit mode

Thank you for that. currently have a master script that submits jobs to LSF for each step. I will try to shift to a workflow manager. And yes, I am currently removing the sample and have run it to see if that is the case.

ADD REPLY
1
Entering edit mode
ADD REPLY
0
Entering edit mode

My assumption was combine and genotype had to be done for the whole vcf file and so I merged them first.

no. imagine you're interested in calling only one gene. The other contigs are unrelated to your gene.

You'll need to merge everything if you use VQSR

ADD REPLY
1
Entering edit mode

Split BAM by Chromosome & index (using samtools)

this is useless, you can use option -L of gatk to call only one chromosome.

ADD REPLY
0
Entering edit mode

the sole reason to split the chromosomes is to save time. the steps are run in parallel for all chromosomes before combining. its saving about ~25 hrs. In reality, the og pipeline ran the entire bam file without intervals.

ADD REPLY
1
Entering edit mode

the sole reason to split the chromosomes is to save time.

you don't save time when creating a new bam when GATK can use intervals.

ADD REPLY
0
Entering edit mode

Yeah, I am using VQSR.The pipeline is expected to find rare germline variants across the whole genome. Currently not looking for anything specific. Will be adding annotations and rankers to filter possible pathogenic variants.

ADD REPLY

Login before adding your answer.

Traffic: 1119 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6