Question

Recalibrated bam file is too small (~48K), fails on haplotypecaller

0

Entering edit mode

5.7 years ago

Cricket ▴ 10

I am working on a human genome short variant detection pipeline (to g.vcf) that starts with paired fastq files (WGS). The pipeline fails on the Haplotypecaller:

A USER ERROR has occurred: Argument --sample_name has a bad value: Specified name does not exist in input bam files This sample name is a wildcard in the snakemake pipeline and will be the same string for many of the rules.

However, upon closer inspection, the recalibrated input bam file is entirely too small (~48K) and I believe the error(s) occur upstream of Haplotypecaller. I have tried using a different reference, but each run results in the same sized file.

Below is a list of all called rules & its affiliated command. My questions/concerns are listed inline. Please note I am using hg38 as a reference and you can assume that all reference files have already been indexed.

rule align_pfastq:
     /opt/conda/bin/bwa mem -t 1 -k 100000000 -v 3 -Y ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz data/raw/fastq/samplename_R1.fastq.gz data/raw/fastq/samplename_R2.fastq.gz

rule sammy_sort:
     java -Xmx5g -Xms5g -jar /usr/picard/picard.jar SortSam INPUT=data/endpoints/sam/wgs/hg38/samplename.sam OUTPUT=data/endpoints/sam/wgs/hg38/sorted_samplename.sam SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

rule add_readgroups:
     java -Xmx5g -Xms5g -jar /usr/picard/picard.jar AddOrReplaceReadGroups INPUT=data/endpoints/sam/wgs/hg38/sorted_samplename.sam OUTPUT=data/endpoints/sam/wgs/hg38/readgroups_sorted_samplename.sam SORT_ORDER=coordinate RGID=CBV1EANXX.2.89 RGLB=89 RGPL=illumina RGPU=CBV1EANXX.2.89 RGSM=samplename CREATE_INDEX=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

NOTE: The only information about the run I can get is what I find in the fastq files. Here is a header for one of the sequences: @D00687:89:CBV1EANXX:2:2201:1088:2049 1:N:0:ACTATGCA

Based on this, the rgid and rgpu will have the same value of CBV1EANXX.2.89 (instrument name.flow cell lane.run id) and since I don't have and cannot get the DNA preparation library identifier, this value has been assigned of 89 run.id). I am not asserting that the run id is the best replacement for the library identifier, but I have failed to come up with a better option. The rgsm value is the sample name that can be found in the output of the bam file in the mark_dups rule.

rule mark_dups:
     java -Xmx5g -Xms5g -jar /usr/picard/picard.jar MarkDuplicates INPUT=[data/endpoints/sam/wgs/hg38/readgroups_sorted_samplename.sam] OUTPUT=data/endpoints/bam/wgs/hg38/marked_dups_samplename.bam METRICS_FILE=reports/files/marked_dup_metrics_samplename.txt VALIDATION_STRINGENCY=SILENT MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4G H_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

rule recalibrate:
     java -Xmx3g -Xms3g -jar /gatk/gatk-package-4.1.3.0-local.jar BaseRecalibrator -I data/endpoints/bam/wgs/hg38/marked_dups_samplename.bam -R ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz --known-sites ref-data/broad/ftp.broadinstitute.org/bundle/hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf -O reports/files/recal_data_samplename.table 2> logs/alignment/recalibrated_samplename.gatk.log

Looking at the marked duplicate bam file, there is one line the an @RG tag: @RG ID:CBV1EANXX.2.89 LB:89 PL:illumina SM:[samplename] PU:CBV1EANXX.2.89

rule ApplyBQSR:
     java -Xmx3g -Xms3g -jar /gatk/gatk-package-4.1.3.0-local.jar ApplyBQSR -R ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -I data/endpoints/sam/wgs/hg38/sorted_samplename.sam --bqsr-recal-file reports/files/recal_data_samplename.table -O data/endpoints/recal_bam/wgs/hg38/recalibrated_samplename.bam 2> logs/alignment/BQSR_samplename.gatk.log

I know there has to be a problem here because the output file is only 48K and there are no tags that commence with @RG (samtools view -H recalibrated_sample_name.bam | grep '@RG'), but I am at a loss as to why this is occurring and where I turned _left at Albuquerque_. Any assistance will be greatly appreciated.

rule HaplotypeCaller:
     java -Xmx4g -Xms4g -jar /gatk/gatk-package-4.1.3.0-local.jar HaplotypeCaller -R ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -I data/endpoints/recal_bam/wgs/hg38/recalibrated_samplename.bam -O data/endpoints/gvcfs/wgs/hg38/samplename.g.vcf.gz -ERC GVCF --sample-name samplename 2> logs/alignment/haplotypecaller_samplename.gatk.log

Again, any assistance will be greatly appreciated.

gatk ApplyBQSR short variant detection • 1.2k views

ADD COMMENT • link 5.7 years ago by Cricket ▴ 10