Recalibrated bam file is too small (~48K), fails on haplotypecaller
0
0
Entering edit mode
5.2 years ago
Cricket ▴ 10

I am working on a human genome short variant detection pipeline (to g.vcf) that starts with paired fastq files (WGS). The pipeline fails on the Haplotypecaller:

A USER ERROR has occurred: Argument --sample_name has a bad value: Specified name does not exist in input bam files This sample name is a wildcard in the snakemake pipeline and will be the same string for many of the rules.

However, upon closer inspection, the recalibrated input bam file is entirely too small (~48K) and I believe the error(s) occur upstream of Haplotypecaller. I have tried using a different reference, but each run results in the same sized file.

Below is a list of all called rules & its affiliated command. My questions/concerns are listed inline. Please note I am using hg38 as a reference and you can assume that all reference files have already been indexed.

rule align_pfastq:
     /opt/conda/bin/bwa mem -t 1 -k 100000000 -v 3 -Y ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz data/raw/fastq/samplename_R1.fastq.gz data/raw/fastq/samplename_R2.fastq.gz

rule sammy_sort:
     java -Xmx5g -Xms5g -jar /usr/picard/picard.jar SortSam INPUT=data/endpoints/sam/wgs/hg38/samplename.sam OUTPUT=data/endpoints/sam/wgs/hg38/sorted_samplename.sam SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

rule add_readgroups:
     java -Xmx5g -Xms5g -jar /usr/picard/picard.jar AddOrReplaceReadGroups INPUT=data/endpoints/sam/wgs/hg38/sorted_samplename.sam OUTPUT=data/endpoints/sam/wgs/hg38/readgroups_sorted_samplename.sam SORT_ORDER=coordinate RGID=CBV1EANXX.2.89 RGLB=89 RGPL=illumina RGPU=CBV1EANXX.2.89 RGSM=samplename CREATE_INDEX=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

NOTE: The only information about the run I can get is what I find in the fastq files. Here is a header for one of the sequences: @D00687:89:CBV1EANXX:2:2201:1088:2049 1:N:0:ACTATGCA

Based on this, the rgid and rgpu will have the same value of CBV1EANXX.2.89 (instrument name.flow cell lane.run id) and since I don't have and cannot get the DNA preparation library identifier, this value has been assigned of 89 run.id). I am not asserting that the run id is the best replacement for the library identifier, but I have failed to come up with a better option. The rgsm value is the sample name that can be found in the output of the bam file in the mark_dups rule.

rule mark_dups:
     java -Xmx5g -Xms5g -jar /usr/picard/picard.jar MarkDuplicates INPUT=[data/endpoints/sam/wgs/hg38/readgroups_sorted_samplename.sam] OUTPUT=data/endpoints/bam/wgs/hg38/marked_dups_samplename.bam METRICS_FILE=reports/files/marked_dup_metrics_samplename.txt VALIDATION_STRINGENCY=SILENT MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4G H_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false

rule recalibrate:
     java -Xmx3g -Xms3g -jar /gatk/gatk-package-4.1.3.0-local.jar BaseRecalibrator -I data/endpoints/bam/wgs/hg38/marked_dups_samplename.bam -R ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz --known-sites ref-data/broad/ftp.broadinstitute.org/bundle/hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf -O reports/files/recal_data_samplename.table 2> logs/alignment/recalibrated_samplename.gatk.log

Looking at the marked duplicate bam file, there is one line the an @RG tag: @RG ID:CBV1EANXX.2.89 LB:89 PL:illumina SM:[samplename] PU:CBV1EANXX.2.89

rule ApplyBQSR:
     java -Xmx3g -Xms3g -jar /gatk/gatk-package-4.1.3.0-local.jar ApplyBQSR -R ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -I data/endpoints/sam/wgs/hg38/sorted_samplename.sam --bqsr-recal-file reports/files/recal_data_samplename.table -O data/endpoints/recal_bam/wgs/hg38/recalibrated_samplename.bam 2> logs/alignment/BQSR_samplename.gatk.log

I know there has to be a problem here because the output file is only 48K and there are no tags that commence with @RG (samtools view -H recalibrated_sample_name.bam | grep '@RG'), but I am at a loss as to why this is occurring and where I turned _left at Albuquerque_. Any assistance will be greatly appreciated.

rule HaplotypeCaller:
     java -Xmx4g -Xms4g -jar /gatk/gatk-package-4.1.3.0-local.jar HaplotypeCaller -R ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -I data/endpoints/recal_bam/wgs/hg38/recalibrated_samplename.bam -O data/endpoints/gvcfs/wgs/hg38/samplename.g.vcf.gz -ERC GVCF --sample-name samplename 2> logs/alignment/haplotypecaller_samplename.gatk.log

Again, any assistance will be greatly appreciated.

gatk ApplyBQSR short variant detection • 1.1k views
ADD COMMENT

Login before adding your answer.

Traffic: 1021 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6