I am working on a human genome short variant detection pipeline (to g.vcf) that starts with paired fastq files (WGS). The pipeline fails on the Haplotypecaller:
A USER ERROR has occurred: Argument --sample_name has a bad value: Specified name does not exist in input bam files
This sample name is a wildcard in the snakemake pipeline and will be the same string for many of the rules.
However, upon closer inspection, the recalibrated input bam file is entirely too small (~48K) and I believe the error(s) occur upstream of Haplotypecaller. I have tried using a different reference, but each run results in the same sized file.
Below is a list of all called rules & its affiliated command. My questions/concerns are listed inline. Please note I am using hg38 as a reference and you can assume that all reference files have already been indexed.
rule align_pfastq:
/opt/conda/bin/bwa mem -t 1 -k 100000000 -v 3 -Y ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz data/raw/fastq/samplename_R1.fastq.gz data/raw/fastq/samplename_R2.fastq.gz
rule sammy_sort:
java -Xmx5g -Xms5g -jar /usr/picard/picard.jar SortSam INPUT=data/endpoints/sam/wgs/hg38/samplename.sam OUTPUT=data/endpoints/sam/wgs/hg38/sorted_samplename.sam SORT_ORDER=coordinate VALIDATION_STRINGENCY=SILENT VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
rule add_readgroups:
java -Xmx5g -Xms5g -jar /usr/picard/picard.jar AddOrReplaceReadGroups INPUT=data/endpoints/sam/wgs/hg38/sorted_samplename.sam OUTPUT=data/endpoints/sam/wgs/hg38/readgroups_sorted_samplename.sam SORT_ORDER=coordinate RGID=CBV1EANXX.2.89 RGLB=89 RGPL=illumina RGPU=CBV1EANXX.2.89 RGSM=samplename CREATE_INDEX=true VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
NOTE: The only information about the run I can get is what I find in the fastq files. Here is a header for one of the sequences: @D00687:89:CBV1EANXX:2:2201:1088:2049 1:N:0:ACTATGCA
Based on this, the rgid and rgpu will have the same value of CBV1EANXX.2.89
(instrument name.flow cell lane.run id) and since I don't have and cannot get the DNA preparation library identifier, this value has been assigned of 89
run.id). I am not asserting that the run id is the best replacement for the library identifier, but I have failed to come up with a better option. The rgsm
value is the sample name that can be found in the output of the bam file in the mark_dups
rule.
rule mark_dups:
java -Xmx5g -Xms5g -jar /usr/picard/picard.jar MarkDuplicates INPUT=[data/endpoints/sam/wgs/hg38/readgroups_sorted_samplename.sam] OUTPUT=data/endpoints/bam/wgs/hg38/marked_dups_samplename.bam METRICS_FILE=reports/files/marked_dup_metrics_samplename.txt VALIDATION_STRINGENCY=SILENT MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4G H_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
rule recalibrate:
java -Xmx3g -Xms3g -jar /gatk/gatk-package-4.1.3.0-local.jar BaseRecalibrator -I data/endpoints/bam/wgs/hg38/marked_dups_samplename.bam -R ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz --known-sites ref-data/broad/ftp.broadinstitute.org/bundle/hg38/resources_broad_hg38_v0_Homo_sapiens_assembly38.dbsnp138.vcf -O reports/files/recal_data_samplename.table 2> logs/alignment/recalibrated_samplename.gatk.log
Looking at the marked duplicate bam file, there is one line the an @RG tag: @RG ID:CBV1EANXX.2.89 LB:89 PL:illumina SM:[samplename] PU:CBV1EANXX.2.89
rule ApplyBQSR:
java -Xmx3g -Xms3g -jar /gatk/gatk-package-4.1.3.0-local.jar ApplyBQSR -R ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -I data/endpoints/sam/wgs/hg38/sorted_samplename.sam --bqsr-recal-file reports/files/recal_data_samplename.table -O data/endpoints/recal_bam/wgs/hg38/recalibrated_samplename.bam 2> logs/alignment/BQSR_samplename.gatk.log
I know there has to be a problem here because the output file is only 48K and there are no tags that commence with @RG
(samtools view -H recalibrated_sample_name.bam | grep '@RG'), but I am at a loss as to why this is occurring and where I turned _left at Albuquerque_. Any assistance will be greatly appreciated.
rule HaplotypeCaller:
java -Xmx4g -Xms4g -jar /gatk/gatk-package-4.1.3.0-local.jar HaplotypeCaller -R ref-data/broad/ftp.broadinstitute.org/bundle/hg38/Homo_sapiens_assembly38.fasta.gz -I data/endpoints/recal_bam/wgs/hg38/recalibrated_samplename.bam -O data/endpoints/gvcfs/wgs/hg38/samplename.g.vcf.gz -ERC GVCF --sample-name samplename 2> logs/alignment/haplotypecaller_samplename.gatk.log
Again, any assistance will be greatly appreciated.