Question

GATK failing to find any variants - how to troubleshoot?

1

Entering edit mode

5.7 years ago

Kristin Muench ▴ 640

Hello,

I am trying to generate a .vcf file and call variants in RNA-Seq data, as per the GATK3 tutorial: https://software.broadinstitute.org/gatk/documentation/article.php?id=3891

I can get all the way through making a filtered .vcf file, but the resulting file has no entries - i.e., the .vcf file is only 47 lines, and all of that is header information. The columns are empty. The unfiltered .vcf file in the HaplotypeCaller is also empty.

My questions are:

First - as a sanity check - this is an unlikely outcome, right? It's not that my RNA-Seq coverage was too shallow (~30,000,000 paired end reads) and therefore we are unable to detect any variants?
Second - if we should, indeed, see some variants in that file - since it's hard to tell if there are variants at all until you actually generate a .vcf file (i.e., you can't look in a bam file and decide that variants exist) - what parameters should I tweak in order to try and get variants in my final .vcf file? I'm not sure what intermediate steps I can look at to troubleshoot where in the pipeline things are going wrong.

I didn't perform Indel Realignment (step #4) or base recalibration (step #5) because it was unclear to me from the tutorial if this was necessary for RNA-Seq. Right before step 6, I also had to use ReorderSam and BuildBamIndex in order to make the intermediate bam files generated match the ordering of the .fa file I am using as an index. I'll work on getting base recalibration working next, but if anyone has additional suggestions or answers to the above, I'd appreciate it.

Thank you for your help!

Kristin

GATK RNA-Seq • 4.1k views

ADD COMMENT • link 5.7 years ago by Kristin Muench ▴ 640

1

Entering edit mode

Hi Kristin, I really just have a lot of reservations about that RNA-seq variant calling pipeline, which I outline here: A: Inferring genotype based on RNA sequnces

I would suggest to your lab to do Exome-seq.

ADD REPLY • link 5.7 years ago by Kevin Blighe 88k

0

Entering edit mode

Ooph, thank you for bringing this to my attention. I wonder if GATK+RNA-Seq might be useful in our case because we are using this .vcf for demultiplexing of scRNA-Seq data: https://www.nature.com/articles/nbt.4042

We only need to identify ~50 variants reliably for this case, and since we're generating variants from bulk RNA-Seq to support a demultiplexing based on a different RNA-Seq-based methodology, my hope is that it might be more similar than going from genome to exome...? Certainly alarmed by your post, though, and will talk to my lab about what else we can do.

ADD REPLY • link 5.7 years ago by Kristin Muench ▴ 640

1

Entering edit mode

Oh, don't worry, I alarm everybody.

I see.. you're using the technique not just for calling variants but for cell / sample identity. Provided you followed the protocol as per the docs, I see no reason why you would not detect the variants, apart from the fact that the Broad Institute admit quite openly that their pipeline has issues:

Finally, we know that the current recommended pipeline is producing both false positives (wrong variant calls) and false negatives (missed variants) errors. While some of those errors are inevitable in any pipeline, others are errors that we can and will address in future versions of the pipeline.

...and was only tested on a single sample:

In addition, we have currently worked only on data from one tissue from one individual.

So, on face value, this is a World renowned research institute putting an unproven pipeline/method on their web domain. This is my main gripe with it to date.

ADD REPLY • link 5.7 years ago by Kevin Blighe 88k

1

Entering edit mode

That is a serious gripe! I'm especially concerned if flaws would lead to issues with the demultiplexing. We should have a couple of methods for testing how good our demultiplexing ends up being, but we'll keep this in mind when considering our sanity checks. Thank you for pointing this out.

ADD REPLY • link 5.7 years ago by Kristin Muench ▴ 640

1

Entering edit mode

30M reads is not too shallow. You should be able to get variants with less than 1M reads. I am assuming we are talking about human/mouse.

You should post the HaplotypeCaller command and output. Were there any errors or warnings? GATK tools in general are very good about reporting any issues.

ADD REPLY • link 5.7 years ago by igor 13k

0

Entering edit mode

Sure. The command I used for the HaplotypeCaller was:

java -jar $gatk -T HaplotypeCaller -R $refFasta -I "$outputBamLoc/split_reordered.bam" -dontUseSoftClippedBases -stand_call_conf 20.0 -o output.vcf

The commands leading up to this HaplotypeCaller command, starting from the two-pass STAR aligned file, was:

#2.Add read groups, sort, mark duplicates, and create index
picard AddOrReplaceReadGroups I="$outputBamLoc/Aligned.out.sam" O="$outputBamLoc/rg_added_sorted.bam" SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample CREATE_INDEX=true

picard MarkDuplicates I="$outputBamLoc/rg_added_sorted.bam" O="$outputBamLoc/rg_dedupped.bam"  CREATE_INDEX=true VALIDATION_STRINGENCY=SILENT M=output.metrics 

# 3. Split'N'Trim and reassign mapping qualities
java -jar $gatk -T SplitNCigarReads \
   -R $refFasta \
   -I "$outputBamLoc/rg_dedupped.bam" \
   -o "$outputBamLoc/split.bam" \
   -rf ReassignOneMappingQuality \
   -RMQF 255 -RMQT 60 -U ALLOW_N_CIGAR_READS

samtools view -h "$outputBamLoc/split.bam" > "$outputBamLoc/split.sam"

# Skipping steps 4 and 5 in best practices

# Step I added in: reorder sam to match the .fa file so the next step doesn't throw an error
picard ReorderSam \
INPUT="$outputBamLoc/split.bam" \
OUTPUT="$outputBamLoc/split_reordered.bam" \
REFERENCE=$refFasta

## build new index
picard BuildBamIndex I="$outputBamLoc/split_reordered.bam"

The output is long, so here are the lines that were anything other than "Successfully loaded .bam" file or a printed report ("INFO") on a setting:

...
WARN  09:55:55,232 InbreedingCoeff - Annotation will not be calculated. InbreedingCoeff requires at least 10 unrelated samples. 
...
WARN  10:14:45,889 PairHMMLikelihoodCalculationEngine$1 - OpenMP multi-threaded AVX-accelerated native PairHMM implementation is not supported 
WARN  10:14:45,889 PairHMMLikelihoodCalculationEngine$1 - AVX-accelerated native PairHMM implementation is not supported. Falling back to slower LOGLESS_CACHING implementation

And the total compute time of 0 looks suspicious to me:

...
INFO  10:14:45,890 PairHMM - Total compute time in PairHMM computeLikelihoods() : 0.0 
INFO  10:14:45,890 HaplotypeCaller - Ran local assembly on 0 active regions 
INFO  10:14:45,911 ProgressMeter -            done    3.088286401E9    18.8 m            0.0 s      100.0%    18.8 m       0.0 s 
INFO  10:14:45,911 ProgressMeter - Total runtime 1130.74 secs, 18.85 min, 0.31 hours

ADD REPLY • link 5.7 years ago by Kristin Muench ▴ 640

0

Entering edit mode

That all seems fine. The part that's a little suspicious to me is "reorder sam to match the .fa file". Were different reference FASTA files was used for different steps? GATK can be picky about details like this.

ADD REPLY • link 5.7 years ago by igor 13k

0

Entering edit mode

Hmm, everything should have been done using the same .fa file (GRCh38 downloaded as described here) (**EDIT: I modified this to use release 95). I added that line in because I kept getting a "Dictionary cannot have size zero" error on the HaplotypeCaller step. According to the internet, it looked like this might be because of a difference in sorting between split.sam and the .fa file. Indeed, when I looked at split.sam, the header read:

@HD VN:1.5 GO:none SO:coordinate
@RG ID:id LB:library PL:platform SM:sample PU:machine@PG ID:GATK SplitNCigarReads VN:3.8-1-0-gf15c1c3ef CL:maxReadsInMemory=150000 maxMismatchesInOverhang=1 maxBasesInOverhang=40 doNotFixOverhangs=false no_pg_tag=false

While meanwhile the GRCh38_r95.all.dict generated by one of the picard commands (CreateSequenceDictionary) lacked the "SO:" field in the first line completely, and had VN: 1.6. I took this to mean "sort order" and introduced the ReorderSam step to match the fasta, and this allowed the HaplotypeCaller to run.

ADD REPLY • link 5.7 years ago by Kristin Muench ▴ 640

0

Entering edit mode

Update: I have rerun my whole pipeline and re-derived the first/second pass alignment using STAR in the manner recommended on GATK best practices. I am increasingly suspicious that the issue comes down to reconciling the sort order of the .fa file, which I downloaded from GRCh38, and the coordinate-based sort order imposed in this line:

java -jar picard.jar AddOrReplaceReadGroups I=star_output.sam O=rg_added_sorted.bam SO=coordinate RGID=id RGLB=library RGPL=platform RGPU=machine RGSM=sample

It's unclear to me why there isn't a step in this tutorial that explicitly re-orders the .fa, or why we need to do a coordinate-based sort in the first place.

Does anyone have experience in getting rid of the "Dictionary cannot have size zero" error? It doesn't seem possible to reorder by coordinate in the CreateSequenceDictionary command.

ADD REPLY • link 5.7 years ago by Kristin Muench ▴ 640

0

Entering edit mode

How did you create the STAR indices?

ADD REPLY • link 5.7 years ago by Kevin Blighe 88k

0

Entering edit mode

The first STAR index was generated from the downloaded .fa using:

STAR --runMode genomeGenerate --genomeDir $outputDir --genomeFastaFiles $fastaDir --runThreadN 4 --outFileNamePrefix $outputDir/

And this was used to do the first pass alignment:

STAR --genomeDir $idx1Loc --readFilesIn "$sample$r1_end" "$sample$r2_end" --runThreadN ${SLURM_NPROCS:-1}

And that output was used to generate the second pass index:

STAR --runMode genomeGenerate --genomeDir "$outputLoc/$samp_ID/$align2Loc" --genomeFastaFiles $fastaDir \
 --sjdbFileChrStartEnd $pathToSjdb --sjdbOverhang 75 --runThreadN ${SLURM_NPROCS:-1}

And finally, the second pass alignment:

STAR --genomeDir "$outputLoc/$samp_ID/$align2Loc" --readFilesIn "$sample$r1_end" "$sample$r2_end" --runThreadN ${SLURM_NPROCS:-1}

At some point, I also ran these commands:

# create sequence directory
cd $fastaDir

picard CreateSequenceDictionary R= GRCh38_r95.all.fa O= GRCh38_r95.all.dict

# create the fasta index
## faidx manual page: http://www.htslib.org/doc/faidx.html
samtools faidx GRCh38_r95.all.fa

ADD REPLY • link 5.7 years ago by Kristin Muench ▴ 640

0

Entering edit mode

Hey, the first thing that I noted was that you don't use the same STAR indices that you created (?). In the first command, you specify --genomeDir $outputDir; then, on the second, you specify --genomeDir $idx1Loc.

Also, I am not sure about the usage of this, --outFileNamePrefix $outputDir/, in the first command. I think that this should just be some string that is then pre-pended to the names of the index files (?)

The third command also looks different from those in the tutorial: https://software.broadinstitute.org/gatk/documentation/article.php?id=3891

ADD REPLY • link 5.7 years ago by Kevin Blighe 88k

0

Entering edit mode

Thank you for taking the time to help with this! This pipeline has been a doozy of a troubleshoot.

Regarding $outputDir vs. $idx1Loc: My apologies, these variables both refer to the same path (/resources/KM_GRCh38/star_indices_gatkTutorial_firstPass_STAR_2_7_0d_noGtf) - I ran these in separate scripts and forgot to correct the variable names when I pasted them here.

On the --outFileNamePrefix: Hmm, you're right! I don't remember why I chose to include this, but I'll take it out.

Regarding the third line - if I'm reading this correctly, the only difference is ${SLURM_NPROCS:-1}. Apologies - I should have edited this as well - that is a variable related to our particular job queue system that will assign a number of threads based on what is available at the moment. I could have also replaced this variable with 4. If I do so, the command looks like the one presented in the tutorial:

STAR --runMode genomeGenerate --genomeDir "$outputLoc/$samp_ID/$align2Loc" --genomeFastaFiles $fastaDir --sjdbFileChrStartEnd $pathToSjdb --sjdbOverhang 75 --runThreadN 4

I'll rerun everything with these tweaks (runThreadN = 4, removing OutfileNamePrefix) in case that affects anything.

ADD REPLY • link 5.7 years ago by Kristin Muench ▴ 640

0

Entering edit mode

I have used GATK with a few different species and despite many FASTA-related errors, I have never needed to re-order it.

My .dict files were generated with Picard CreateSequenceDictionary and all have "SO:unsorted", so it's odd that yours are missing that.

ADD REPLY • link 5.7 years ago by igor 13k

0

Entering edit mode

Yeah, I just tried it again and there is no "SO" flag in the .dict file I produced using the code I posted above:

@HD     VN:1.6
@SQ     SN:10   LN:133797422    M5:907112d17fcb73bcab1ed1c72b97ce68     UR:file:/resources/KM_GRCh38/sequence/GRCh38_r95.all.fa
@SQ     SN:11   LN:135086622    M5:1511375dc2dd1b633af8cf439ae90cec     UR:file:/resources/KM_GRCh38/sequence/GRCh38_r95.all.fa

May I ask what version of Picard you are using? I'm using picard/2.18.29. I wonder if that might affect what flags get added. The code I used to generate the .dict file is in my comment to Kevin Blighe above.

EDIT: It looks like the version of picard I am using should be pretty close to the latest release, yet I notice GATK4 (which is apparently not used for RNA-Seq variant calling at all) includes a version of picard that has additional CreateSequenceDictionary flags including one to create a sorted file. Is this what people are using? Do you have to use the picard installation specifically that comes with GATK, not simply the picard toolset downloaded separately?

EDIT 2: Aha! A clue. When I use picard/1.141, and the exact same code, I can create a dict file with the "SO" flag. This still throws the "Dictionary cannot have size zero" error, though, because this header is "SO:unsorted", whereas the split.bam file is "SO:coordinate" - and I'm not sure how to make the coordinate-sorted fasta in a way that GATK3 likes:

@HD     VN:1.5  SO:unsorted
@SQ     SN:10   LN:133797422    M5:907112d17fcb73bcab1ed1c72b97ce68     UR:file:/resources/KM_GRCh38/sequence/GRCh38_r95.all.fa
@SQ     SN:11   LN:135086622    M5:1511375dc2dd1b633af8cf439ae90cec     UR:file:/resources/KM_GRCh38/sequence/GRCh38_r95.all.fa

ADD REPLY • link 5.7 years ago by Kristin Muench ▴ 640

1

Entering edit mode

I've used "SO:unsorted" dict files with GATK 3 and 4. It has never been an issue. There has to be some other problem.

For example, Kevin already pointed out some inconsistencies in your STAR commands.

ADD REPLY • link 5.7 years ago by igor 13k

0

Entering edit mode

Hmm, that's good to know it's unlikely to be the SO:unsorted dict files. Thank you so much for providing feedback on this - tremendously helpful to have a sense for what worked and didn't for other people.

ADD REPLY • link 5.7 years ago by Kristin Muench ▴ 640

score 1 · Accepted Answer · 2019-03-31

Thank you everyone for providing valuable feedback that helped me narrow down the world of what could possibly be going wrong and identifying the key issue.

Ultimately, the problem was that I hadn't noticed that at some point the second alignment had failed, and I was trying to run all these commands on an empty .bam file. Thus - the problem wasn't the .dict at all - but the files that were being used for calling.

I'm a little confused about how I missed this because I had been using samtools to take a peek at every file and make sure they looked fine - but this troubleshooting process has taken so many iterations that I must have missed something.

I was inspired by this post, where someone encounters a similar issue: https://gatkforums.broadinstitute.org/gatk/discussion/10130/gatk-3-8-dictionary-cannot-have-size-zero

Now that I went back and make sure the first and second alignment steps produced .bam files, I do not get the .dict error, and my .vcf files have variants written in them.

Thank you all!