Hello,
I am trying to generate a .vcf file and call variants in RNA-Seq data, as per the GATK3 tutorial: https://software.broadinstitute.org/gatk/documentation/article.php?id=3891
I can get all the way through making a filtered .vcf file, but the resulting file has no entries - i.e., the .vcf file is only 47 lines, and all of that is header information. The columns are empty. The unfiltered .vcf file in the HaplotypeCaller is also empty.
My questions are:
- First - as a sanity check - this is an unlikely outcome, right? It's not that my RNA-Seq coverage was too shallow (~30,000,000 paired end reads) and therefore we are unable to detect any variants?
- Second - if we should, indeed, see some variants in that file - since it's hard to tell if there are variants at all until you actually generate a .vcf file (i.e., you can't look in a bam file and decide that variants exist) - what parameters should I tweak in order to try and get variants in my final .vcf file? I'm not sure what intermediate steps I can look at to troubleshoot where in the pipeline things are going wrong.
I didn't perform Indel Realignment (step #4) or base recalibration (step #5) because it was unclear to me from the tutorial if this was necessary for RNA-Seq. Right before step 6, I also had to use ReorderSam and BuildBamIndex in order to make the intermediate bam files generated match the ordering of the .fa file I am using as an index. I'll work on getting base recalibration working next, but if anyone has additional suggestions or answers to the above, I'd appreciate it.
Thank you for your help!
Kristin
Hi Kristin, I really just have a lot of reservations about that RNA-seq variant calling pipeline, which I outline here: A: Inferring genotype based on RNA sequnces
I would suggest to your lab to do Exome-seq.
Ooph, thank you for bringing this to my attention. I wonder if GATK+RNA-Seq might be useful in our case because we are using this .vcf for demultiplexing of scRNA-Seq data: https://www.nature.com/articles/nbt.4042
We only need to identify ~50 variants reliably for this case, and since we're generating variants from bulk RNA-Seq to support a demultiplexing based on a different RNA-Seq-based methodology, my hope is that it might be more similar than going from genome to exome...? Certainly alarmed by your post, though, and will talk to my lab about what else we can do.
Oh, don't worry, I alarm everybody.
I see.. you're using the technique not just for calling variants but for cell / sample identity. Provided you followed the protocol as per the docs, I see no reason why you would not detect the variants, apart from the fact that the Broad Institute admit quite openly that their pipeline has issues:
...and was only tested on a single sample:
So, on face value, this is a World renowned research institute putting an unproven pipeline/method on their web domain. This is my main gripe with it to date.
That is a serious gripe! I'm especially concerned if flaws would lead to issues with the demultiplexing. We should have a couple of methods for testing how good our demultiplexing ends up being, but we'll keep this in mind when considering our sanity checks. Thank you for pointing this out.
30M reads is not too shallow. You should be able to get variants with less than 1M reads. I am assuming we are talking about human/mouse.
You should post the HaplotypeCaller command and output. Were there any errors or warnings? GATK tools in general are very good about reporting any issues.
Sure. The command I used for the HaplotypeCaller was:
The commands leading up to this HaplotypeCaller command, starting from the two-pass STAR aligned file, was:
The output is long, so here are the lines that were anything other than "Successfully loaded .bam" file or a printed report ("INFO") on a setting:
And the total compute time of 0 looks suspicious to me:
That all seems fine. The part that's a little suspicious to me is "reorder sam to match the .fa file". Were different reference FASTA files was used for different steps? GATK can be picky about details like this.
Hmm, everything should have been done using the same .fa file (GRCh38 downloaded as described here) (**EDIT: I modified this to use release 95). I added that line in because I kept getting a "Dictionary cannot have size zero" error on the HaplotypeCaller step. According to the internet, it looked like this might be because of a difference in sorting between split.sam and the .fa file. Indeed, when I looked at split.sam, the header read:
While meanwhile the GRCh38_r95.all.dict generated by one of the picard commands (CreateSequenceDictionary) lacked the "SO:" field in the first line completely, and had VN: 1.6. I took this to mean "sort order" and introduced the ReorderSam step to match the fasta, and this allowed the HaplotypeCaller to run.
Update: I have rerun my whole pipeline and re-derived the first/second pass alignment using STAR in the manner recommended on GATK best practices. I am increasingly suspicious that the issue comes down to reconciling the sort order of the .fa file, which I downloaded from GRCh38, and the coordinate-based sort order imposed in this line:
It's unclear to me why there isn't a step in this tutorial that explicitly re-orders the .fa, or why we need to do a coordinate-based sort in the first place.
Does anyone have experience in getting rid of the "Dictionary cannot have size zero" error? It doesn't seem possible to reorder by coordinate in the CreateSequenceDictionary command.
How did you create the STAR indices?
The first STAR index was generated from the downloaded .fa using:
And this was used to do the first pass alignment:
And that output was used to generate the second pass index:
And finally, the second pass alignment:
At some point, I also ran these commands:
Hey, the first thing that I noted was that you don't use the same STAR indices that you created (?). In the first command, you specify
--genomeDir $outputDir
; then, on the second, you specify--genomeDir $idx1Loc
.Also, I am not sure about the usage of this,
--outFileNamePrefix $outputDir/
, in the first command. I think that this should just be some string that is then pre-pended to the names of the index files (?)The third command also looks different from those in the tutorial: https://software.broadinstitute.org/gatk/documentation/article.php?id=3891
Thank you for taking the time to help with this! This pipeline has been a doozy of a troubleshoot.
Regarding $outputDir vs. $idx1Loc: My apologies, these variables both refer to the same path (/resources/KM_GRCh38/star_indices_gatkTutorial_firstPass_STAR_2_7_0d_noGtf) - I ran these in separate scripts and forgot to correct the variable names when I pasted them here.
On the --outFileNamePrefix: Hmm, you're right! I don't remember why I chose to include this, but I'll take it out.
Regarding the third line - if I'm reading this correctly, the only difference is ${SLURM_NPROCS:-1}. Apologies - I should have edited this as well - that is a variable related to our particular job queue system that will assign a number of threads based on what is available at the moment. I could have also replaced this variable with 4. If I do so, the command looks like the one presented in the tutorial:
I'll rerun everything with these tweaks (runThreadN = 4, removing OutfileNamePrefix) in case that affects anything.
I have used GATK with a few different species and despite many FASTA-related errors, I have never needed to re-order it.
My .dict files were generated with Picard CreateSequenceDictionary and all have "SO:unsorted", so it's odd that yours are missing that.
Yeah, I just tried it again and there is no "SO" flag in the .dict file I produced using the code I posted above:
May I ask what version of Picard you are using? I'm using picard/2.18.29. I wonder if that might affect what flags get added. The code I used to generate the .dict file is in my comment to Kevin Blighe above.
EDIT: It looks like the version of picard I am using should be pretty close to the latest release, yet I notice GATK4 (which is apparently not used for RNA-Seq variant calling at all) includes a version of picard that has additional CreateSequenceDictionary flags including one to create a sorted file. Is this what people are using? Do you have to use the picard installation specifically that comes with GATK, not simply the picard toolset downloaded separately?
EDIT 2: Aha! A clue. When I use picard/1.141, and the exact same code, I can create a dict file with the "SO" flag. This still throws the "Dictionary cannot have size zero" error, though, because this header is "SO:unsorted", whereas the split.bam file is "SO:coordinate" - and I'm not sure how to make the coordinate-sorted fasta in a way that GATK3 likes:
I've used "SO:unsorted" dict files with GATK 3 and 4. It has never been an issue. There has to be some other problem.
For example, Kevin already pointed out some inconsistencies in your STAR commands.
Hmm, that's good to know it's unlikely to be the SO:unsorted dict files. Thank you so much for providing feedback on this - tremendously helpful to have a sense for what worked and didn't for other people.