Skip to the last paragraph to go straight to the question. I am interested in looking at variants from an RNA-seq sample of GM12878 and comparing them to variants found by GIAB. The basic outline of my pipeline is the following:
1) Trim reads with ngShoRT.
2) Align reads with STAR to the reference genome. Keep only uniquely mapped reads.
3) Call SNPs with samtools/bcftools.
About 60% of the discovered variants match up with GIAB (is that a reasonable amount with just 1 sample?).
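For reference, the ~60% figure above is the fraction of called SNPs that also appear in the GIAB set. A minimal sketch of that kind of comparison is below. It assumes both call sets are available as VCF-formatted text lines with matching chromosome names, and that a match means identical chromosome, position, REF and ALT; the function names load_snps and concordance are made up for illustration, and the records are toy data, not real calls.

```python
def load_snps(vcf_lines):
    """Collect (chrom, pos, ref, alt) keys for SNPs from VCF-formatted lines."""
    snps = set()
    for line in vcf_lines:
        # Skip blank lines and VCF header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3].upper()
        # A record may list several ALT alleles separated by commas.
        for alt in fields[4].upper().split(","):
            # Keep single-base substitutions only (drops indels, '.', '*').
            if len(ref) == 1 and len(alt) == 1 and alt in "ACGT":
                snps.add((chrom, pos, ref, alt))
    return snps


def concordance(called, truth):
    """Fraction of called SNPs that are also present in the truth set."""
    if not called:
        return 0.0
    return len(called & truth) / len(called)


# Toy records (hypothetical): CHROM, POS, ID, REF, ALT
called_lines = [
    "\t".join(["chr1", "100", ".", "A", "G"]),
    "\t".join(["chr1", "200", ".", "C", "T"]),
    "\t".join(["chr2", "300", ".", "G", "A,C"]),
    "\t".join(["chr2", "400", ".", "AT", "A"]),  # indel, ignored
]
truth_lines = [
    "\t".join(["chr1", "100", ".", "A", "G"]),
    "\t".join(["chr2", "300", ".", "G", "A"]),
    "\t".join(["chr3", "500", ".", "T", "C"]),
]

called = load_snps(called_lines)
truth = load_snps(truth_lines)
print(f"{concordance(called, truth):.0%} of called SNPs are in the truth set")
```

Note that this only measures how many called SNPs are confirmed; it says nothing about truth-set variants that were missed because the gene is not expressed.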
To understand why some of the variants found by samtools/bcftools don't match up to GIAB, I would like to focus only on SNPs within highly expressed genes. To calculate gene expression, I want to use TPM values from Salmon. I've learned that popular quantification tools (Salmon, RSEM, etc.) are based on alignments to the transcriptome. However, only 82% of the reads aligned by STAR overlap an exon. (A related question has previously been addressed.) So why should I trust a quantification tool that is based on the transcriptome when 18% of the reads don't align to an exon?
I do not know about Salmon, but you are wrong about RSEM: it may estimate read counts either from a transcriptome, or using a genome + GTF file with gene annotations.
Salmon works the same way as RSEM in this regard. For example, if you run RSEM with the genome and a GTF file, you first run
rsem-prepare-reference
which extracts the transcripts from the genome, and then you align to that transcriptome. RSEM doesn't deal with alignments to the genome directly. Conversely, with a tool like STAR, you can also "project" genomic alignments (with the help of a GTF) directly to transcriptomic coordinates. In this case, RSEM, Salmon and eXpress are capable of processing these alignments.
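To make the idea of "projecting" concrete, here is a toy sketch of mapping a genomic position onto transcript coordinates using the exon intervals that a GTF would provide. This is not how STAR implements it; the function name genome_to_transcript and the exon coordinates are invented for illustration, and positions are assumed to be 1-based and inclusive.

```python
def genome_to_transcript(pos, exons, strand="+"):
    """Map a 1-based genomic position to a 1-based transcript position.

    exons: list of (start, end) genomic intervals, 1-based inclusive.
    Returns None when the position falls in an intron or outside the
    transcript, i.e. when it has no transcriptomic coordinate.
    """
    exons = sorted(exons)
    total = sum(end - start + 1 for start, end in exons)
    offset = 0  # number of exonic bases before the current exon
    for start, end in exons:
        if start <= pos <= end:
            tpos = offset + (pos - start + 1)
            # On the minus strand the transcript runs in the other direction.
            return tpos if strand == "+" else total - tpos + 1
        offset += end - start + 1
    return None


# Hypothetical two-exon transcript: exon lengths 100 and 50.
exons = [(100, 199), (300, 349)]

print(genome_to_transcript(100, exons))       # first base of exon 1
print(genome_to_transcript(310, exons))       # inside exon 2
print(genome_to_transcript(250, exons))       # intronic position
print(genome_to_transcript(349, exons, "-"))  # minus-strand transcript
```

A position in the intron has no transcript coordinate at all, which is why reads that do not overlap an annotated exon cannot contribute to a transcriptome-based estimate.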