I am attempting to use Salmon (version 0.12.0) to quantify transcript counts from RNAseq data (unstranded paired-end reads).
The existing pipeline in my lab is to trim/qc fastq files, align them with STAR (2.5.2a), merge sample BAMs together (if they were spread across lanes), sort them with samtools, and then feed them to Salmon in alignment-based mode. STAR was run with the --quantMode TranscriptomeSAM
option and the genomeDir
pointed to a genome generated using STAR's genomeGenerate
function with the --sjdbGTFfile
option pointing to a GTF file.
Here's the Salmon run line:
salmon quant -t /hg38_salmon_transcriptome.fa -l IU -p 16 -a some.transcriptome.sorted.bam -o ./
This produces a quant.sf file that seems to make sense, although it also produces a massive (15+ GB) error file that seems really upset about suspicious pairs (see here for someone else with the same issue: Salmon warning detected suspicious pair )
WARNING: Detected suspicious pair ---
The names are different:
read1 : K00274:68:HGYCHBBXX:5:1119:13311:37220
read2 : K00274:68:HGYCHBBXX:3:2103:3325:33598
I was a bit sick of this behavior and decided to give Salmon the fastq files directly in alignment-free mode. This pipeline was to trim/qc my fastqs, combine the files from different lanes and then gives the reads to salmon. The program runs fine and produces no errors.
Here's that Salmon run line:
salmon quant -i /hg38_salmon_transcriptome_index -l IU -1 samp_R1.fq.gz -2 samp_R2.fq.gz -o ./
The big issue is that the quant.sf files look really different between these two pipelines. Like the transcript ENST00000361739 (this is the MT-CO1 gene) has a TPM of 40735 in alignment-free mode, but a TPM of 105 in alignment-mode. Further, the range of TPMs (and counts) varies widely between the two modes. In alignment-free mode, I'm getting TPMs in the thousands, but in alignment mode the highest is 300 and most values are under 100.
So, my questions are:
- Does anyone know what's causes the "suspicious pair" error when I'm running in alignment-based mode?
- Why am I getting such huge differences in the outputs of these two strategies?
Upon further testing, sorting the BAM file is the culprit. This must have been included by mistake in the pipeline I received.
Sorry, I forgot part of the pipeline.
The BAM file was generated in star (unsorted) but then sorted by samtools:
Here's the STAR arguments:
Here is the samtools run line:
See
Note
in salmon documentation. Input BAM files shouldn't be sorted.Note that STAR (as far as I can tell) will not sort the output of transcriptomic mapping (
--quantMode TranscriptomeSAM
), even if you do specify--outSAMtype BAM SortedByCoordinate
. Only the genomic bam will be sorted.Yes, it seems that the output was sorted separately, using
samtools
, after having been generated by STAR. STAR has this behavior, I believe, because the transcriptome output is designed specifically for use with tools that have similar requirements to salmon (i.e. all alignments for a read are together and mates are adjacent in the file).