Salmon Counts differ in Alignment and Quasi-Mapping mode
1
2
Entering edit mode
5.9 years ago
MaxF ▴ 120

I am attempting to use Salmon (version 0.12.0) to quantify transcript counts from RNAseq data (unstranded paired-end reads).

The existing pipeline in my lab is to trim/qc fastq files, align them with STAR (2.5.2a), merge sample BAMs together (if they were spread across lanes), sort them with samtools, and then feed them to Salmon in alignment-based mode. STAR was run with the --quantMode TranscriptomeSAM option and the genomeDir pointed to a genome generated using STAR's genomeGenerate function with the --sjdbGTFfile option pointing to a GTF file.

Here's the Salmon run line:

salmon quant -t /hg38_salmon_transcriptome.fa -l IU -p 16 -a some.transcriptome.sorted.bam -o ./

This produces a quant.sf file that seems to make sense, although it also produces a massive (15+ GB) error file that seems really upset about suspicious pairs (see here for someone else with the same issue: Salmon warning detected suspicious pair )

WARNING: Detected suspicious pair ---
        The names are different:
        read1 : K00274:68:HGYCHBBXX:5:1119:13311:37220
        read2 : K00274:68:HGYCHBBXX:3:2103:3325:33598

I was a bit sick of this behavior and decided to give Salmon the fastq files directly in alignment-free mode. This pipeline was to trim/qc my fastqs, combine the files from different lanes and then gives the reads to salmon. The program runs fine and produces no errors.

Here's that Salmon run line:

salmon quant -i /hg38_salmon_transcriptome_index -l IU -1 samp_R1.fq.gz -2 samp_R2.fq.gz -o ./

The big issue is that the quant.sf files look really different between these two pipelines. Like the transcript ENST00000361739 (this is the MT-CO1 gene) has a TPM of 40735 in alignment-free mode, but a TPM of 105 in alignment-mode. Further, the range of TPMs (and counts) varies widely between the two modes. In alignment-free mode, I'm getting TPMs in the thousands, but in alignment mode the highest is 300 and most values are under 100.

So, my questions are:

  1. Does anyone know what's causes the "suspicious pair" error when I'm running in alignment-based mode?
  2. Why am I getting such huge differences in the outputs of these two strategies?
salmon RNA-Seq alignment • 4.9k views
ADD COMMENT
5
Entering edit mode
5.9 years ago
Rob 6.9k

The command line you show for running using the STAR alignments:

salmon quant -t /hg38_salmon_transcriptome.fa -l IU -p 16 -a some.transcriptome.sorted.bam -o ./

suggests that the bam file provided to salmon was (coordinate?) sorted; is this the case? In alignment mode, salmon, like e.g. RSEM, expects all of the alignments for a read to be adjacent in the BAM file, and for the mates of a pair to follow each other. STAR will output the alignments like this by default, as long as you don't ask it to sort the output. Could you verify if this is the case and, if so, see how things look if you don't coordinate sort the output bam?

ADD COMMENT
1
Entering edit mode

Upon further testing, sorting the BAM file is the culprit. This must have been included by mistake in the pipeline I received.

ADD REPLY
0
Entering edit mode

Sorry, I forgot part of the pipeline.

The BAM file was generated in star (unsorted) but then sorted by samtools:

Here's the STAR arguments:

--outSAMtype BAM Unsorted --outSAMattributes All --alignSJoverhangMin 8 --outSAMunmapped Within

Here is the samtools run line:

samtools sort -@ 8 -o some.transcriptome.sorted.bam some.transcriptome.bam
ADD REPLY
2
Entering edit mode

See Note in salmon documentation. Input BAM files shouldn't be sorted.

ADD REPLY
0
Entering edit mode

Note that STAR (as far as I can tell) will not sort the output of transcriptomic mapping (--quantMode TranscriptomeSAM), even if you do specify --outSAMtype BAM SortedByCoordinate. Only the genomic bam will be sorted.

ADD REPLY
0
Entering edit mode

Yes, it seems that the output was sorted separately, using samtools, after having been generated by STAR. STAR has this behavior, I believe, because the transcriptome output is designed specifically for use with tools that have similar requirements to salmon (i.e. all alignments for a read are together and mates are adjacent in the file).

ADD REPLY

Login before adding your answer.

Traffic: 1717 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6