Hello !
I have been using Salmon to quantify RNA-seq of tumoral tissues using a quasi index of 161 transcripts, part of them being endogenous retroviruses sequences. Here is the command info of the quantification of a sample :
{
"salmon_version": "0.10.2",
"index": "indexes/DB7",
"libType": "A",
"mates1": "output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed1.fastq",
"mates2": "output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed2.fastq",
"threads": "8",
"numBootstraps": "100",
"seqBias": [],
"gcBias": [],
"writeMappings": "bam/3963279a-4960-49a2-936a-a13bb4a7dde5",
"output": "output/HSNC/salmon/3963279a-4960-49a2-936a-a13bb4a7dde5",
"auxDir": "aux_info"
}
I've also used the "--writeMappings" argument for some samples and with a combination of samtools view, sort, and index, created sorted BAM files in order to vizualize them in IGView (I also created with IGView a pseudo-genome based on my set of transcript). However, for some transcripts, there are big differences between the observed coverage and the actual quantification computed by Salmon. I've read the documentation about that and understand that the mappings are computed before the quantification, so they're bound to be different. However, for some transcripts, particularly a family of retroviral enveloppes (K family), there are huge gaps sometimes. For some transcripts, the quantification gives a NumReads of 0 while hundreads or thousands of reads can be observed in IGView.
The documentation says that the reads in the mapping can be incompatible with the library type inferred since its done before filtering, however, my lib_format_counts.json file shows a compatible fragment ratio of 100%, so I don't think it's the problem :
{
"read_files": "( output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed1.fastq, output/trimmed/3963279a-4960-49a2-936a-a13bb4a7dde5/trimmed2.fastq )",
"expected_format": "IU",
"compatible_fragment_ratio": 1.0,
"num_compatible_fragments": 2737879,
"num_assigned_fragments": 2737879,
"num_frags_with_consistent_mappings": 1714617,
"num_frags_with_inconsistent_or_orphan_mappings": 1023262,
"strand_mapping_bias": 0.4979922629951762,
"MSF": 0,
"OSF": 0,
"ISF": 860751,
"MSR": 0,
"OSR": 0,
"ISR": 853866,
"SF": 526731,
"SR": 496531,
"MU": 0,
"OU": 0,
"IU": 0,
"U": 0
}
I've run RSEM on these samples for comparison purposes and the quantification of RSEM roughly match up the one done by Salmon, so I don't think the quantification is at fault here. I thought that maybe since this affects mostly an entire family of retroviruses with often homologous sequences, during the mapping phase the reads from one sequence are distributed to all similar sequences, and then it's corrected after ?
My question is, how can I validate or invalidate this hypothesis ? Or is there another reason that could explain this incoherence between the mappings and the quantification ?
Thanks in advance !
A quick and dirty explanation: if one transcript get a lot of reads but all of them multi-mappers, and a second location get the same lots of multi-mappers (they share all the multi-mappers) but also get a lot of uniquely mapped reads, the EM-algorithm used by Salmon will assign all the counts to the second transcript.
Also tagging Rob for a more elaborate and definitive answer.