Hi All
I am new to Salmon. I got a 26% mapping rate for one sample and 43% for another; the rest of my samples are above 75%. All the data are of similar quality, and FastQC reports no adapter content for any of them. Base quality is mostly above 30, except for a few bases at the start of the reads that fall between 10 and 20. I did not trim the reads, because I have read that aggressive trimming can be harmful.
I also tried changing the k-mer size (k) when building the index, and changing the reference transcriptome, with the same results. I ran BLAT on the overrepresented sequences and found no contaminants. I then mapped the 26% sample with TopHat using default parameters and got 73% alignment. I am not sure why.
I am not sure what else I can do. I can't exclude the 26% and 43% samples from the analysis; they are important.
Thank you
When you map the samples with TopHat, where do the reads that don't map with Salmon come from? You might consider including the rRNA sequences in the Salmon index you use to map the reads.
Sorry Rob, I did not get that. How can I include or exclude rRNA sequences in the Salmon index?
Hi Tania,
Sure; let me try to describe it more completely. When you build a Salmon index, you construct it on some set of transcripts. That is, you obtain a FASTA file of just transcript sequences using a tool like gffread or rsem-prepare-reference (along with a GTF and the genome), or, if your organism has a stand-alone FASTA of transcript sequences (e.g., human, mouse, Arabidopsis, etc.), you download it directly (e.g., from GENCODE). Often, by default, these transcriptome files contain just coding transcripts, that is, mRNAs that are believed to code for proteins. However, there is no guarantee that the purification in your protocol is 100% effective. It is quite possible (and often the case) that other transcripts (e.g., ribosomal RNAs or other types of RNA) make it into the sequenced sample. If those are not in your Salmon index, reads that come from them will simply appear as unmapped. Of course, these sequences are still in the _genome_, which is why a tool like TopHat might be able to find mappings for these reads. You can therefore include the transcript sequences of these other RNAs in your Salmon index to see if they explain your reads (i.e., if your mapping rate goes up).
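As a rough illustration of the last step, here is a minimal Python sketch that appends extra RNA sequences to a transcript FASTA and then assembles a `salmon index` command. The helper names (`merge_fasta`, `salmon_index_command`) and all file names are hypothetical and chosen only for this example. The sketch assumes plain, uncompressed FASTA input and that the first whitespace-delimited token of each header is the record ID.

```python
def merge_fasta(inputs, output):
    """Concatenate FASTA files, keeping only the first record seen for each ID.

    Returns the number of records written. Assumes uncompressed FASTA and
    that the record ID is the header text up to the first whitespace.
    """
    seen = set()
    kept = 0
    with open(output, "w") as out:
        for path in inputs:
            keep = False  # lines before the first header of a file are skipped
            with open(path) as fh:
                for line in fh:
                    if line.startswith(">"):
                        fields = line[1:].split()
                        rec_id = fields[0] if fields else ""
                        keep = rec_id not in seen
                        if keep:
                            seen.add(rec_id)
                            kept += 1
                    if keep:
                        # Guarantee a trailing newline so records never fuse
                        out.write(line if line.endswith("\n") else line + "\n")
    return kept


def salmon_index_command(fasta, index_dir, k=31):
    """Build the argument list for `salmon index` (not executed here)."""
    return ["salmon", "index", "-t", str(fasta), "-i", str(index_dir), "-k", str(k)]


# Example use (hypothetical file names):
#   merge_fasta(["transcripts.fa", "rRNA.fa"], "transcripts_plus_rRNA.fa")
#   cmd = salmon_index_command("transcripts_plus_rRNA.fa", "salmon_index")
#   subprocess.run(cmd, check=True)   # requires `import subprocess` and Salmon on PATH
```

Dropping duplicate IDs is a design choice for this sketch: it avoids indexing the same transcript twice if the extra file overlaps with the original transcript set. After re-quantifying against the new index, comparing the mapping rate with the earlier run shows whether the added sequences account for the previously unmapped reads.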
Great, thanks for the very clear explanation.
One final thing: how do I detect those "other RNAs" in my samples, so I can append them to the human transcriptome I use to build the Salmon index?
Hi Tania,
Typically, you would simply include all of these RNAs in your index. In those samples where they are not present, Salmon will assign them an abundance of 0 (or very close to 0), but in those samples where you have a higher fraction of rRNA, they will show higher abundance. If you are building your transcript set (for indexing) from a GTF file, you can include these transcripts based on their biotype. If you are using a pre-packaged transcript set (e.g., GENCODE), the biotype should appear in the header of each sequence, and you can use it, starting from a comprehensive reference, to filter out those transcripts you don't want to include (e.g., various ncRNAs).
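The header-based filtering can be sketched in Python as follows. This assumes GENCODE-style transcript FASTA headers, where fields are separated by `|` and the transcript biotype is the eighth field; that position is an assumption to check against the headers in your own file. The function names and the record IDs in the example are made up for illustration.

```python
def gencode_biotype(header):
    """Return the biotype from a GENCODE-style FASTA header, or None.

    Assumes '|'-separated fields with the biotype in the eighth position,
    e.g. '>ENST...|ENSG...|...|...|NAME-201|NAME|1657|protein_coding|'.
    """
    fields = header.lstrip(">").strip().split("|")
    return fields[7] if len(fields) > 7 else None


def filter_by_biotype(lines, wanted):
    """Yield only the FASTA lines of records whose biotype is in `wanted`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            keep = gencode_biotype(line) in wanted
        if keep:
            yield line


# Example use (hypothetical file names):
#   with open("comprehensive_transcripts.fa") as src, open("subset.fa", "w") as dst:
#       dst.writelines(filter_by_biotype(src, {"protein_coding", "rRNA", "Mt_rRNA"}))
```

Records whose header does not have enough fields get a biotype of `None` and are dropped, so it is worth counting the records before and after filtering to confirm the header layout matches the assumption.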
Thanks Rob, I will try this. Much appreciated.