Hi all, I am new to the platform.
I was wondering what the common/best practice is regarding building a Salmon index for bulk RNAseq analysis of human cells.
The Salmon/Alevin tutorial uses the complete transcriptome from GENCODE (gencode.vM23.transcripts.fa.gz; the lack of "pc_" in the file name shows it is not the protein-coding subset), which presumably contains rRNA sequences.
However, other tutorials use a transcriptome from Ensembl, which seems to contain only the cDNA (Homo_sapiens.GRCh38.cdna.all.fa.gz).
My research question is rather simple: I am looking for DEGs between treatments, so I am not sure I will be focusing on differences in ncRNAs. Also, the "overrepresented sequences" and "per base GC content" results from FastQC are consistent with a small amount of rRNA contamination (I am currently running Salmon to see if the rRNA levels are similar across all samples).
I wonder which of the following is the common/best practice or if there is a better approach:
- Build a full decoy-aware index using the full transcriptome (as in the Salmon/Alevin tutorial)
- Build a full decoy-aware index using only protein-coding transcript sequences (GENCODE) or cDNA (Ensembl)
- Build a full decoy-aware index using the full transcriptome, but with the rRNA sequences removed from the transcriptome and added to the decoy sequences
- Same as #1, but preceded by rRNA removal with BBDuk or SortMeRNA
Thanks in advance for your help!
Generally speaking, you'll want to provide the full transcriptome plus the genome as the decoy, as per your point 1.
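For anyone wanting to script option 1, here is a minimal Python sketch of the usual preparation steps: list the genome sequence names as decoys, concatenate the transcriptome and genome into a "gentrome", and assemble the `salmon index` call. The function names, file paths, and thread count are placeholders I made up for illustration; only the `salmon index` flags (`-t`, `-d`, `-i`, `-k`, `-p`, `--gencode`) come from Salmon itself. Treat it as a sketch under those assumptions rather than the tutorial's exact procedure.

```python
import gzip
import shutil


def decoy_names(fasta_lines):
    """Return the sequence names (first token of each '>' header line)."""
    names = []
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                names.append(fields[0])
    return names


def build_gentrome(transcripts_gz, genome_gz, gentrome_gz, decoys_txt):
    """Write decoys.txt and a transcripts+genome FASTA (hypothetical paths)."""
    # One genome sequence name per line; Salmon treats these as decoys.
    with gzip.open(genome_gz, "rt") as genome, open(decoys_txt, "w") as out:
        for name in decoy_names(genome):
            out.write(name + "\n")
    # Transcripts must come first and the genome last.
    # Concatenated gzip members still form a valid gzip file.
    with open(gentrome_gz, "wb") as out:
        for path in (transcripts_gz, genome_gz):
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)


def salmon_index_command(gentrome_gz, decoys_txt, index_dir, threads=8):
    """Build the argument list for 'salmon index' (not executed here)."""
    # --gencode trims GENCODE names at the first '|'; drop it for Ensembl FASTA.
    return ["salmon", "index", "-t", gentrome_gz, "-d", decoys_txt,
            "-i", index_dir, "-k", "31", "-p", str(threads), "--gencode"]
```

The returned list can be passed to `subprocess.run(..., check=True)` once Salmon is on the PATH. Keeping the command as a list avoids shell quoting problems with file names.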
Thanks for your reply! I proceeded with #1. After quantification, I looked at the reads assigned to rRNA, rRNA pseudogene, and MT-rRNA transcripts, and they were all minimal (<0.5% of the total TPM of 1e6). Although there were slight differences among treatments, I don't think this will be an issue.
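A check like the one described above could be sketched as follows. This is not the poster's actual script. It assumes the headers of the full GENCODE transcripts FASTA are '|'-delimited with the transcript biotype in the eighth field, that the index was built with `--gencode` so the `Name` column of `quant.sf` holds the bare transcript ID, and that the biotype labels are `rRNA`, `rRNA_pseudogene` and `Mt_rRNA`. The function names are hypothetical.

```python
RRNA_BIOTYPES = {"rRNA", "rRNA_pseudogene", "Mt_rRNA"}


def biotypes_from_gencode_headers(fasta_lines):
    """Map transcript ID -> biotype from '|'-delimited GENCODE headers."""
    biotype_by_tx = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue
        fields = line[1:].strip().split("|")
        # Assumed layout: field 1 is the transcript ID, field 8 the biotype.
        if len(fields) >= 8:
            biotype_by_tx[fields[0]] = fields[7]
    return biotype_by_tx


def rrna_tpm_fraction(quant_lines, biotype_by_tx, rrna_biotypes=RRNA_BIOTYPES):
    """Return the fraction of total TPM assigned to rRNA-like biotypes."""
    rows = iter(quant_lines)
    header = next(rows).rstrip("\n").split("\t")
    name_col, tpm_col = header.index("Name"), header.index("TPM")
    total = 0.0
    rrna = 0.0
    for row in rows:
        cols = row.rstrip("\n").split("\t")
        if len(cols) <= max(name_col, tpm_col):
            continue  # skip blank or truncated lines
        tpm = float(cols[tpm_col])
        total += tpm
        if biotype_by_tx.get(cols[name_col]) in rrna_biotypes:
            rrna += tpm
    return rrna / total if total else 0.0
```

Both functions accept any iterable of lines, so an open file handle (for example from `gzip.open(path, "rt")` for the FASTA and `open(path)` for `quant.sf`) works directly. Running it per sample makes it easy to compare the rRNA fraction across treatments.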