Hi,
I am working with a RNAseq deriving from fungus and at the moment I have used Salmon for quantification. However, I have noticed that between STAR and Salmon, the mapping rate has lowered consistently. In fact, in Salmon, the mapping rate is almost the half than the one in STAR.
I could get higher mapping rate using fmd
compared to quasi
. Nevertheless, I think the percentage is still to low. I have been reading online that it might be due to contamination with rRNA and gDNA. Do you know how I could check that?
This is the code I used for indexing and for quantification:
source ~//miniconda3/etc/profile.d/conda.sh
conda activate Salmon
/media/bulk_01/users/guest21/miniconda3/envs/Salmon/bin/salmon index -t /media/scratchpad_01/guest21/Verticillium_dahliae_VdLs.17_ASM15067v2/data/GCA_000150675.2/VdLS17_gtf.fa \
-i /media/scratchpad_01/guest21/Verticillium_dahliae_VdLs.17_ASM15067v2/data/VdLS17_Indexed_Salmon \
fmd
/media/bulk_01/users/guest21/miniconda3/envs/Salmon/bin/salmon quant -i /media/scratchpad_01/guest21/Verticillium_dahliae_VdLs.17_ASM15067v2/data/VdLS17_Indexed_Salmon \
--libType A -1 /media/scratchpad_01/guest21/Output_RNAseq_30mpi/Vd_CTRL_30mpi_1/Trimmed_Data_Output/Vd_CTRL_R1_forward_paired.fastq.gz \
-2 /media/scratchpad_01/guest21/Output_RNAseq_30mpi/Vd_CTRL_30mpi_1/Trimmed_Data_Output/Vd_CTRL_R2_reverse_paired.fastq.gz \
-o /media/scratchpad_01/guest21/Output_RNAseq_30mpi/Salmon_CTRL_30mpi_1_transcripts.salmon \
--validateMappings
Note: as you can see I am not using decoys, as this is just a reccomended step by Salmon and I also do not know how to obtain the file.
Those two programs are not doing the same thing. Please see @Devon's answer here --> Alignment and mapping
If you have the genome sequence of your organism available, you simply append that at the end of your transcriptome. The process for doing that is described in
salmon
manual --> https://salmon.readthedocs.io/en/latest/salmon.html#preparing-transcriptome-indices-mapping-based-modeThanks for the link GenoMax. But I was wondering, should not alignment and mapping have a similar rate and thus be correlated? In my case I know that 90% of my read aligned to the genome (so I know precisely that there is a correspondence between my reads and the genome I use) but I only know that 45% are mapped to my genome (based on Devon's "where did they come from"). Does that mean that the reads that are not mapped to my trascriptome are not exons/coding genes? I know it might be a silly question, but I really would like to understand that.
Also, I have tried to update
salmon
to the newest version, 1.10.2 and the mapped reads slightly increased.Reads are likely aligning in regions where there is no expressed sequence known to be present. The reads could represent novel RNA's that you did not know about or at worse that could be contamination in your prep (DNA, that is an extreme example).
How would I check if these reads are aligning to regions where there is no expressed sequence know? Should I follow the suggestion of i.sudbery? For DNA contamination, isn't checked when the company performs RNAseq?
I would note that even on a good, polyA selected, RNA-seq run, I would only expect 60-75% of mapped reads to map to protein coding exons.
I'll try to make the decoy file again but I remember you also told me that it is only recommended and not mandatory for running
salmon
. I think I once tried to make the decoy file, but I had some names discrepancies between the files and thus it returned an error.I would start by checking what fraction of the reads mapped by STAR map to genes and not intergentic/intronic sequence.
For doing this, should I just look at the log.out.file?
Take your STAR alignment and sum all the counts for each gene. You can either do this by providing an annotation to STAR, or by running feature counts on the BAM file that comes out of STAR.
If you start with 30M reads, but when you add up all the counts that map to all the genes, and find only 15M reads, you'll know that 50% of your reads arn't mapping to annotated genes.
Oh, then I will check the output file from featureCounts.