I've been confused of my recent RNA-seq analysis result. Here is the situation: the RNA-seq results were obtained by Illumina pair-end 150bp sequencing, after trimming the adapters and low-quality bases, the clean reads were mapped to GRch38 genome, but the results showed that the unique mapping ratio is pretty low. So following the advises of the forum, I mapped the clean reads to rRNA sequences. However, the situation is totally reverse, the sample with higher rRNA proportion got higher unique mapping ratio instead. Can anyone gets any ideas of this or is there any other way I can try to find what are these multiple mapping reads? Thank in advance. Here shows the mapping result:
How are you quantifying rRNA fraction? The process is not necessarily straightforward. These previous discussions may be helpful:
Thanks a lot for your help. I did define the fraction of rRNA by mapping all clean reads to rRNA sequences.
Was this ribo-depleted RNA-seq? Have you looked at the results in IGV? Depending on the experiment you may just have a bunch of repeat expression.
Use these directions to properly post images: How to add images to a Biostars post
Can you tell us what program you used for "cleaning", alignments and for estimating the unique mapping?
Of course, I used the Trimmomatic for cleaning, and Hisat2 for alignment, and then use command
samtools view -q 20
to filter the unique mapped reads.Do you only have 4 samples?
Can you post HiSat2 parameters?
yes, I only have 4 samples. And the hiast2 command is :
hisat2 -p 16 -x $hisat_index -1 $R1_fq -2 $R2_fq -S $out_sam
Did you run FastQC and what were the results?
The FastQC result seemed fine, isn't it?
What are the overrepresented sequences that FastQC finds? And how do you determine the non-unique reads?
The overrepresented sequences are :
And I filtered the unique mapping reads use command:
samtools view -q 20
, the reads with mapping quality lower than 20 were determined as multiple mapping reads.The reason I asked about how you determined uniquely mapping reads is that the MAPQ value (which you are filtering on if you use
samtools view -q 20
) is usually not a good indicator of whether a reads is aligning to multiple places in the genome or not. What a MAPQ value of 20 actually means totally depends on the alignment software as this great post shows in great detail. Particularly for HiSat, the MAPQ does not seem to be the best way to identify uniquely aligning reads, based on this post I would guess that filtering forNH:i:1
might be your best bet.Or use
STAR
instead which has, IMO, a more transparent handling of multi-mappers.