I am using a script to analyze a large amount of ribo-seq data from different studies. Because the data come from different studies, I use trim_galore to remove the adapters, bowtie2 to remove rRNA, and STAR for mapping. After mapping, I found that the mapping rate is normal for some datasets but extremely low for others, such as the following one.
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 9953452
% of reads unmapped: too many mismatches | 6.06%
Number of reads unmapped: too short | 153984837
% of reads unmapped: too short | 93.77%
Number of reads unmapped: other | 177470
% of reads unmapped: other | 0.11%
First I tried adding the parameters
--outFilterScoreMinOverLread 0 --outFilterMatchNminOverLread 0 --outFilterMatchNmin 0
but that still didn't solve my problem.
           Time    Speed        Read     Read   Mapped   Mapped   Mapped   Mapped Unmapped Unmapped Unmapped Unmapped
                    M/hr      number   length   unique   length   MMrate    multi   multi+       MM    short    other
Oct 09 13:54:14    224.7     3931595       54     3.3%     34.2     1.1%     0.0%    96.7%     0.0%     0.0%     0.0%
Oct 09 13:54:14    474.6    16216411       54     3.3%     34.2     1.0%     0.0%    96.7%     0.0%     0.0%     0.0%
Running the data through hisat2 with default parameters gives a similarly low mapping rate.
Time loading forward index: 00:00:02
Time loading reference: 00:00:00
Multiseed full-index search: 00:00:04
1000000 reads; of these:
  1000000 (100.00%) were unpaired; of these:
    999572 (99.96%) aligned 0 times
    169 (0.02%) aligned exactly 1 time
    259 (0.03%) aligned >1 times
0.04% overall alignment rate
I checked the data again and confirmed that I did not mistake the species from which the data came. Why does this happen and how can I solve this problem?
Looks like many reads are "too short". Have you checked the read length distribution with fastqc or something else?
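If fastqc is not at hand, a rough length histogram can also be computed straight from the FASTQ file. This is only a minimal sketch: it assumes standard four-line FASTQ records (no wrapped sequences), and the function name `read_length_histogram` is made up for illustration.

```python
import gzip
from collections import Counter
from itertools import islice


def read_length_histogram(path, max_reads=None):
    """Return a Counter mapping read length -> number of reads.

    Assumes four lines per FASTQ record; .gz files are opened transparently.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    lengths = Counter()
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        seq_lines = islice(handle, 1, None, 4)
        if max_reads is not None:
            seq_lines = islice(seq_lines, max_reads)
        for line in seq_lines:
            lengths[len(line.strip())] += 1
    return lengths
```

Calling it on the trimmed, rRNA-depleted file (for example with `max_reads=1_000_000` to keep it quick) and printing the counts sorted by length shows whether the reads cluster at footprint-like lengths or stay close to the raw read length.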
STAR's "too short" doesn't usually mean literally too short. It just means the reads didn't map. I'd pull out the most common unmapped reads and see what they are.
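One way to do that: rerun STAR with `--outReadsUnmapped Fastx` so that the unmapped reads are written to `Unmapped.out.mate1`, then count the most frequent sequences in that file. A rough sketch, assuming four-line FASTQ records; the function name `top_sequences` is made up for illustration.

```python
import gzip
from collections import Counter
from itertools import islice


def top_sequences(path, n=10, max_reads=1_000_000):
    """Return the n most frequent read sequences as (sequence, count) pairs.

    Only the first max_reads records are scanned to keep memory use modest.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = Counter()
    with opener(path, "rt") as handle:
        # The sequence is the 2nd line of every 4-line record.
        seq_lines = islice(handle, 1, None, 4)
        for line in islice(seq_lines, max_reads):
            counts[line.strip()] += 1
    return counts.most_common(n)
```

Pasting the top few sequences into BLAST usually shows fairly quickly whether they are rRNA, tRNA, adapter or linker sequence, or something from another organism.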
Yes, I checked one of the files, SRR9971635 (the fastq.gz file after rRNA removal), with fastqc and found that almost all reads are 40-60 bp long, which is too long for ribo-seq. (This is another point that confuses me.)
Select one high-quality sequence that you can align with BLAST or BLAT but that doesn't align with STAR, and copy-paste it here.