Hello all,
I am running into an issue where some of the RNA-seq samples I am aligning with STAR are experiencing a high percentage of "reads unmapped: too short":
Started job on | Feb 05 18:42:50
Started mapping on | Feb 05 18:42:53
Finished on | Feb 05 19:13:34
Mapping speed, Million of reads per hour | 69.19
Number of input reads | 35383670
Average input read length | 296
UNIQUE READS:
Uniquely mapped reads number | 15655390
Uniquely mapped reads % | 44.24%
Average mapped length | 294.16
Number of splices: Total | 16143974
Number of splices: Annotated (sjdb) | 15814096
Number of splices: GT/AG | 15969708
Number of splices: GC/AG | 125515
Number of splices: AT/AC | 17520
Number of splices: Non-canonical | 31231
Mismatch rate per base, % | 0.17%
Deletion rate per base | 0.01%
Deletion average length | 1.94
Insertion rate per base | 0.01%
Insertion average length | 1.92
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 522992
% of reads mapped to multiple loci | 1.48%
Number of reads mapped to too many loci | 6751
% of reads mapped to too many loci | 0.02%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 18001046
% of reads unmapped: too short | 50.87%
Number of reads unmapped: other | 1197491
% of reads unmapped: other | 3.38%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
The read quality is excellent according to FastQC. I have tried relaxing the requirements on the mapped length, e.g. --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3, as per the feedback on this post. If I lower these flags to 0.1, most of the "too short" unmapped reads go away and my uniquely mapped reads go up to ~60%, but then I get a lot of multi-mapping reads (~34%). Any value above 0.1 does not noticeably change the number of "unmapped: too short" reads.
What can cause this many "too short" reads to appear? I have read that it can be due to read quality (which doesn't appear to be an issue) or mated pairs not being ordered the same way in the two files (I tried aligning the individual reads separately and saw poor alignment for both). What other things can I look for, or what else can I change when I run STAR?
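To put the thresholds above in concrete terms, here is a rough sketch of the arithmetic behind the two flags. It assumes that the filter requires the aligned portion to cover at least the given fraction of the read length, and that the 296 reported in the log is the summed length of both mates, which is how STAR reports paired-end data. The function name is made up for illustration and STAR's exact rounding is not reproduced.

```python
# Rough arithmetic for STAR's length-normalized filters
# (--outFilterScoreMinOverLread / --outFilterMatchNminOverLread).
# Assumption: an alignment must cover at least fraction * read_length bases,
# where read_length is the sum of both mates for paired-end data.
# STAR's internal rounding is not reproduced here.

def min_matched_bases(read_length, fraction):
    """Approximate number of matched bases needed to pass the filter."""
    return fraction * read_length

AVG_READ_LENGTH = 296  # "Average input read length" from the log above

for fraction in (0.66, 0.3, 0.1):  # 0.66 is the STAR default for both flags
    needed = min_matched_bases(AVG_READ_LENGTH, fraction)
    print(f"{fraction:.2f} -> about {needed:.0f} matched bases required")
```

Under these assumptions the default asks for roughly 195 matched bases out of 296, 0.3 for roughly 89, and 0.1 for only about 30. If reads only start passing at 0.1, the alignable part of most pairs is probably well under 89 bases, and matches that short are also much more likely to hit several loci.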
You can see why reads from short inserts are likely to multi-map: the less sequence there is to align, the more places it can match. There is no magical solution here, short of making new libraries. This is a characteristic of the present libraries.
Consider using salmon instead of STAR so it can use statistics to distribute multi-mapping reads.
I just tried salmon, and I am still encountering low mapping rates. I'm guessing this is something related to the library prep...
You may have contamination. Take a few of the unmapped reads and check them by blasting as suggested by @swbarnes2.
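One way to get a handful of unmapped reads into a form that can be pasted into BLAST is sketched below. It assumes STAR was rerun with --outReadsUnmapped Fastx, which writes unmapped reads to Unmapped.out.mate1 (and Unmapped.out.mate2 for paired-end data) in FASTQ format. The helper name fastq_to_fasta is hypothetical, and the code assumes plain four-line FASTQ records.

```python
from itertools import islice


def fastq_to_fasta(fastq_lines, n_reads=10):
    """Convert the first n_reads FASTQ records to FASTA text.

    fastq_lines: any iterable of lines (e.g. an open file handle).
    Assumes each record is exactly four lines: header, sequence, '+', qualities.
    """
    # Only read as many lines as needed, so large files are not loaded fully.
    lines = [line.rstrip("\n") for line in islice(fastq_lines, 4 * n_reads)]
    records = []
    for i in range(0, len(lines) - 3, 4):
        header, sequence = lines[i], lines[i + 1]
        if not header.startswith("@"):
            raise ValueError(f"Unexpected FASTQ header: {header!r}")
        # Keep only the read name (text before the first whitespace).
        read_name = header[1:].split()[0]
        records.append(f">{read_name}\n{sequence}")
    return "\n".join(records) + ("\n" if records else "")
```

For example, opening Unmapped.out.mate1, passing the file handle to fastq_to_fasta, and printing the result gives ten FASTA records to paste into a BLAST search. If most hits come from rRNA, adapters, or another organism, that points to contamination or a library prep problem, not to the alignment settings.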