Hello! I have 86 bp single-end reads from an Illumina NextSeq 500. Library preparation was carried out with the TruSeq Stranded Total RNA (Ribo-Zero) kit on RNA extracted from mouse embryos. I've mapped the reads to the mm10 reference genome (chr1-19, chrX, chrY, chrM) with subjunc, the junction-mapping aligner from the Rsubread package, using default settings. The mapping rate is only ~50%, with either raw or quality-trimmed reads. I'd be glad to hear your ideas as to why.
The FastQC report for my raw reads is here, if you'd like to take a look: https://drive.google.com/file/d/0B0NZ5u2nKR2qeG14Q25WSXFXNjQ/view?usp=sharing
The report looks like what one would expect from Illumina sequencing, I think. The slightly over-represented sequences (1.4% in total) are small nuclear RNAs according to a BLAST search. I tried fastq_quality_trimmer from the FASTX-Toolkit to trim 3' bases (quality threshold set to 20). According to the FastQC report, some of the bases in the middle of the reads are of poorer quality (<20 at the lower whiskers of the boxplots). Could this be affecting the mapping? Should I use more stringent trimming, or even filter reads based on overall sequence quality?
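To make the "filtering based on overall sequence quality" idea concrete, here is a minimal Python sketch of one possible filter: drop any read in which more than a given fraction of bases falls below Q20. It assumes Phred+33 quality encoding and four-line FASTQ records; the function names and the 10% cutoff are illustrative choices, not something taken from any of the tools mentioned above.

```python
from typing import Iterable, Iterator, List, Tuple

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality string)


def phred_scores(quality_line: str) -> List[int]:
    """Convert an ASCII quality string into a list of Phred scores."""
    return [ord(ch) - PHRED_OFFSET for ch in quality_line]


def fraction_below(quality_line: str, threshold: int = 20) -> float:
    """Fraction of bases whose Phred score is strictly below `threshold`."""
    scores = phred_scores(quality_line)
    if not scores:
        return 1.0  # treat an empty read as entirely low quality
    return sum(1 for q in scores if q < threshold) / len(scores)


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ input.

    A trailing incomplete record is silently dropped.
    """
    it = iter(lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.rstrip("\n"), seq.rstrip("\n"), qual.rstrip("\n")


def filter_by_low_quality_fraction(
    records: Iterable[FastqRecord],
    threshold: int = 20,
    max_fraction: float = 0.1,  # hypothetical cutoff, tune to your data
) -> Iterator[FastqRecord]:
    """Keep reads with at most `max_fraction` of bases below `threshold`."""
    for header, seq, qual in records:
        if fraction_below(qual, threshold) <= max_fraction:
            yield header, seq, qual
```

Passing an open FASTQ file handle to read_fastq and the result to filter_by_low_quality_fraction gives the surviving reads. Comparing the mapping rate before and after such a filter would show whether the low-quality bases in the middle of the reads are really what is holding the aligner back.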
Thanks in advance.
UPDATE: I got an 80% unique mapping rate with STAR 2.4. The mismatch rate of my samples seems to be a bit on the high side (2.4% per base according to the STAR output). According to samstat, 23% of my uniquely mapped reads have at least 4 mismatches. It would seem that Rsubread and TopHat2 are more conservative about mismatches. I've yet to try BBMap.
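For cross-checking the samstat figure, here is a rough Python sketch that tallies the per-read mismatch count from the optional tags of SAM records and reports the fraction of reads at or above a cutoff. It assumes the alignments are available as SAM text, that uniquely mapped reads carry NH:i:1, and that the mismatch count sits in an integer tag; the default tag name "nM" is an assumption about the aligner's output, so pass "NM" (edit distance, which also counts indel bases) if that is what your file contains.

```python
from collections import Counter
from typing import Iterable, List, Optional


def get_int_tag(fields: List[str], tag: str) -> Optional[int]:
    """Return the value of an integer optional tag (TAG:i:VALUE), or None."""
    prefix = tag + ":i:"
    for field in fields[11:]:  # optional fields start at column 12
        if field.startswith(prefix):
            return int(field[len(prefix):])
    return None


def mismatch_histogram(
    sam_lines: Iterable[str],
    tag: str = "nM",          # assumption: tag holding the mismatch count
    unique_only: bool = True,
) -> Counter:
    """Count primary mapped reads by the value of their mismatch tag."""
    counts: Counter = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if flag & 0x4:
            continue  # read is unmapped
        if flag & 0x900:
            continue  # secondary or supplementary alignment
        if unique_only and get_int_tag(fields, "NH") != 1:
            continue  # multi-mapper, or NH tag missing
        value = get_int_tag(fields, tag)
        if value is not None:
            counts[value] += 1
    return counts


def fraction_at_least(counts: Counter, k: int) -> float:
    """Fraction of counted reads whose mismatch value is >= k."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for value, n in counts.items() if value >= k) / total
```

Feeding the lines of a SAM file to mismatch_histogram and then calling fraction_at_least(counts, 4) should give a number comparable to the 23% above, provided the tag assumption holds for the file in question.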