Hello, I have pretty good quality rat RNA-seq data (link to University storage with some stats). I'm getting 80% uniquely mapped, with most of the rest flagged as too short (18.45%). This seems pretty okay to me, but I wanted to dig further, if nothing else to learn. BLASTing the unmapped pairs did give me a lot of Rattus norvegicus hits, so I tried to align R1 and R2 separately as single-end reads, and I got this:
         Speed         Read   Read  Mapped  Mapped  Mapped  Mapped  Mapped  Unmapped  Unmapped  Unmapped
          M/hr       number length  unique  length  MMrate   multi  multi+        MM     short     other
Read1:   916.7   1803310659    150   92.0%   147.6    0.5%    2.4%    0.1%      0.0%      5.4%      0.1%
Read2:   940.0   1802235648    150   91.8%   147.6    0.5%    2.4%    0.1%      0.0%      5.7%      0.1%
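For reference, the single-end runs were along these lines (a sketch rather than my exact invocations; paths, prefixes, and thread counts are placeholders):

# Align each mate on its own (placeholder paths/prefixes).
STAR --runThreadN 16 --genomeDir rnor_index \
     --readFilesIn sample_R1.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix R1_only_

STAR --runThreadN 16 --genomeDir rnor_index \
     --readFilesIn sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix R2_only_

# The unmapped pairs for BLASTing can be kept from the paired-end run with
# --outReadsUnmapped Fastx, which writes Unmapped.out.mate1/mate2.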
So at this point, while I gather that 80% alignment isn't bad at all, I don't really want to settle for discarding up to ~12% of the data. I then looked up the paired-end-related parameters and tried raising --peOverlapMMp from the default of 0.01 to 0.1, which made sense to me (I've also seen 10% used in a few studies by others), but it didn't help much:
Time               Speed         Read   Read  Mapped  Mapped  Mapped  Mapped  Mapped  Unmapped  Unmapped  Unmapped
                    M/hr       number length  unique  length  MMrate   multi  multi+        MM     short     other
Mar 26 14:22:57    504.9   1809627203    300   80.0%   293.5    0.4%    1.4%    0.0%      0.0%     18.4%      0.1%
Does it make sense to loosen this criterion, maybe even further? In the manual, the only other paired-end-related parameter is --peOverlapNbasesMin; would it make sense to touch that? I haven't tried it, as I have no idea what a sane value would be (the reads are 2 × 150 bp). Or is it something else entirely that I should take a look at?
EDIT: I just noticed that my --peOverlapMMp attempt made no sense: the parameter has no effect until mate realignment is enabled by setting --peOverlapNbasesMin to something other than its default of 0, which disables the feature entirely regardless of the --peOverlapMMp setting. I have now tried it (see the sketch below), and the results are similar to paired-end mapping without touching these two parameters: still a higher unmapped % than with single-end.
Have you used QC programs to look at the average insert size of your libraries?
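For example, Picard's CollectInsertSizeMetrics on a coordinate-sorted BAM (file names here are placeholders):

# Insert-size distribution and histogram from a coordinate-sorted BAM.
java -jar picard.jar CollectInsertSizeMetrics \
     I=sample.sortedByCoord.bam \
     O=insert_size_metrics.txt \
     H=insert_size_histogram.pdf \
     M=0.5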
Here is the output of Picard's CollectInsertSizeMetrics; from what I gathered, this seems to be okay: [insert-size metrics/histogram attached]

Look at the number of reads that have aligned, not at the ones that have not. If this result were flipped, then you would need to be super worried. Otherwise, move ahead with the rest of the analysis.
"too short" does not mean the reads are actually short. See --> STAR aligner can't map too short reads
That's fair, but I just don't want to miss alignments because of something simple that I should have spotted or done at some point in the process, or because I misconfigured a STAR parameter. My goal is also to use these samples as an opportunity to learn why, how, and at which points the default process gets changed or adapted, and this seemed like a point where optimisation is possible, given the generally high quality of the samples and the fact that much of the remaining unmapped sequence maps successfully with single-end STAR and with BLAST.
Most software programs are coded with default parameters that should work for the majority of datasets (people sometimes forget that every parameter has a default value that applies unless it is changed). Changing those parameters should be done to account for known characteristics of your data/experiment, not to get better alignment stats. Since you are in learning mode, you could try playing with the parameters and see how that affects the end result. Don't expect to get 100% alignment; if you do, be skeptical of that result.
Most aligners are not able to use single-end and paired-end data at the same time; you have to choose one or the other. Using paired-end data is always going to be better, since it gives you context/information about the size of the library fragment.
When you are using public datasets you have limited information available, so it is best to stay conservative. When you generate your own data, you will be able to make more informed decisions.
Agreed. Until I fully understand what I'm doing, I'm more inclined to trust the default parameters; I know I could always indiscriminately torture the parameters to get more hits, but who knows what my false positive rate would be, as you say regarding 100% alignment. This is why the discrepancy is interesting to me: it happens with default STAR parameters, and it's still quite large (18% vs 5% unmapped) between paired-end and single-end.
The data isn't public, but is indeed my own, from an experiment I conducted. The experimental design, all pharmacological treatments, in vivo testing, sacrifice, and brain extraction were performed by me and my colleagues, so I have complete insight up until brain-region homogenisation and RNA extraction, which were done by a collaborator; Novogene performed the sequencing. EDIT: I suppose this is a large part of the problem: I have never performed sample preparation and sequencing myself, so I can't match discrepancies in the data to steps in the process.