Hello, I have pretty good quality rat RNA-seq data (link to University storage with some stats). I'm getting 80% uniquely mapped, with most of the rest flagged as too short (18.45%). This seems pretty okay to me, but I wanted to dig further, if nothing else to learn. BLASTing the unmapped pairs did give me a lot of Rattus norvegicus hits, so I tried to align R1 and R2 separately as single-end reads, and I got this:
         Speed         Read   Read  Mapped  Mapped  Mapped  Mapped  Mapped  Unmapped  Unmapped  Unmapped
          M/hr       number length  unique  length  MMrate   multi  multi+        MM     short     other
Read1:   916.7   1803310659    150   92.0%   147.6    0.5%    2.4%    0.1%      0.0%      5.4%      0.1%
Read2:   940.0   1802235648    150   91.8%   147.6    0.5%    2.4%    0.1%      0.0%      5.7%      0.1%
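For reference, the single-end runs were along these lines (a sketch rather than my exact invocations; paths, prefixes, and thread counts are placeholders):

# Align each mate on its own (placeholder paths/prefixes).
STAR --runThreadN 16 --genomeDir rnor_index \
     --readFilesIn sample_R1.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix R1_only_

STAR --runThreadN 16 --genomeDir rnor_index \
     --readFilesIn sample_R2.fastq.gz \
     --readFilesCommand zcat \
     --outFileNamePrefix R2_only_

# The unmapped pairs for BLASTing can be kept from the paired-end run with
# --outReadsUnmapped Fastx, which writes Unmapped.out.mate1/mate2.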
So at this point, while I gather that 80% alignment isn't bad at all, I don't really want to settle for discarding up to ~12% of the data. I then looked up the paired-end-related parameters and tried raising --peOverlapMMp from the default of 0.01 to 0.1, which made sense to me (I've also seen 10% used in a few studies by others), but it didn't help much:
Time               Speed         Read   Read  Mapped  Mapped  Mapped  Mapped  Mapped  Unmapped  Unmapped  Unmapped
                    M/hr       number length  unique  length  MMrate   multi  multi+        MM     short     other
Mar 26 14:22:57    504.9   1809627203    300   80.0%   293.5    0.4%    1.4%    0.0%      0.0%     18.4%      0.1%
Does it make sense to loosen this criterion, maybe even further? In the manual, the only other paired-end-related parameter is --peOverlapNbasesMin; would it make sense to touch that? I haven't tried it, as I have no idea what a sane value would be (the reads are 2 × 150 bp). Or is it something else entirely that I should take a look at?
EDIT: I just noticed that my --peOverlapMMp attempt made no sense: the parameter has no effect until mate realignment is enabled by setting --peOverlapNbasesMin to something other than its default of 0, which disables the feature entirely regardless of the --peOverlapMMp setting. I have now tried it (see the sketch below), and the results are similar to paired-end mapping without touching these two parameters: still a higher unmapped % than with single-end.
Have you used QC programs to look at the average insert size of your libraries?
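For example, Picard's CollectInsertSizeMetrics on a coordinate-sorted BAM (file names here are placeholders):

# Insert-size distribution and histogram from a coordinate-sorted BAM.
java -jar picard.jar CollectInsertSizeMetrics \
     I=sample.sortedByCoord.bam \
     O=insert_size_metrics.txt \
     H=insert_size_histogram.pdf \
     M=0.5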
Here is the output of Picard's CollectInsertSizeMetrics; from what I gathered, this seems to be okay: [insert-size metrics/histogram attached]

Look at the number of reads that have aligned, not at the ones that have not. If this result were flipped, then you would need to be super worried. Otherwise, move ahead with the rest of the analysis.
"too short" does not mean the reads are actually short. See --> STAR aligner can't map too short reads
That's fair, but I just don't want to miss alignments because of something simple that I should have spotted or done at some point in the process, or because I misconfigured a STAR parameter. My goal is also to use these samples as an opportunity to learn why, how, and at which points the default process gets changed or adapted, and this seemed like a point where optimisation is possible, given the generally high quality of the samples and the fact that much of the remaining unmapped sequence maps successfully with single-end STAR and with BLAST.
Most software programs are coded with default parameters that should work for the majority of datasets (people sometimes forget that every parameter has a default value that applies unless it is changed). Changing those parameters should be done to account for known characteristics of your data/experiment, not to get better alignment stats. Since you are in learning mode, you could try playing with the parameters and see how that affects the end result. Don't expect to get 100% alignment; if you do, be skeptical of that result.
Most aligners are not able to use single-end and paired-end data at the same time; you have to choose one or the other. Using paired-end data is always going to be better, since it gives you context/information about the size of the library fragment.
When you are using public datasets you have limited information available, so it is best to stay conservative. When you generate your own data, you will be able to make more informed decisions.
Agreed. Until I fully understand what I'm doing, I'm more inclined to trust the default parameters; I know I could always indiscriminately torture the parameters to get more hits, but who knows what my false positive rate would be, as you say regarding 100% alignment. This is why the discrepancy is interesting to me: it happens with default STAR parameters, and it's still quite large (18% vs 5% unmapped) between paired-end and single-end.
The data isn't public, but is indeed my own, from an experiment I conducted. The experimental design, all pharmacological treatments, in vivo testing, sacrifice, and brain extraction were performed by me and my colleagues, so I have complete insight up until brain-region homogenisation and RNA extraction, which were done by a collaborator; Novogene performed the sequencing. EDIT: I suppose this is a large part of the problem: I have never performed sample preparation and sequencing myself, so I can't match discrepancies in the data to steps in the process.