Question

(STAR) What causes a high number of reads that are unmapped because they are too short?

14 months ago

becky.reese • 0

Hello all,

I am running into an issue where some of the RNA-seq samples I am aligning with STAR are experiencing a high percentage of "reads unmapped: too short":

                             Started job on |   Feb 05 18:42:50
                         Started mapping on |   Feb 05 18:42:53
                                Finished on |   Feb 05 19:13:34
   Mapping speed, Million of reads per hour |   69.19

                      Number of input reads |   35383670
                  Average input read length |   296
                                UNIQUE READS:
               Uniquely mapped reads number |   15655390
                    Uniquely mapped reads % |   44.24%
                      Average mapped length |   294.16
                   Number of splices: Total |   16143974
        Number of splices: Annotated (sjdb) |   15814096
                   Number of splices: GT/AG |   15969708
                   Number of splices: GC/AG |   125515
                   Number of splices: AT/AC |   17520
           Number of splices: Non-canonical |   31231
                  Mismatch rate per base, % |   0.17%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.94
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.92
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   522992
         % of reads mapped to multiple loci |   1.48%
    Number of reads mapped to too many loci |   6751
         % of reads mapped to too many loci |   0.02%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |   0.00%
        Number of reads unmapped: too short |   18001046
             % of reads unmapped: too short |   50.87%
            Number of reads unmapped: other |   1197491
                 % of reads unmapped: other |   3.38%
                              CHIMERIC READS:
                   Number of chimeric reads |   0
                        % of chimeric reads |   0.00%

The read quality is excellent according to FastQC. I have tried relaxing the requirements on the mapped length, e.g. --outFilterScoreMinOverLread 0.3 --outFilterMatchNminOverLread 0.3, as per the feedback on this post. If I lower these flags to 0.1, most of the "unmapped: too short" reads go away and my uniquely mapped reads go up to ~60%, but then I get a lot of multi-mapping reads (~34%). Any value above 0.1 is not sufficient to change the number of "unmapped: too short" reads.
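For context, both flags are expressed as a fraction of the read length (for paired-end data, the summed length of both mates), and STAR reports a read as "too short" when its best alignment falls below them. A minimal sketch of the arithmetic, assuming the 296-base average in the log above is the summed mate length and that the run otherwise used STAR's documented default of 0.66 for both flags; the function name is purely illustrative:

```python
def min_matched_bases(read_length: float, fraction: float) -> float:
    """Matched bases needed to pass a threshold given as a fraction of read length."""
    return read_length * fraction


# "Average input read length" from the log above (assumed to be mate1 + mate2).
AVERAGE_INPUT_LENGTH = 296

# 0.66 is the STAR default; 0.3 and 0.1 are the values tried in this post.
for fraction in (0.66, 0.3, 0.1):
    needed = min_matched_bases(AVERAGE_INPUT_LENGTH, fraction)
    print(f"threshold {fraction:.2f}: at least {needed:.1f} matched bases")
```

At 0.1, only about 30 bases of a ~296-base pair need to align, which would be consistent with the jump in multi-mapping reads: a stretch that short can easily match several places in the genome.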

What can cause this many "too short" reads to appear? I have read that it can be due to read quality (which doesn't appear to be an issue) or to mate pairs not being in the same order in the two files (I tried aligning the two mates separately and saw poor alignment for both). What else can I look for, or what else can I change when I run STAR?
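The mate-ordering possibility can also be checked directly, without aligning anything. A minimal sketch, assuming well-formed 4-line FASTQ records and read names that match between the two files once an optional /1 or /2 suffix is dropped; the function names and the example file names are hypothetical:

```python
import gzip
import itertools


def _open_text(path):
    """Open a plain or gzip-compressed FASTQ file in text mode."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


def read_names(path):
    """Yield the read name of each 4-line FASTQ record, without a /1 or /2 suffix."""
    with _open_text(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 != 0:
                continue  # only header lines carry the read name
            if not line.strip():
                break  # trailing blank line at end of file
            name = line[1:].split()[0]
            if name.endswith(("/1", "/2")):
                name = name[:-2]
            yield name


def first_mate_mismatch(r1_path, r2_path):
    """Return (record_index, name1, name2) for the first out-of-sync pair, or None."""
    pairs = itertools.zip_longest(read_names(r1_path), read_names(r2_path))
    for index, (name1, name2) in enumerate(pairs):
        if name1 != name2:
            return index, name1, name2
    return None


# Example call (hypothetical file names):
# first_mate_mismatch("sample_R1.fastq.gz", "sample_R2.fastq.gz")
```

A return value of None means the two files list the same reads in the same order, which would rule out mate ordering as the cause.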

STAR rna-seq • 1.9k views

ADD COMMENT • link updated 14 months ago by GenoMax 151k • written 14 months ago by becky.reese • 0

Reads from short inserts are, logically, more likely to multi-map. There is no magical solution here short of making new libraries; this is a characteristic of the present libraries.

Consider using salmon instead of STAR so it can use statistics to distribute multi-mapping reads.

ADD REPLY • link 14 months ago by GenoMax 151k

I just tried salmon, and I am still encountering low mapping rates. I'm guessing this is something related to the library prep...

ADD REPLY • link 14 months ago by becky.reese • 0

You may have contamination. Take a few of the unmapped reads and check them by blasting as suggested by @swbarnes2.

ADD REPLY • link 14 months ago by GenoMax 151k

score 1 · Answer 1 · 2024-03-06

14 months ago

swbarnes2 14k

"too short" really means "they didn't map".

I'd find the most common unmapped reads and BLAST them to see what organism they belong to.
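One way to pull out candidates for BLAST is to count exact duplicate sequences among the unmapped reads. A minimal sketch, assuming STAR was rerun with --outReadsUnmapped Fastx so that the unmapped reads land in Unmapped.out.mate1 in 4-line FASTQ format; the function names and FASTA header format are illustrative only:

```python
from collections import Counter


def top_sequences(fastq_path, n=10):
    """Return the n most frequent read sequences in a FASTQ file as (sequence, count)."""
    counts = Counter()
    with open(fastq_path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each 4-line record is the sequence
                counts[line.strip()] += 1
    return counts.most_common(n)


def to_fasta(ranked):
    """Format (sequence, count) pairs as FASTA text that can be pasted into BLAST."""
    return "".join(
        f">unmapped_rank{rank}_count{count}\n{seq}\n"
        for rank, (seq, count) in enumerate(ranked, start=1)
    )


# Example call (path depends on the --outFileNamePrefix used for the run):
# print(to_fasta(top_sequences("Unmapped.out.mate1", n=10)))
```

Exact-duplicate counting is a crude proxy: it works best when the unmapped fraction is dominated by a few abundant sequences (rRNA, adapter dimers, a contaminant). If no sequence stands out, BLASTing a small random sample of the unmapped reads is a reasonable fallback.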

ADD COMMENT • link 14 months ago by swbarnes2 14k

"If I lower these flags down to 0.1, I can substantially get rid of reads that are not mapping because they are too short and my Uniquely Mapped reads goes up to ~60%, but then I get a lot of multi-mapping reads (~34%)"

It looks like the unmapped reads go down when you lower the flags noted above, but these reads then multi-map.

ADD REPLY • link 14 months ago by GenoMax 151k