Question

STAR aligner can't map too short reads

1

22 months ago

Assa Yeroslaviz ★ 1.9k

For our Ribo-seq data set I tried the STAR aligner, but it mapped only a very small fraction of the reads (<1% in some samples); most of the reads are discarded as too short.

What does this mean in STAR? Where can I set the minimum read length?

The Ribo-seq data was first trimmed with cutadapt using the TruSeq adapter sequence. I then mapped the reads to the rRNA and kept only the unmapped reads, to be mapped later against the transcriptome using STAR.

How can I increase the number of mapped reads?

Thanks

Assa

$ cat GSM3152885/GSM3152885.Log.final.out 
                             Started job on |       Jan 19 15:04:37
                         Started mapping on |       Jan 19 15:04:39
                                Finished on |       Jan 19 15:08:50
   Mapping speed, Million of reads per hour |       393.02

                      Number of input reads |       27402469
                  Average input read length |       40
                                UNIQUE READS:
               Uniquely mapped reads number |       173659
                    Uniquely mapped reads % |       0.63%
                      Average mapped length |       28.52
                   Number of splices: Total |       501
        Number of splices: Annotated (sjdb) |       0
                   Number of splices: GT/AG |       500
                   Number of splices: GC/AG |       0
                   Number of splices: AT/AC |       0
           Number of splices: Non-canonical |       1
                  Mismatch rate per base, % |       2.79%
                     Deletion rate per base |       0.00%
                    Deletion average length |       1.12
                    Insertion rate per base |       0.00%
                   Insertion average length |       1.00
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |       281767
         % of reads mapped to multiple loci |       1.03%
    Number of reads mapped to too many loci |       12974
         % of reads mapped to too many loci |       0.05%
                              UNMAPPED READS:
Number of reads unmapped: too many mismatches |       0
   % of reads unmapped: too many mismatches |       0.00%
        Number of reads unmapped: too short |       26934058
             % of reads unmapped: too short |       98.29%
            Number of reads unmapped: other |       11
                 % of reads unmapped: other |       0.00%
                              CHIMERIC READS:
                   Number of chimeric reads |       0
                        % of chimeric reads |       0.00%

ribo-seq star • 4.2k views

ADD COMMENT • link updated 22 months ago by i.sudbery 20k • written 22 months ago by Assa Yeroslaviz ★ 1.9k

0

It looks like the read length is 40 bp. Have you tried an ungapped aligner, e.g. bowtie v1.x?

ADD REPLY • link 22 months ago by GenoMax 147k

1

22 months ago

swbarnes2 14k

"Too short" doesn't literally mean too short. It just means they didn't map.

0

22 months ago

Carambakaracho ★ 3.3k

STAR applies some cutoffs and discards reads whose aligned portion is shorter than those cutoffs. I can't recall the exact options I used previously, but here are some suggestions from their issue tracker:

https://github.com/alexdobin/STAR/issues/169#issuecomment-235881989

score 2 · Accepted Answer · 2023-01-21

STAR is a local aligner. That means that a valid alignment doesn't have to cover the whole length of the read and might cover only part of it. Clearly there need to be limits to this: if you had a 40-base read and accepted an alignment matching only a single base within it, it would multi-map to ~25% of positions in the genome. My understanding is that STAR reports longer alignments first, so if there is a valid 40-base alignment for your read it will be reported; if there isn't, a valid 39-base alignment, if one exists, is reported instead, and so on until you hit STAR's limit for alignments to consider. If it reaches the limit without finding a valid alignment, the read is marked "unmapped: too short".
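To put rough numbers on why very short alignments are uninformative, here is a back-of-the-envelope sketch. It assumes a random genome with equal base frequencies and a size of about 3.1 Gb (both are my assumptions for illustration, not properties of your reference), and it counts only one strand.

```python
# Expected number of positions where a k-base sequence matches by chance.
# Assumptions (illustrative only): random sequence with equal base
# frequencies, one strand, and a roughly human-sized genome.
GENOME_SIZE = 3_100_000_000


def expected_chance_hits(k: int, genome_size: int = GENOME_SIZE) -> float:
    """Expected exact matches of a k-base sequence in a random genome."""
    return genome_size / 4 ** k


for k in (1, 10, 15, 16, 20, 26):
    print(f"{k:>2} bases: ~{expected_chance_hits(k):.3g} chance matches")
```

Under these assumptions a single base matches a quarter of all positions, and the expected number of chance matches only drops below one at around 16 bases. Real genomes are repetitive, so in practice the situation is worse than this estimate.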

This limit is set by one or more of the following parameters:

--outFilterScoreMinOverLread: This is the "score" of the alignment divided by the length of the read. Matches get positive scores; mismatches (within the part of the read that is aligned) get negative scores, as do various gap-related events. This sets the relationship between score and length, determining how good a short match has to be to make up for being short.

--outFilterMatchNminOverLread: This is the number of bases in the read that are part of the alignment divided by the length of the read, i.e. what fraction of the read must be part of the alignment.

--outFilterMatchNmin: This is simply the raw minimum length of an alignment. If all your reads are the same length, this amounts to the same thing as the above.

By default, STAR uses --outFilterMatchNminOverLread (default 0.66) to accept only alignments that cover at least about two thirds of the read length. So for 40 nt reads, this would be about 26 nt.
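The arithmetic above can be sketched as follows. The truncation to an integer is my assumption, and STAR's exact rounding may differ by a base. The relaxed values at the end are purely illustrative placeholders, not a recommendation from the STAR documentation.

```python
def min_aligned_bases(read_length: int, min_over_lread: float = 0.66) -> int:
    """Approximate minimum aligned bases implied by --outFilterMatchNminOverLread.

    Assumption: the product is truncated to an integer; STAR's own
    rounding may differ slightly.
    """
    return int(min_over_lread * read_length)


for read_length in (25, 30, 40, 50):
    print(f"{read_length} nt read -> at least ~{min_aligned_bases(read_length)} nt aligned")

# Hypothetical relaxed settings: switch off the two relative filters and
# rely on an absolute minimum instead. The value 20 is an illustrative guess.
relaxed_args = [
    "--outFilterScoreMinOverLread", "0",
    "--outFilterMatchNminOverLread", "0",
    "--outFilterMatchNmin", "20",
]
print(" ".join(relaxed_args))
```

The last line prints the options in the form you would append to your existing STAR command. If the inserts in your reads are shorter than the full 40 nt, a relative cutoff computed on the full read length is the stricter of the two, which is why an absolute minimum is sometimes used instead.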

Note that for an alignment to be valid, it must not match to too many places. The most likely outcome of using a less stringent length filter is that instead of having high "unmapped: too short", you will have high "% of reads mapped to multiple loci", and most likely a high "% of reads mapped to too many loci".
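To check whether reads really do move from "too short" into the multi-mapping categories after changing the filters, you could compare the relevant lines of Log.final.out between runs. A minimal sketch, assuming the "key | value" layout shown in the question (the embedded text is an excerpt of that log):

```python
def parse_star_log(text: str) -> dict[str, str]:
    """Turn the 'key | value' lines of a STAR Log.final.out into a dict."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers such as 'UNIQUE READS:' carry no value
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


def percent(stats: dict[str, str], key: str) -> float:
    """Read a percentage field such as '98.29%' as a float."""
    return float(stats[key].rstrip("%"))


# Excerpt of the log posted in the question.
log_text = """\
                      Number of input reads |       27402469
                    Uniquely mapped reads % |       0.63%
         % of reads mapped to multiple loci |       1.03%
         % of reads mapped to too many loci |       0.05%
             % of reads unmapped: too short |       98.29%
"""

stats = parse_star_log(log_text)
for key in (
    "Uniquely mapped reads %",
    "% of reads mapped to multiple loci",
    "% of reads mapped to too many loci",
    "% of reads unmapped: too short",
):
    print(f"{key}: {percent(stats, key)}")
```

For a real run you would read the file contents instead of the embedded string and compare the same four fields before and after relaxing the filters.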