Question

Why Am I Getting So Many Unmapped Reads In Star, Classified As "Too Short"?

6

Entering edit mode

12.2 years ago

gaelgarcia05 ▴ 280

I am currently using STAR to map several Hi-SEQ mRNA runs.

I'm having trouble getting a decent amount of reads to map, but I don't really understand why. I'm hoping you can shed some light :)

In the final log, only about 50% (or less) of the reads map to the reference. I'm using a GTF in addition to the genome.

The unmapped bin that most of the reads fall into is "too short" . What parameter I might be mis-specifying? These are PE 101 Illumina reads, the quality is pretty good up until the ~85th base, and we have around 200M reads per sample.

What may be causing the unmapped reads: too short set of reads to be so large, I'd really appreciate it.

Thanks! Carmen

My command like is like so:

# $1 = READ1 fq file
# $2 = READ2 fq file
# $3 = PREFIX for Output Files [*.BAM]

/path/to/STAR \
  --genomeDir /path/to/Fly/ \
  --readFilesCommand 'zcat -fc' \
  --readFilesIn $1 $2 \
  --runThreadN 32 \
  --genomeLoad LoadAndRemove \
  --outFilterMultimapNmax 100 \
  --outFilterMultimapScoreRange 2 \
  --outSAMstrandField None \
  --outSAMmode Full \
  --outSAMattributes Standard \
  --outSAMunmapped None \
  --outFilterType BySJout \
  --outStd SAM | samtools view -b -o $3_STAR.bam -S -

And all Final Logs look something like this:

Log.final.out

Started job on |    May 20 22:49:23
                         Started mapping on |    May 20 22:52:08
                                Finished on |    May 21 05:18:10
   Mapping speed, Million of reads per hour |    32.74

                      Number of input reads |    210640950
                  Average input read length |    202
                                UNIQUE READS:
               Uniquely mapped reads number |    29841188
                    Uniquely mapped reads % |    14.17%
                      Average mapped length |    190.07
                   Number of splices: Total |    4405621
        Number of splices: Annotated (sjdb) |    4123661
                   Number of splices: GT/AG |    4348429
                   Number of splices: GC/AG |    25290
                   Number of splices: AT/AC |    664
           Number of splices: Non-canonical |    31238
                  Mismatch rate per base, % |    1.23%
                     Deletion rate per base |    0.03%
                    Deletion average length |    1.92
                    Insertion rate per base |    0.02%
                   Insertion average length |    2.41
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |    36646260
         % of reads mapped to multiple loci |    17.40%
    Number of reads mapped to too many loci |    229494
         % of reads mapped to too many loci |    0.11%
                              UNMAPPED READS:
   % of reads unmapped: too many mismatches |    0.00%
             % of reads unmapped: too short |    58.84%
                 % of reads unmapped: other |    9.49%

This is the command line I used to build the STAR index.

/path/to/STAR_2.3.0e/STAR \
  --runMode genomeGenerate \
  --genomeDir /path/to/Genomes/Fly/ \
  --genomeFastaFiles /path/to/genomes/fly/dm3_genome.fa \
  --runThreadN 16 \
  --sjdbGTFfile /path/to/dm3_refGene_2011_02_15.gtf \
  --sjdbGTFtagExonParentTranscript transcript_id \
  --sjdbOverhang 100

RNA-seq STAR • 16k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 12.2 years ago by gaelgarcia05 ▴ 280

0

Entering edit mode

When people have problems with a specific tool the best way is to ask on the tool developer's support page. They are the ones that understand subtle details about the tool

ADD REPLY • link 12.2 years ago by Istvan Albert 103k

0

Entering edit mode

This is of course way too late but I guess that --sjdbOverhang 100 should be --sjdbOverhang 201

for paired end reads this is mate length (2*101) - 1, as recommended in the manual.

STAR's mapping rate is sensitive for this parameter which means that runs of genome generate are required for each read- length.

ADD REPLY • link updated 2.7 years ago by Ram 45k • written 10.1 years ago by Michael 56k

3

Entering edit mode

According to the STAR manual 2.15a available here

--sjdbOverhang specifies the length of the genomic sequence around the annotated junction to be used in constructing the splice junctions database. Ideally, this length should be equal to the ReadLength-1, where ReadLength is the length of the reads. For instance, for Illumina 2x100b paired-end reads, the ideal value is 100-1=99. In case of reads of varying length, the ideal value is max(ReadLength)-1. In most cases, the default value of 100 will work as well as the ideal value.

So in this user's case, --sjdbOverhang 100 is fine.

ADD REPLY • link updated 3.1 years ago by Ram 45k • written 9.5 years ago by JHumphrey ▴ 30

Ram · Answer 1 · 2015-06-30

1

Entering edit mode

10.1 years ago

cyril-cros ▴ 950

Try to clip your reads to the 85th base using seqtk and run STAR again (STAR + 32 Core => it will be really fast). No need to sort your reads. Quite often, the end quality of RNASeq reads are not so good and they get rejected due to mismatch with the genomic sequences.

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 10.1 years ago by cyril-cros ▴ 950

Ram · Answer 2 · 2017-12-13

0

Entering edit mode

7.7 years ago

Viswanathan • 0

This issue on STAR offers some suggestions to improve mapping of reads.

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 7.7 years ago by Viswanathan • 0