I am currently using STAR to map several HiSeq mRNA runs.
I'm having trouble getting a decent amount of reads to map, but I don't really understand why. I'm hoping you can shed some light :)
In the final log, only about 50% (or less) of the reads map to the reference. I'm using a GTF in addition to the genome.
The unmapped bin that most of the reads fall into is "too short". What parameter might I be mis-specifying? These are PE 101 Illumina reads, the quality is pretty good up until the ~85th base, and we have around 200M reads per sample.
If you could tell me what may be causing the "unmapped reads: too short" set of reads to be so large, I'd really appreciate it.
Thanks! Carmen
My command line is like so:
# $1 = READ1 fq file
# $2 = READ2 fq file
# $3 = PREFIX for Output Files [*.BAM]
/path/to/STAR \
--genomeDir /path/to/Fly/ \
--readFilesCommand 'zcat -fc' \
--readFilesIn $1 $2 \
--runThreadN 32 \
--genomeLoad LoadAndRemove \
--outFilterMultimapNmax 100 \
--outFilterMultimapScoreRange 2 \
--outSAMstrandField None \
--outSAMmode Full \
--outSAMattributes Standard \
--outSAMunmapped None \
--outFilterType BySJout \
--outStd SAM | samtools view -b -o $3_STAR.bam -S -
And all Final Logs look something like this:
Log.final.out
Started job on | May 20 22:49:23
Started mapping on | May 20 22:52:08
Finished on | May 21 05:18:10
Mapping speed, Million of reads per hour | 32.74
Number of input reads | 210640950
Average input read length | 202
UNIQUE READS:
Uniquely mapped reads number | 29841188
Uniquely mapped reads % | 14.17%
Average mapped length | 190.07
Number of splices: Total | 4405621
Number of splices: Annotated (sjdb) | 4123661
Number of splices: GT/AG | 4348429
Number of splices: GC/AG | 25290
Number of splices: AT/AC | 664
Number of splices: Non-canonical | 31238
Mismatch rate per base, % | 1.23%
Deletion rate per base | 0.03%
Deletion average length | 1.92
Insertion rate per base | 0.02%
Insertion average length | 2.41
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 36646260
% of reads mapped to multiple loci | 17.40%
Number of reads mapped to too many loci | 229494
% of reads mapped to too many loci | 0.11%
UNMAPPED READS:
% of reads unmapped: too many mismatches | 0.00%
% of reads unmapped: too short | 58.84%
% of reads unmapped: other | 9.49%
This is the command line I used to build the STAR index.
/path/to/STAR_2.3.0e/STAR \
--runMode genomeGenerate \
--genomeDir /path/to/Genomes/Fly/ \
--genomeFastaFiles /path/to/genomes/fly/dm3_genome.fa \
--runThreadN 16 \
--sjdbGTFfile /path/to/dm3_refGene_2011_02_15.gtf \
--sjdbGTFtagExonParentTranscript transcript_id \
--sjdbOverhang 100
When people have problems with a specific tool, the best approach is to ask on the tool developers' support page. They are the ones who understand the subtle details of the tool.
This is of course way too late, but I guess that --sjdbOverhang 100 should be --sjdbOverhang 201. For paired-end reads this is (2*101) - 1, as recommended in the manual. STAR's mapping rate is sensitive to this parameter, which means that a separate genomeGenerate run is required for each read length.
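According to the STAR manual (2.15a), the ideal --sjdbOverhang is the mate length minus 1. For paired-end reads that means the length of one mate, not the sum of both. So in this user's case, with 101-base mates, --sjdbOverhang 100 is fine.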
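For anyone double-checking their own index, here is a minimal sketch of that arithmetic, assuming the manual's rule of mate length minus 1. The helper name sjdb_overhang is hypothetical (it is not part of STAR), and the paths are just the placeholders from the question. The script only prints the genomeGenerate command, it does not run STAR.

```bash
#!/bin/sh
# Sketch only: compute --sjdbOverhang as (mate length - 1).
# sjdb_overhang is a hypothetical helper, not a STAR command.
sjdb_overhang() {
    # $1 = length of ONE mate (for 2x101 paired-end reads this is 101, not 202)
    echo $(( $1 - 1 ))
}

MATE_LENGTH=101
OVERHANG=$(sjdb_overhang "$MATE_LENGTH")

# Print (do not execute) the index-building command with the derived value.
# Paths are the placeholders used in the question above.
echo "STAR --runMode genomeGenerate --genomeDir /path/to/Genomes/Fly/ --genomeFastaFiles /path/to/genomes/fly/dm3_genome.fa --sjdbGTFfile /path/to/dm3_refGene_2011_02_15.gtf --sjdbOverhang ${OVERHANG}"
```

With MATE_LENGTH=101 the printed command ends in --sjdbOverhang 100, which matches the index the original poster built.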