Hi forum experts:
I am a first time STAR user, and also brand new to RNA-Seq mapping. And I seek some help with understanding my STAR ailgner results logfile, please. I understand STAR’s default parameters are optimized for mammalian genomes, so I should mention here that I am analyzing a plant genome / transcriptome.
Question 1 Specifically, what I find hard to understand is the section under under MULTI-MAPPING READS (that I am summarizing into 1 line):
Number of reads mapped to multiple loci | 0 = 0.00% Vs. Number of reads mapped to too many loci | 639919 = 3.72%
That is not normal, is it?
Question 2 For maximum and minimum intron size (that I did not specify in my test run), can I use the GTF file contents to empirically infer the shortest and longest annotated intron and use those instead of the default values? Or is there a better / more correct rationale to set these values?
The analyses steps I followed thus far, are detailed below. Any help in A. clarifying my confusion would be appreciated, and B. pointing out any other (potential) red flags in the logfile would also be helpful.
THANKS!
1. Raw reads pre-processed using BBmap pipeline (syntax not included)
Step 1-1. rename.sh = Rename SRR data files for BBmap compatibility
Step 1-2. clumpify.sh = Remove optical duplicates
Step 1-3. filterbytile.sh = Remove poor tiles in Illumina reads
Step 1-4. bbduk.sh = Adapter and quality trimming
Step 1-5. bbduk.sh = Filter out artifacts and PhiX spike-ins
Step 1-6. bbmap.sh = Filter out rRNA
Step 1-7. bbmap.sh = Filter out human and common bacterial contaminants
Step 1-8. bbsplit.sh = Split mapping to host genome versus symbiont genome
Step 1-9. Prepare final processed file as input for STAR aligner
2. Genome indexing using STAR
STAR --runThreadN 36 --runMode genomeGenerate --genomeDir \
./STAR_Genome_Index_MtrunA17r5.0-ANR/ --runDirPerm All_RWX --genomeFastaFiles \
./MtrunA17r5.0-ANR_Aug2019/MtrunA17r5.0-20161119-ANR.fasta --genomeChrBinNbits 18 \
--sjdbGTFfile ./MtrunA17r5.0-ANR_Aug2019/MtrunA17r5.0-ANR-EGN-r1.6.gtf --sjdbOverhang 99 --genomeSAindexNbases 13
3. RNA-Seq read mapping using STAR
STAR \
--runMode alignReads \
--runDirPerm All_RWX \
--genomeDir ./STAR_Genome_Index_MtrunA17r5.0-ANR/ \
--readFilesIn /home/aksrao/Mtr_Nod_Time_Course/FASTQ/SRR1726611/SRR1726611_BBmap_RawReadsProcessedFinal.fastq.gz \
--readFilesCommand zcat \
--outFileNamePrefix SRR1726611_BBmap_STARmapping/ \
--outTmpDir SRR1726611_BBmap_STARmapping_TEMP/ \
--outTmpKeep None \
--outFilterMultimapNmax 1 \
--outReadsUnmapped Fastx \
--outSAMtype BAM SortedByCoordinate \
--twopassMode Basic \
--runThreadN 24
STAR ALIGNMENT LOGFILE CONTENTS Started job on | Oct 10 17:50:37 Started mapping on | Oct 10 17:52:31 Finished on | Oct 10 17:54:14 Mapping speed, Million of reads per hour | 601.97
Number of input reads | 17223013
Average input read length | 99
UNIQUE READS:
Uniquely mapped reads number | 16275665
Uniquely mapped reads % | 94.50%
Average mapped length | 99.45
Number of splices: Total | 5603911
Number of splices: Annotated (sjdb) | 5587909
Number of splices: GT/AG | 5546577
Number of splices: GC/AG | 44931
Number of splices: AT/AC | 4167
Number of splices: Non-canonical | 8236
Mismatch rate per base, % | 0.36%
Deletion rate per base | 0.00%
Deletion average length | 1.36
Insertion rate per base | 0.00%
Insertion average length | 1.11
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 0
% of reads mapped to multiple loci | 0.00%
Number of reads mapped to too many loci | 639919
% of reads mapped to too many loci | 3.72%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 306711
% of reads unmapped: too short | 1.78%
Number of reads unmapped: other | 718
% of reads unmapped: other | 0.00%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
a non exhaustive, short comment: STAR dismisses reads mapped to "too many loci" while it keeps "mutlimappers". There's a parameter to control the "too many loci" threshold, I just don't know it by heart...
It's
--outFilterMultimapNmax
and it is set to 1. Increasing this parameter will report more multi mapping reads in your BAM file.Looks like default is
--outFilterMultimapNmax 10
,So should I increment it's value from 1 to 10, while checking for changes in logfile contents or some other output file (*.bam ?) ? And / Or is there a way to empirically determine what is best outFilterMultimapNmax value for a certain input file?
I'd like to remind you about the other questions in my original post well (intron min/max lengths, any other issues in the logfile), thank you !
Since you are mostly working with BBTools any reason why you chose to not use
bbmap.sh
for these alignments?