I used STAR for mapping my RNAseq data using STAR. I had used 2 pass mapping and had supplied GTF files. I am also interested in how the reads gets aligned to transcriptome. So, I used the following script during 2nd pass mapping and the script for 1st pass mapping is similar:
STAR --runThreadN 16 --runMode alignReads --genomeDir 2nDpassGenomeDir --readFilesIn Trm_PE-2ms01e_R1.fastq Trm_PE-2ms01e_R2.fastq --outFileNamePrefix 2nDpassAlignment/2ms01e/2ms01e --outFilterMultimapNmax 10 --outSAMmapqUnique Integer0to255 --outSAMtype BAM SortedByCoordinate --outReadsUnmapped Fastx --outSAMattributes All --alignIntronMin 10 --quantMode TranscriptomeSAM GeneCounts
When I look for the aligned bam file from STAR on IGV I was surprised to see that all the reads have very low mapping quality (0-3), infact reads with mapQ 0 are above 80%. The rest are mapQ 1 and 3. And this is the case for every gene.
I also ran awk to capture any reads with mapping quality above 5 and there were none.
Also, the bam file generated by aligning RNAseq reads to the transcriptome has lots of read with mapping quality 0 (zero).
For confirmation I compared the aligned genome sequence data (aligned using BWA from several samples) at the same regions for several genes and I see that there are lost of reads with mapQ above 40. Also, the reads from RNAseq and Genome sequence data share several variants at the same site.
What is causing the reads from RNAseq alignment using STAR to the same position to have such a low mapping quality?
Is something wrong with the script?
Is something different with how STAR assigns mapping quality?
or, is there some other problem?
Thanks in advance, - B
Did you take a small sample of reads and blasted them at NCBI to make sure they look correct (right genome)?
Genome is the right one. Its the same genome I used to align the genome resequence data using BWA.
The genome reseq data aligned by BWA at the same position have high mapping quality (above 40).
According to a recent blog post by Dr. Simon Andrews, STAR uses the same MAPQ scores as TopHat which would mean that all those reads are multi-mapping which explains why they have MAPQ scores of 0-3 (mapping from 2 to 10 or more locations).
Thanks genomax2. But, there should still be some uniquely mapping reads when I am using the whole genome. There is not one read that aligned and has mapping quality above 3 (this is the highest).
I am little surprised and confused why should the scoring of mapQ value be so different.
Don't literally use
Integer0to255
as the value for--outSAMmapqUnique
. Either use a single value, e.g.,--outSAMmapqUnique 50
, or don't bother with the argument (the default is 255). It's extremely likely thatInteger0to255
caused the "unique alignment" MAPQ to be set to 0.I think that could be the case. I will change the value for --outSAMmapqUnique and will update it on the forum.
Thanks,
Good observation. I see in the code where the integer range is tested but not the argument type. I don't know enough C/C++ to say what the behaviour would be in this case but your comment makes sense to me (and I'm sure you know more about that).
I won't profess to being a C++ aficionado, so I'm also not sure exactly how that's getting handled. I'm curious if this solves the problem though.
@Devon @ SES @ Wouter: Using --outSAMmapqUnique 60 solved the issue. But, there are reads with either mapQ value 60 or 0, 1, 2, 3. I hope STAR will do a better job of assigning mapping quality.
Also, it is stated in STAR manual to use --outSAMmapqUnique Interger0to255 for downstream GATK compatibility. I think this statement could be update to avoid any confusion.
The statement could be better written as --outSAMmapqUnique X - where X is an desired integer value from 0 to 255.
Personally, for me --outSAMmapqUnique Interger0to255 is confusing.
The manual also states (page 26):
What does the quality distribution of the FASTQ data look like? If that is good then you can debug the alignments. If not, then you'll want to go back and figure out what happened with the trimming or even with the library prep if the raw data is no good.
The FastQ data quality looks good. There were some enriched kmers even after quality trimming but not too much. But, based on the previous discussion on biostars I used the data (after it was trimmed).
I am posting some of the quality checks from fastqc after the data was trimmed.
Could you have a look at the mapping qualities in the first pass using STAR?
Its the same. I think the problem is with the scripts.