STAR mapping - regarding output files content
15 months ago
Manko47 ▴ 10

Hello, I have two questions about some of the outputs produced by the STAR mapper in my RNA-seq experiment, particularly the .sequenceReadsPerGene.out file and the .sequenceLog file.

  1. I'm summarising statistics on mapped, multimapped and unmapped reads. I took the reported values of N_unmapped, N_multimapping, N_noFeature and N_ambiguous from the sequenceReadsPerGene.out file (using only the rightmost column, since this is a reverse-stranded experiment), as well as the number of uniquely mapped reads from the .sequenceLog file. However, if I add them together, I end up well above 100% of mapped reads. Am I right to assume that some reads belong to more than one of those categories (for example, uniquely mapped and noFeature)? And in general, what does STAR mean by the noFeature category?

  2. What is the exact difference between a mapped read and an alignment in STAR? I created the gene count matrix for differential gene expression analysis both with featureCounts and directly from STAR's sequenceReadsPerGene.out file (this is a single-end experiment and the results are identical). However, the total number of alignments reported in the log files is well above the total number of reads. Am I right to assume that this is because STAR counts every read that mapped to multiple places in the reference genome as multiple alignments? So if 30% of my reads mapped to multiple loci and I used the default parameters (so only reads that mapped to more than 10 loci are discarded as mapped to too many loci), they can produce a huge number of alignments, because a read that mapped to 5 places is counted as 5 alignments? I also assume that only the uniquely mapped reads are assigned to genes (and not all of them).

Are you saying that the sum of the unmapped, multimapped, noFeature and ambiguous counts is greater than the reported number of uniquely mapped reads? That would make sense, since unmapped reads are counted in addition to mapped reads.

If you have a multimapping read that maps to five locations, then yes, there will be five alignments. However, this won't be reflected in the *ReadsPerGene.out file, since it assigns the multimapping read to the N_multimapping category (so it only counts once).

But the total number of alignments in your BAM file would be greater than the total number of mapped reads.
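As a toy illustration of why alignments can outnumber reads, here is a minimal Python sketch. The read names and hit counts are made up for the example and are not taken from the poster's data.

```python
# Hypothetical reads and the number of loci each one aligned to
# (0 = unmapped, 1 = uniquely mapped, >1 = multimapped).
hits_per_read = {"read1": 1, "read2": 1, "read3": 5, "read4": 2, "read5": 0}

# Each read is counted once, no matter how many loci it hit.
mapped_reads = sum(1 for n in hits_per_read.values() if n > 0)
unique_reads = sum(1 for n in hits_per_read.values() if n == 1)
multi_reads = sum(1 for n in hits_per_read.values() if n > 1)

# Each reported locus is a separate alignment record.
total_alignments = sum(hits_per_read.values())

print(f"mapped reads: {mapped_reads} (unique {unique_reads}, multi {multi_reads})")
print(f"alignments:   {total_alignments}")
```

Four reads are mapped here, but they produce nine alignments, because the two multimappers contribute five and two records respectively.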

To understand what the noFeature category is, you need to understand what the ReadsPerGene.out file reports. ReadsPerGene.out simply counts the number of reads that overlap features in a user-supplied annotation file. Usually, this is a GTF file with gene annotations. If a read mapped to an intergenic region, it would not overlap any feature, so it would fall into the noFeature category.
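The idea can be sketched in a few lines of Python. This is a deliberately simplified model, not STAR's actual implementation: it ignores strand and exon structure, treats each gene as a single closed interval, and the gene names and coordinates are invented for the example.

```python
def classify(read_start, read_end, genes):
    """Assign a uniquely mapped read to a gene, N_noFeature or N_ambiguous.

    genes maps a gene name to a (start, end) closed interval.
    """
    # Collect every gene whose interval overlaps the read.
    hits = [name for name, (g_start, g_end) in genes.items()
            if read_start <= g_end and g_start <= read_end]
    if not hits:
        return "N_noFeature"   # e.g. an intergenic read
    if len(hits) > 1:
        return "N_ambiguous"   # overlaps more than one gene
    return hits[0]

# Hypothetical annotation: two genes that overlap between 180 and 200.
genes = {"geneA": (100, 200), "geneB": (180, 300)}

print(classify(120, 150, genes))  # inside geneA only
print(classify(185, 195, genes))  # inside both genes
print(classify(400, 450, genes))  # outside every gene
```

The three calls fall into geneA, N_ambiguous and N_noFeature respectively, which is how a uniquely mapped read can still end up not being assigned to any gene.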


Thank you - the answer to my second question is exactly what I hoped for so that one is closed.

As for the first question - not exactly. I'm saying that if I sum the unmapped, multimapped, noFeature and ambiguous counts together with the uniquely mapped reads, the result is higher than the total number of input reads. I'm adding photos with the exact counts. Is that fine? I'm assuming it's because some of the noFeature and ambiguous reads also belong to the multimapped/uniquely mapped categories?

P.S. As I mentioned, I only counted the rightmost column from the second photo, since this is reverse-stranded data.

[Two screenshots with the exact counts]


You need to understand that the gene count evaluation isn't really being done by STAR. It's an add-on algorithm (first seen in htseq-count) that runs after the alignment. So there is no reason to think that taking some numbers from one and some numbers from the other will add up to anything meaningful.

Specifically, when the uniquely mapped reads are counted, STAR has no idea whether they are assignable to genes or not. Surely some of them are noFeature or ambiguous.

The better thing to check is if the columns of the htseq-count type output all add up to what they should.


Yes, that is correct: some of the uniquely mapped reads are counted as noFeature or ambiguous.

So by adding them to the uniquely mapped reads, you are double counting them. Basically, if you sum a column over all rows except N_unmapped and N_multimapping, you get the number of uniquely mapped reads. That number plus N_unmapped and N_multimapping equals the total number of input reads.
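The check described above can be sketched in Python. The file contents below are invented for illustration (a real file has one row per gene), and the function name `summarise` is hypothetical. The layout assumed is the usual ReadsPerGene.out one: gene ID, unstranded counts, forward-stranded counts, reverse-stranded counts, separated by tabs.

```python
import io

# Made-up ReadsPerGene.out content; columns are tab-separated.
example = """N_unmapped\t50\t50\t50
N_multimapping\t300\t300\t300
N_noFeature\t100\t600\t120
N_ambiguous\t40\t5\t30
geneA\t400\t20\t390
geneB\t110\t25\t110
"""

def summarise(handle, column=3):
    """Return (unique, unmapped, multimapping) for one count column.

    column=3 is the rightmost column, used for reverse-stranded data.
    """
    unique = unmapped = multimapping = 0
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        count = int(fields[column])
        if fields[0] == "N_unmapped":
            unmapped = count
        elif fields[0] == "N_multimapping":
            multimapping = count
        else:
            # N_noFeature, N_ambiguous and every gene row are all
            # uniquely mapped reads, so they are summed together.
            unique += count
    return unique, unmapped, multimapping

unique, unmapped, multimapping = summarise(io.StringIO(example))
total = unique + unmapped + multimapping
print(f"unique={unique} unmapped={unmapped} multi={multimapping} total={total}")
```

To run it on real output, pass an open file handle instead of the `io.StringIO` wrapper and compare the result with the read counts in STAR's log file.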


Manko47: Please avoid posting screenshots. Using the 101010 button lets you post data as code, which keeps its formatting.
