Confused about % of mapped and unmapped reads output from STAR aligner
2.1 years ago
mohsamir2016 ▴ 30

I am quite new to the STAR aligner and am confused by the numbers of mapped/unmapped reads that it reports:

I would like to know whether the BAM file that STAR outputs when I do not use the argument --outSAMunmapped Within is already filtered for unmapped or duplicate reads. Or do I need to filter it further before variant calling?

The short story is that:

Assuming my file is file.bam: I ran STAR without the argument --outSAMunmapped Within and obtained a BAM file. The Log.final.out for that run reports 83.3% uniquely mapped reads. If I check with the command

samtools view -c -f4 file.bam

it reports 0 reads, so there are no unmapped reads. Running samtools flagstat file.bam also generated this image:

enter image description here

So there are no duplicates and no unmapped reads, i.e. a clean file. When I reran the alignment with --outSAMunmapped Within added and reran samtools flagstat file.bam,

I got this image:

enter image description here

So now a mapping percentage is reported, but the duplicate count is still 0.

Based on that, I assume that:

  1. the BAM file produced by STAR without the argument --outSAMunmapped Within contains only mapped reads (I am not sure whether these are only uniquely mapped reads or not),
  2. if you add this argument, you get a BAM file that contains both mapped and unmapped reads, but what about duplicate reads and mismatches?

Which statistics on the output BAM file should be used in a paper or presentation?

RNAseq STAR • 3.8k views

Thanks, my dear. Just to make sure I understood correctly:

If I run STAR without --outSAMunmapped, outFilterMultimapNmax, or outFilterMismatchNmax (i.e. STAR with default settings), I get a statistics summary in the Log.final.out file. My questions are:

  1. Is the number of input reads exactly the sum of the reads in my trimmed R1 and R2 files, or not?
  2. What exactly is the "Uniquely mapped reads number"? Is it the number of reads that map to only 1 position on the genome, OR that map to < 10 positions, OR the reads that have no other similar reads mapping to the same position on the genome?
  3. Does the "Uniquely mapped reads number" include the multimapped reads that map to < 10 positions (so basically the "Number of reads mapped to multiple loci" shown in the statistics summary)?
  4. What do reads with mapping quality 255 mean? Are these the ones that map to only 1 place on the genome? If so, should their count equal the "Uniquely mapped reads number" (I checked, and they are not the same)? Or would they match it only if I set outFilterMultimapNmax to 1?

Could a STAR expert please clear up my confusion by explaining these points in detail?

Thanks


Read the STAR paper and maybe start an email conversation with Alex Dobin, the developer of STAR.

Also, please put some more effort into formatting your posts.


@swbarnes2


Please do not paste screenshots of plain text content, it is counterproductive. You can copy and paste the content directly here (using the code formatting option), or use a GitHub Gist if the content volume exceeds the allowed length here.

2.1 years ago

STAR will never set the flag for duplicate reads. If you want to know the number of duplicate reads, you'll need to run the BAM through something like Picard, which will determine that and add the duplicate flag where appropriate.

You don't do anything with duplicate reads in RNASeq.


Thanks, I have found a way to mark duplicates and remove them. I know that STAR does not do that, but it was actually samtools that produced the report shown in the image, so I am surprised that it shows no duplicates in the file. Anyhow, could you comment on this: is the BAM file produced by STAR without the --outSAMunmapped Within argument a file that contains only mapped reads (not sure whether these are only uniquely mapped or not), so that there is no need to filter it?

Thanks


I have found a way to mark duplicates and remove them

You should not be doing this for RNA-seq data unless you have UMIs.


Samtools flagstat is just reading the flags in the bam. It is up to whatever program makes or alters the bam to set those flags. STAR will not touch the duplicate flag at all; it will leave it at 0 for all reads. Other programs, like picard, will assess duplication and set that flag, or remove those reads. (Don't remove those reads for RNASeq)
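To make the "just reading the flags" point concrete, here is a minimal sketch of that kind of tally, written in plain Python rather than with samtools itself. The flag bit values come from the SAM specification; the four SAM records are hypothetical and only illustrate the idea (one unique read, one read with a primary and a secondary alignment, one unmapped read). It is not how samtools is implemented.

```python
# SAM flag bits as defined in the SAM specification.
FLAG_UNMAPPED = 0x4      # the bit that `samtools view -f 4` selects on
FLAG_SECONDARY = 0x100   # extra alignment of a multi-mapping read
FLAG_DUPLICATE = 0x400   # only set if a duplicate-marking tool has run

# Hypothetical SAM records, for illustration only.
SAM_LINES = [
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t3\t50M\t*\t0\t0\t*\t*",
    "r2\t256\tchr2\t300\t3\t50M\t*\t0\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]


def tally_flags(lines):
    """Count records per flag category, one count per SAM record."""
    counts = {"records": 0, "mapped": 0, "unmapped": 0,
              "secondary": 0, "duplicate": 0}
    for line in lines:
        if line.startswith("@"):  # skip header lines
            continue
        flag = int(line.split("\t")[1])
        counts["records"] += 1
        if flag & FLAG_UNMAPPED:
            counts["unmapped"] += 1
        else:
            counts["mapped"] += 1
        if flag & FLAG_SECONDARY:
            counts["secondary"] += 1
        if flag & FLAG_DUPLICATE:
            counts["duplicate"] += 1
    return counts


print(tally_flags(SAM_LINES))
```

If the unmapped record is simply absent from the file, the unmapped count is 0 and every remaining record counts as mapped. Likewise, the duplicate count stays at 0 as long as no tool has set the 0x400 bit, regardless of how many duplicates the library really contains.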

The meaning of the outSAMunmapped setting is pretty obvious, and you can confirm how it works yourself.


Thanks. I understood the first part, but for me its meaning is not obvious, which is why I am asking. I tested this myself and explained it in the post above. Again, leaving this option out of the alignment command produced a BAM file that contains 0 unmapped reads, as determined by the command

samtools view -c -f4 file.bam 

compared to a real % of mapped reads when I included it. So my question was whether I understood this correctly. Also, the Log.final.out file in the STAR output contains the % of uniquely mapped reads. Is this the same as the % of mapped reads?

Thanks


Uniquely mapped reads is clearly not the same as mapped reads. The STAR manual says very plainly that the unique mapped statistic it provides is not the same as the mapped statistic you get from samtools flagstat. You read the manual, why do you think the manual is wrong?
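One way to see how the two statistics relate is to read the counts out of STAR's Log.final.out and add them up. The sketch below does that in plain Python. The field names follow the Log.final.out layout, but the numbers are hypothetical (only the 83.3% figure echoes the question), and `parse_star_log` is a helper name made up for this example.

```python
# Hypothetical excerpt of a STAR Log.final.out (numbers invented for illustration).
LOG_TEXT = """\
                          Number of input reads |\t1000000
                   Uniquely mapped reads number |\t833000
                        Uniquely mapped reads % |\t83.30%
        Number of reads mapped to multiple loci |\t90000
             % of reads mapped to multiple loci |\t9.00%
        Number of reads mapped to too many loci |\t2000
             % of reads mapped to too many loci |\t0.20%
"""


def parse_star_log(text):
    """Return {field name: value string} for every 'name | value' line."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue
        key, value = line.split("|", 1)
        stats[key.strip()] = value.strip()
    return stats


stats = parse_star_log(LOG_TEXT)
n_input = int(stats["Number of input reads"])
n_unique = int(stats["Uniquely mapped reads number"])
n_multi = int(stats["Number of reads mapped to multiple loci"])

# Reads reported as mapped = uniquely mapped + multi-mapped.
unique_pct = 100.0 * n_unique / n_input
mapped_pct = 100.0 * (n_unique + n_multi) / n_input
print(f"unique: {unique_pct:.2f}%  mapped (unique + multi): {mapped_pct:.2f}%")
```

Two caveats apply when comparing this with samtools flagstat. For paired-end data, STAR counts a read pair as one "read" in this log. And flagstat counts alignment records rather than reads, so a multi-mapping read that is written at several loci contributes several records. The two tools are therefore not expected to give identical totals.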


I agree with the manual. If the two statistics are not the same, which one should I rely on as the final statistic for that sample? For instance, the STAR Log.final.out did not give the total number of mapped reads, while samtools can provide it. Shall I combine statistics from both?

On the point of duplicate reads: I am calling variants in this RNA-seq data (not counting genes), so duplicate reads would bias the analysis towards overexpressed genes. Am I right that I need to remove them?

Thanks


Hey, have you figured this out? I'm also confused about it.

  1. Does the BAM file produced by STAR without the --outSAMunmapped Within argument contain only mapped reads (and are these only uniquely mapped reads, or not)?
  2. Following on from the first question: if the file is not restricted to uniquely mapped reads, should I use Picard to extract the uniquely mapped reads for downstream analysis? Or is there perhaps no need for RNA-seq, but a need for MeRIP-seq?

Best,
Lu
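As a rough illustration of what "uniquely mapped" means at the level of individual records, here is a minimal Python sketch. It assumes STAR's default settings, under which uniquely mapped reads are given MAPQ 255 and each alignment carries an NH:i: tag with the number of loci the read maps to. The SAM records and the `is_unique` helper are hypothetical, made up for this example.

```python
FLAG_UNMAPPED = 0x4  # SAM flag bit for an unmapped read

# Hypothetical SAM records with STAR-style NH tags (illustration only).
RECORDS = [
    "r1\t0\tchr1\t100\t255\t50M\t*\t0\t0\t*\t*\tNH:i:1",   # unique
    "r2\t0\tchr1\t200\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",     # maps to 2 loci
    "r2\t256\tchr2\t300\t3\t50M\t*\t0\t0\t*\t*\tNH:i:2",   # its second locus
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",                    # unmapped
    "r4\t0\tchr3\t400\t255\t50M\t*\t0\t0\t*\t*",           # no NH tag
]


def is_unique(sam_line):
    """True if the record looks uniquely mapped (NH:i:1, else MAPQ 255)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    mapq = int(fields[4])
    if flag & FLAG_UNMAPPED:
        return False
    # Optional tags start at column 12 of a SAM record.
    for tag in fields[11:]:
        if tag.startswith("NH:i:"):
            return int(tag[len("NH:i:"):]) == 1
    # Assumption: without an NH tag, fall back on STAR's default unique MAPQ.
    return mapq == 255


unique_names = [line.split("\t")[0] for line in RECORDS if is_unique(line)]
print(unique_names)
```

On a real BAM file, a comparable selection can be made with samtools by setting a minimum mapping quality, e.g. `samtools view -b -q 255 file.bam`, again assuming the default unique MAPQ of 255 was not changed when running STAR. Whether restricting to unique reads is appropriate depends on the downstream analysis.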
