Question

44% Successfully Assigned Fragments with featureCounts after 85% uniquely mapped reads with STAR

0

6.3 years ago

garbuzov ▴ 70

Hi there, I'm wondering if anybody can shed some light on what is happening during the count table step with featureCounts. I am losing more than half of my reads. My mapping statistics seem to be fine when I run STAR.

My library is 75 bp paired-end, prepared with the Nugen Ovation Universal kit. The RNA is from rat. I downloaded the NCBI genome and built the STAR index. Here is my command to run STAR:

STAR --runThreadN 12 \
--genomeDir <path to...>/genomes/rn6/ncbi/star \
--readFilesIn ${R1} ${R2} \
--outFileNamePrefix starMapped/${job_name} \
--outSAMtype BAM Unsorted \
--seedSearchStartLmax 40 \
--outFilterScoreMinOverLread 0.5 \
--outFilterMatchNminOverLread 0.5

My mapping rate is 84-89%. A representative Log.final.out:

 UNIQUE READS:
                        Uniquely mapped reads % |       85.59%
                          Average mapped length |       147.92
                       Number of splices: Total |       23011150
            Number of splices: Annotated (sjdb) |       19835290
                       Number of splices: GT/AG |       22429313
                       Number of splices: GC/AG |       180851
                       Number of splices: AT/AC |       22581
               Number of splices: Non-canonical |       378405
                      Mismatch rate per base, % |       0.28%
                         Deletion rate per base |       0.02%
                        Deletion average length |       1.98
                        Insertion rate per base |       0.01%
                       Insertion average length |       1.66
                             MULTI-MAPPING READS:
             % of reads mapped to multiple loci |       10.03%
             % of reads mapped to too many loci |       0.38%
                                  UNMAPPED READS:
       % of reads unmapped: too many mismatches |       0.00%
                 % of reads unmapped: too short |       3.55%
                     % of reads unmapped: other |       0.45%

Next, I run featureCounts using the following command:

featureCounts -T 12 -p -t exon -g gene_id -a <path to...>/NCBI/Annotation/Genes/genes.gtf -o combined_counts.txt *.bam

My output from featureCounts looks like:

Successfully assigned fragments : 41071240 (44.6%)

And this is representative of one sample in the summary file:

Assigned         41243743
Unassigned_Ambiguity    259701
Unassigned_MultiMapping 30155153
Unassigned_NoFeatures   20857145

My question is, why am I losing so many reads at the step of making the count table? Why are multi-mappers ~10% with STAR and then ~30% with featureCounts?

Thanks!

rna-seq alignment RNA-Seq featureCounts STAR • 6.8k views

ADD COMMENT • link updated 16 months ago by Thind amarinder ▴ 340 • written 6.3 years ago by garbuzov ▴ 70

0

6.3 years ago

garbuzov ▴ 70

Ok, I think I understand what you're saying. I was so busy comparing percentages I didn't look at read counts.

So, for STAR I get:

                  Number of input reads |       73019489
                            UNIQUE READS:
           Uniquely mapped reads number |       62360589

For featureCounts my # of assigned reads is:

 Assigned         41243743

And the total input to featureCounts is ~92.5M fragments (summing the summary categories), so yes, the huge drop in percentage makes sense now. But I still have a substantial drop in the number of unique fragments from STAR to featureCounts: 62,360,589 -> 41,243,743. What could explain that? Thanks,

PS: And yes, my library is unstranded. I played around with the options. Adding -s 1 drops the assigned read count to 1%.
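The counts quoted in this thread can be reconciled with a few lines of arithmetic. The sketch below assumes the STAR log excerpt and the featureCounts summary describe the same sample; the variable names are only illustrative. With these numbers, the three non-multimapping categories add up to exactly the STAR "Uniquely mapped reads number", so the drop from 62,360,589 to 41,243,743 is accounted for almost entirely by Unassigned_NoFeatures, with a small contribution from Unassigned_Ambiguity.

```python
# Counts copied from the STAR log excerpt and the featureCounts summary
# quoted in this thread (assumed to describe the same sample).
star_uniquely_mapped = 62_360_589

fc_summary = {
    "Assigned": 41_243_743,
    "Unassigned_Ambiguity": 259_701,
    "Unassigned_MultiMapping": 30_155_153,
    "Unassigned_NoFeatures": 20_857_145,
}

# Total number of entries tallied by featureCounts across all categories.
fc_total = sum(fc_summary.values())

# Everything that featureCounts did not flag as multi-mapping.
fc_non_multimapping = fc_total - fc_summary["Unassigned_MultiMapping"]

# Share of each category relative to the featureCounts total.
for category, count in fc_summary.items():
    print(f"{category:<26}{count:>12,}  {100 * count / fc_total:5.1f}% of total")

print(f"featureCounts total:       {fc_total:,}")
print(f"non-multimapping entries:  {fc_non_multimapping:,}")
print(f"STAR uniquely mapped:      {star_uniquely_mapped:,}")

# Fraction of uniquely mapped fragments that overlap no annotated exon.
no_feature_share = fc_summary["Unassigned_NoFeatures"] / star_uniquely_mapped
print(f"NoFeatures share of unique fragments: {100 * no_feature_share:.1f}%")
```

Roughly a third of the uniquely mapped fragments fall outside annotated exons here, which points at the annotation, the library type (e.g. intronic signal), or the strandedness setting rather than at the alignment step.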

ADD COMMENT • link 6.3 years ago by garbuzov ▴ 70

0

Although I can't give you hard numbers, it is not uncommon to see a substantial drop between the mapping rate and the feature-assignment rate. It depends on several factors, and someone may chime in with more suggestions, but how good is the Rattus norvegicus annotation? In general, I consider human and mouse annotations to be of very high quality, with all other annotations being average at best. I am not familiar with the R. norvegicus annotation, though.

PS: And yes, my library is unstranded. I played around with the options. Adding -s 1 drops the assigned read count to 1%.

Then try with -s 2, because I don't think your library is unstranded. If your library were truly unstranded, an assigned rate of 1% would not be realistic: you would expect about half of the reads to map to each strand, so about half of the reads should have been assigned. This looks like a "reverse stranded" library incorrectly treated as "forward stranded".
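The reasoning above can be turned into a rough rule of thumb: compare the assigned counts from a run with -s 1 and a run with -s 2. This is only a sketch; the function name, the tolerance, and the example counts are hypothetical and not taken from the poster's data.

```python
def guess_strandedness(assigned_s1: int, assigned_s2: int, tolerance: float = 0.2) -> str:
    """Guess the library type from featureCounts assigned counts.

    assigned_s1: assigned fragments when run with -s 1 (forward stranded)
    assigned_s2: assigned fragments when run with -s 2 (reverse stranded)
    tolerance:   hypothetical cut-off for how far from a 50/50 split
                 the counts may be and still be called unstranded
    """
    total = assigned_s1 + assigned_s2
    if total == 0:
        raise ValueError("no assigned fragments in either run")

    forward_share = assigned_s1 / total

    # An unstranded library should split roughly evenly between strands.
    if abs(forward_share - 0.5) <= tolerance:
        return "unstranded"
    return "forward" if forward_share > 0.5 else "reverse"


# Hypothetical counts: almost nothing assigned with -s 1, most with -s 2.
print(guess_strandedness(assigned_s1=900_000, assigned_s2=55_000_000))
# Hypothetical counts: a near-even split between the two settings.
print(guess_strandedness(assigned_s1=20_000_000, assigned_s2=21_000_000))
```

A result of "reverse" would match the suspicion above that the library is reverse stranded; the tolerance is arbitrary and should be adjusted to taste.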

ADD REPLY • link 6.2 years ago by h.mon 35k

0

I wonder: was this total RNA-seq data, or polyA-selected?

ADD REPLY • link 16 months ago by Thind amarinder ▴ 340

score 2 · Accepted Answer · 2019-06-06

You are not showing a crucial piece of information which would prove me right (or wrong): the number of input reads. But here is my guess:

The figure STAR reports as "% of reads mapped to multiple loci" is relative to the number of input reads. However, the number featureCounts reports as "Unassigned_MultiMapping" is relative to the number of mapped reads. If 10% of your input reads are multimappers, but each maps to 4 locations and is counted once per location, the featureCounts output would make you think you have about 30% multimappers.
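To make the guess concrete, here is a minimal sketch of the arithmetic. It assumes that every multimapper contributes one entry per locus to the featureCounts tally, and it takes 85% uniquely mapped reads from the STAR log in the question; the function name is only illustrative.

```python
def multimapping_share_of_alignments(unique_frac: float,
                                     multi_frac: float,
                                     loci_per_multimapper: float) -> float:
    """Share of all reported alignments that belong to multimappers.

    unique_frac:          fraction of input reads that map uniquely
    multi_frac:           fraction of input reads that map to multiple loci
    loci_per_multimapper: assumed average number of loci per multimapper
    """
    unique_alignments = unique_frac                       # one alignment each
    multi_alignments = multi_frac * loci_per_multimapper  # one per locus
    return multi_alignments / (unique_alignments + multi_alignments)


# 85% unique, 10% multimappers, 4 loci per multimapper (the guess above).
share = multimapping_share_of_alignments(0.85, 0.10, 4)
print(f"multimappers as a share of alignments: {share:.0%}")
```

Under these assumptions the multimapper share rises from 10% of input reads to roughly a third of the alignments, which is in line with the "about 30%" estimate.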

P.S.: Did you check whether the Nugen Ovation kit you are using really produces an unstranded library? The featureCounts command you are issuing treats your reads as coming from an unstranded library.