featureCounts percentage of successful assigned alignments lower than expected
0
0
Entering edit mode
3.0 years ago
gt ▴ 30

Still relatively new to RNA-seq data. I have taken a single BAM file (filtered to proper pairs) for human paired-end RNA-seq data. Below is the output for STAR alignment.

                      Number of input reads |       172222031
                  Average input read length |       245
                                UNIQUE READS:
               Uniquely mapped reads number |       123631446
                    Uniquely mapped reads % |       71.79%
                      Average mapped length |       253.43
                   Number of splices: Total |       56672165
        Number of splices: Annotated (sjdb) |       55335338
                   Number of splices: GT/AG |       55607873
                   Number of splices: GC/AG |       478746
                   Number of splices: AT/AC |       52251
           Number of splices: Non-canonical |       533295
                  Mismatch rate per base, % |       0.51%
                     Deletion rate per base |       0.02%
                    Deletion average length |       1.75
                    Insertion rate per base |       0.02%
                   Insertion average length |       1.38
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |       38635060
         % of reads mapped to multiple loci |       22.43%
    Number of reads mapped to too many loci |       523608
         % of reads mapped to too many loci |       0.30%
                              UNMAPPED READS:
Number of reads unmapped: too many mismatches |       0
    % of reads unmapped: too many mismatches |       0.00%
        Number of reads unmapped: too short |       8598098
             % of reads unmapped: too short |       4.99%
            Number of reads unmapped: other |       833819
                 % of reads unmapped: other |       0.48%
                              CHIMERIC READS:
                   Number of chimeric reads |       0
                        % of chimeric reads |       0.00%

So I have 71.79% of the reads being uniquely mapped to the reference (not horrendous). However, when I look at the results of featureCounts, I have the following output.

|| Process BAM file Aligned.sortedByCoord.out.properPairs.bam...              ||
||    Strand specific : reversely stranded                                    ||
||    Paired-end reads are included.                                          ||
||    Total alignments : 401232754                                            ||
||    Successfully assigned alignments : 67882178 (16.9%)                     ||
||    Running time : 15.64 minutes                                            ||

Below is the more detailed output file from featureCounts.

Assigned        67882178
Unassigned_Unmapped     0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality       0
Unassigned_Chimera      0
Unassigned_FragmentLength       0
Unassigned_Duplicate    0
Unassigned_MultiMapping 277601308
Unassigned_Secondary    0
Unassigned_NonSplit     0
Unassigned_NoFeatures   50284182
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    5465086

Clearly, featureCounts is dividing the number of assigned alignments to the total alignments, including the multi-mapped reads. Should I instead take the number of alignments assigned divided by the total number of uniquely mapped reads? This would instead give me 55% of alignments being assigned. Is there a way to change the percentage assigned reported in featureCounts command? Also, is 55% assigned to exons reasonable for human RNA-seq data?

My STAR command is the following:

STAR --runThreadN 16 --readFilesCommand zcat --outFilterMultimapNmax 20 --genomeDir 'star_genome_index' --outSAMtype BAM SortedByCoordinate --outFileNamePrefix 'BAM/' --readFilesIn 'R1.fastq.gz' 'R2.fastq.gz'

My featureCounts command is the following:

featureCounts -p -t exon -g gene_id -s 2 -T 16 -a gencode.v38.annotation.gtf -o counts.txt Aligned.sortedByCoord.out.properPairs.bam

I did use the infer_experiment.py to check the strandedness of the library and the majority of the reads could not be inferred. I have tried both reverse stranded and un-stranded in the featureCounts command and get almost identical % assigned.

RNA-seq STAR featureCounts • 2.0k views
ADD COMMENT

Login before adding your answer.

Traffic: 2161 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6