Without -O flag

Question

RNA-Seq Read Counting using featureCounts

0

Entering edit mode

5.5 years ago

deepak18 ▴ 10

I am new to RNA-seq analysis of paired end read data and I was performing read counting using featureCounts. The use of -O flag increases the number of assigned reads drastically (difference of more than 20 million assigned reads). I don't quite understand the use of this option. In the manual it says that it "Assign reads to all their overlapping meta-features (or features if -f is specified)". Please can someone elaborate on this. Thanks in advance

RNA-Seq featureCounts subread • 6.1k views

ADD COMMENT • link 5.5 years ago by deepak18 ▴ 10

score 1 · Answer 1 · 2019-11-07

In the featureCounts manual, it said

By default, featureCounts does not count multi-overlapping reads

But after specifying -O flag

each overlapping meta-feature/feature receives a count of 1 from a read

So first, multi-overlapping reads will be counted; Second, more features overlapped by one read, more times it will be counted. I think that's why the number of assigned reads drastically

score 0 · Answer 2 · 2019-11-07

0

Entering edit mode

5.5 years ago

deepak18 ▴ 10

Below is the results summary:

Without -O flag

Assigned    4515031
Unassigned_Unmapped 0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 4898882
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   1220228
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    24695226

With -O flag

Assigned    29210257
Unassigned_Unmapped 0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 4898882
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   1220228
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    0

No. of ambiguous reads is more than 20 million. The code that I have used is:

featureCounts -p -f  -O -a /home/erpl/RNA-seq_Alignment_tools/star/indexing/Homo_sapiens.GRCh38.98.gtf -o f_counts_T24_1.txt /home/erpl/RNA_seq_Novogene/RNA_Sequencing_Novogene_Results/output_17.10.19/T24_1Aligned.sortedByCoord.out.bam

Alignment was performed with STAR.

STAR --runThreadN 12 --runMode genomeGenerate --sjdbGTFfile Homo_sapiens.GRCh38.98.gtf --genomeDir /home/erpl/star/indexing --genomeFastaFiles Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
STAR --runThreadN 20 --genomeDir /home/erpl/RNA-seq_Alignment_tools/star/indexing --sjdbGTFfile /home/erpl/RNA-seq_Alignment_tools/star/indexing/Homo_sapiens.GRCh38.98.gtf --readFilesIn C24_1_1.fq C24_1_2.fq --outSAMtype BAM SortedByCoordinate --outFileNamePrefix /home/erpl/RNA_seq_Novogene/RNA_Sequencing_Novogene_Results/output_17.10.19/C24_1

Is there a problem with the way I am performing this analysis or is there a problem with the library?

ADD COMMENT • link 5.5 years ago by deepak18 ▴ 10

1

Entering edit mode

Sounds like a large proportion of your reads map to multiple genes. Do you have rRNA contamination in your data? If so those exist in hundreds of genes and would amplify the total counts (since each feature receives a count). Try doing a log-log count plots of the two quantifications to see the real difference.

Alternatively featureCounts support fractional counting of multi-mappers (so 1/N reads are assigned where N is the number of matches).

But if you are interested in quantification you might want to read this - basically featureCounts might not be the better option for you...

ADD REPLY • link 5.5 years ago by Kristoffer Vitting-Seerup ★ 4.2k

0

Entering edit mode

But then why STAR statistics tell a different story? Over 90% reads had mapped uniquely and only about 3% were muti-mapping.

Number of input reads |   32604356
                  Average input read length |   300
                                UNIQUE READS:
               Uniquely mapped reads number |   30430485
                    Uniquely mapped reads % |   93.33%
                      Average mapped length |   297.43
                   Number of splices: Total |   38026407
        Number of splices: Annotated (sjdb) |   37697238
                   Number of splices: GT/AG |   37644236
                   Number of splices: GC/AG |   296935
                   Number of splices: AT/AC |   38377
           Number of splices: Non-canonical |   46859
                  Mismatch rate per base, % |   0.47%
                     Deletion rate per base |   0.01%
                    Deletion average length |   1.84
                    Insertion rate per base |   0.01%
                   Insertion average length |   1.71
                         MULTI-MAPPING READS:
    Number of reads mapped to multiple loci |   1060773
         % of reads mapped to multiple loci |   3.25%
    Number of reads mapped to too many loci |   7889
         % of reads mapped to too many loci |   0.02%
                              UNMAPPED READS:

I had previously done this analysis with DEXSeq without any difficulties but then I switched to different tools because only very few exons showed significant difference between 2 experimental conditions.

ADD REPLY • link 5.5 years ago by deepak18 ▴ 10

0

Entering edit mode

Multiple mapping reads are not the same thing as the reads overlapping multiple features, a read can be mapped to a unique position but there could be some overlapping transcripts which caused this read overlapped by multiple transcripts (features)

ADD REPLY • link 5.5 years ago by Jianyu ▴ 580

0

Entering edit mode

If you want more help from me I would need to see the log-log plot of the two quantifications.

ADD REPLY • link 5.5 years ago by Kristoffer Vitting-Seerup ★ 4.2k