Question

Low Counts with featureCounts

0

Entering edit mode

4.3 years ago

Neat • 0

Dear Biostars-Community,

I'm fairly new to RNA-Seq and Bioinformatics in general (so please excuse if I use some terminology incorrectly) and have to analyze two datasets. I'm currently trying to use featureCounts. There I noticed that I get a very low and varying rate of "Successfully assigned alignments" that varies randomly in between either just over 30% or 3-9%.

The samples are from human pediatric gliomas. These aren't samples from my institute but samples/data from the ega-archive.

Dataset No. 1

paired: yes

stranded: yes

In dataset No. 1 the unassigned reads fall mostly into the category of "MultiMapping" and some into "NoFeatures" or "Ambiguity".

Dataset No. 1 I aligned/mapped myself with STAR as follows:

./STAR --runThreadN 20 --genomeDir STAR/Index --readFilesIn Datensätze_Python_sortiert/PA4/EGAF00000227710.txt.gz Datensätze_Python_sortiert/PA4/EGAF00000227711.txt.gz --readFilesCommand zcat --outFileNamePrefix Datensätze_Python_sortiert/PA4/STAR/PA4. --outSAMtype BAM SortedByCoordinate --sjdbGTFfile STAR/Homo_sapiens.GRCh38.98.gtf --quantMode GeneCounts

This was the result:

       Mapping speed, Million of reads per hour |   312.76

                          Number of input reads |   73411821
                      Average input read length |   102
                                    UNIQUE READS:
                   Uniquely mapped reads number |   58464253
                        Uniquely mapped reads % |   79.64%
                          Average mapped length |   101.58
                       Number of splices: Total |   13366347
            Number of splices: Annotated (sjdb) |   13252300
                       Number of splices: GT/AG |   13224742
                       Number of splices: GC/AG |   101283
                       Number of splices: AT/AC |   11642
               Number of splices: Non-canonical |   28680
                      Mismatch rate per base, % |   0.25%
                         Deletion rate per base |   0.01%
                        Deletion average length |   1.50
                        Insertion rate per base |   0.00%
                       Insertion average length |   1.28
                             MULTI-MAPPING READS:
        Number of reads mapped to multiple loci |   13343040
             % of reads mapped to multiple loci |   18.18%
        Number of reads mapped to too many loci |   534395
             % of reads mapped to too many loci |   0.73%
                                  UNMAPPED READS:
  Number of reads unmapped: too many mismatches |   0
       % of reads unmapped: too many mismatches |   0.00%
            Number of reads unmapped: too short |   1031175
                 % of reads unmapped: too short |   1.40%
                Number of reads unmapped: other |   38958
                     % of reads unmapped: other |   0.05%
                                  CHIMERIC READS:
                       Number of chimeric reads |   0
                            % of chimeric reads |   0.00%

Then I used the following code for featureCounts:

subread-2.0.1-Linux-x86_64/bin/featureCounts -T 3 -s 0 -p \
-a …STAR/Homo_sapiens.GRCh38.98.gtf \
-o …Files_pyega3/RNA-Seq/$i/featureCounts/${i}_s0_p.txt \
…Files_pyega3/RNA-Seq/$i/*.rnaseq.bam \

... and got those results:

PA4_s0_p.txt.summary:

Assigned    42229610 
Unassigned_Unmapped 0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 54258403
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   9054644
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    7179999

PA4_s1_p.txt.summary

Assigned    2786940
Unassigned_Unmapped 0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 54258403
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   55561918
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    115395

PA4_s2_p.txt.summary

Assigned    42463622
Unassigned_Unmapped 0
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 54258403
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   10021812
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    5978819

Dataset No. 2

paired: yes

stranded: no? (according to the paper and allegedly used kit it isn't, I yet have to check that with infer_experiment.py)

In dataset No. 2 the unassigned reads fall mostly into the category of "NoFeatures", some into "MultiMapping" as well as "Unmapped" and little into "Ambiguity".

Dataset No. 2 I didn't map/align myself and bam-files were already given.

I used the exact same code for featureCounts as for dataset No. 1.

I got the following results:

LGG766_s0_p.txt.summary

Assigned    11628269
Unassigned_Unmapped 1801678
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 4651389
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   28606632
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    590109

LGG766_s1_p.txt.summary

Assigned    2778622
Unassigned_Unmapped 1801678
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 4651389
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   37951173
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    95215

LGG766_s2_p.txt.summary

Assigned    9536998
Unassigned_Unmapped 1801678
Unassigned_Read_Type    0
Unassigned_Singleton    0
Unassigned_MappingQuality   0
Unassigned_Chimera  0
Unassigned_FragmentLength   0
Unassigned_Duplicate    0
Unassigned_MultiMapping 4651389
Unassigned_Secondary    0
Unassigned_NonSplit 0
Unassigned_NoFeatures   31018259
Unassigned_Overlapping_Length   0
Unassigned_Ambiguity    269753

Maybe someone on here recognizes the pattern I'm getting or has some ideas on how to resolve the problem or to at least name/guess the problem, because I'm totally lost here.

Thanks a lot in advance!

Neat

RNA-Seq featureCounts • 2.5k views

ADD COMMENT • link updated 3.5 years ago by badribio ▴ 290 • written 4.3 years ago by Neat • 0

0

Entering edit mode

Hello, could you solve it?. I have similar problem.

ADD REPLY • link 3.5 years ago by vanebioq • 0

0

Entering edit mode

Please use add comment to add someting to existing question. and not Add answer.

ADD REPLY • link 3.5 years ago by badribio ▴ 290

0

Entering edit mode

I've moved their post to a comment this time.

ADD REPLY • link 3.5 years ago by Ram 44k

score 0 · Answer 1 · 2021-06-01

Looks like you are using incorrect stand param -s 0 for stranded it should be -s 1 for dataset1

from the docs

Strandness

  -s <int>            Perform strand-specific read counting. Acceptable values:
                      0 (unstranded), 1 (stranded) and 2 (reversely stranded).
                      0 by default.

For the data set that you do not know the strand info i would convert the bam to fastq and use this tool and then use the right featurecount param