Dear Biostars-Community,
I'm fairly new to RNA-Seq and Bioinformatics in general (so please excuse if I use some terminology incorrectly) and have to analyze two datasets. I'm currently trying to use featureCounts. There I noticed that I get a very low and varying rate of "Successfully assigned alignments" that varies randomly in between either just over 30% or 3-9%.
The samples are from human pediatric gliomas. These aren't samples from my institute but samples/data from the ega-archive.
Dataset No. 1
paired: yes
stranded: yes
In dataset No. 1 the unassigned reads fall mostly into the category of "MultiMapping" and some into "NoFeatures" or "Ambiguity".
Dataset No. 1 I aligned/mapped myself with STAR as follows:
./STAR --runThreadN 20 --genomeDir STAR/Index --readFilesIn Datensätze_Python_sortiert/PA4/EGAF00000227710.txt.gz Datensätze_Python_sortiert/PA4/EGAF00000227711.txt.gz --readFilesCommand zcat --outFileNamePrefix Datensätze_Python_sortiert/PA4/STAR/PA4. --outSAMtype BAM SortedByCoordinate --sjdbGTFfile STAR/Homo_sapiens.GRCh38.98.gtf --quantMode GeneCounts
This was the result:
Mapping speed, Million of reads per hour | 312.76
Number of input reads | 73411821
Average input read length | 102
UNIQUE READS:
Uniquely mapped reads number | 58464253
Uniquely mapped reads % | 79.64%
Average mapped length | 101.58
Number of splices: Total | 13366347
Number of splices: Annotated (sjdb) | 13252300
Number of splices: GT/AG | 13224742
Number of splices: GC/AG | 101283
Number of splices: AT/AC | 11642
Number of splices: Non-canonical | 28680
Mismatch rate per base, % | 0.25%
Deletion rate per base | 0.01%
Deletion average length | 1.50
Insertion rate per base | 0.00%
Insertion average length | 1.28
MULTI-MAPPING READS:
Number of reads mapped to multiple loci | 13343040
% of reads mapped to multiple loci | 18.18%
Number of reads mapped to too many loci | 534395
% of reads mapped to too many loci | 0.73%
UNMAPPED READS:
Number of reads unmapped: too many mismatches | 0
% of reads unmapped: too many mismatches | 0.00%
Number of reads unmapped: too short | 1031175
% of reads unmapped: too short | 1.40%
Number of reads unmapped: other | 38958
% of reads unmapped: other | 0.05%
CHIMERIC READS:
Number of chimeric reads | 0
% of chimeric reads | 0.00%
Then I used the following code for featureCounts:
subread-2.0.1-Linux-x86_64/bin/featureCounts -T 3 -s 0 -p \
-a …STAR/Homo_sapiens.GRCh38.98.gtf \
-o …Files_pyega3/RNA-Seq/$i/featureCounts/${i}_s0_p.txt \
…Files_pyega3/RNA-Seq/$i/*.rnaseq.bam \
... and got those results:
PA4_s0_p.txt.summary:
Assigned 42229610
Unassigned_Unmapped 0
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 54258403
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 9054644
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 7179999
PA4_s1_p.txt.summary
Assigned 2786940
Unassigned_Unmapped 0
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 54258403
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 55561918
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 115395
PA4_s2_p.txt.summary
Assigned 42463622
Unassigned_Unmapped 0
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 54258403
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 10021812
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 5978819
Dataset No. 2
paired: yes
stranded: no? (according to the paper and allegedly used kit it isn't, I yet have to check that with infer_experiment.py)
In dataset No. 2 the unassigned reads fall mostly into the category of "NoFeatures", some into "MultiMapping" as well as "Unmapped" and little into "Ambiguity".
Dataset No. 2 I didn't map/align myself and bam-files were already given.
I used the exact same code for featureCounts as for dataset No. 1.
I got the following results:
LGG766_s0_p.txt.summary
Assigned 11628269
Unassigned_Unmapped 1801678
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 4651389
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 28606632
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 590109
LGG766_s1_p.txt.summary
Assigned 2778622
Unassigned_Unmapped 1801678
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 4651389
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 37951173
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 95215
LGG766_s2_p.txt.summary
Assigned 9536998
Unassigned_Unmapped 1801678
Unassigned_Read_Type 0
Unassigned_Singleton 0
Unassigned_MappingQuality 0
Unassigned_Chimera 0
Unassigned_FragmentLength 0
Unassigned_Duplicate 0
Unassigned_MultiMapping 4651389
Unassigned_Secondary 0
Unassigned_NonSplit 0
Unassigned_NoFeatures 31018259
Unassigned_Overlapping_Length 0
Unassigned_Ambiguity 269753
Maybe someone on here recognizes the pattern I'm getting or has some ideas on how to resolve the problem or to at least name/guess the problem, because I'm totally lost here.
Thanks a lot in advance!
Neat
Hello, could you solve it?. I have similar problem.
Please use add comment to add someting to existing question. and not Add answer.
I've moved their post to a comment this time.