I'm using rnaseq data (paired-end and strand-specific) which is rRNA depleted with Ribo-Zero Gold rRNA Removal Kit. I observed that most number of reads are intronic (genomic origin), which is due to Ribo-Zero.
I did the alignment with Hisat2
. Quality check of the alignment was done with RNA-seqQC
(qualimap) and samtools
flagstat stats.
Below I have given the results From RNA-seq QC and samtools. I don't get why the results from both are not the same?
And when I tried extracting read counts with featureCounts
.
featureCounts -a /documents/annotation_data/GRCh38/gencode.v27.annotation.gtf -F GTF -p -s 2 -T 8 -o sample01_RNAseq_cutadapt.txt sample01_RNAseq_cutadapt.sorted.bam
FeatureCounts output:
//========================== featureCounts setting ===========================\\
|| ||
|| Input files : 1 BAM file ||
|| P sample01_RNAseq_cutadapt.sorted.bam ||
|| ||
|| Output file : sample01_RNAseq_cutadapt.txt ||
|| Summary : sample01_RNAseq_cutadapt.txt.summary ||
|| Annotation : gencode.v27.annotation.gtf (GTF) ||
|| Dir for temp files : /documents/data/ ... ||
|| ||
|| Threads : 8 ||
|| Level : meta-feature level ||
|| Paired-end : yes ||
|| Multimapping reads : not counted ||
|| Multi-overlapping reads : not counted ||
|| Min overlapping bases : 1 ||
|| ||
|| Chimeric reads : counted ||
|| Both ends mapped : not required ||
|| ||
\\============================================================================//
//================================= Running ==================================\\
|| ||
|| Load annotation file gencode.v27.annotation.gtf ... ||
|| Features : 1200453 ||
|| Meta-features : 58288 ||
|| Chromosomes/contigs : 25 ||
|| ||
|| Process BAM file sample01_RNAseq_cutadapt.sorted.bam... ||
|| Paired-end reads are included. ||
|| Strand specific : reversely stranded ||
|| ||
|| WARNING: reads from the same pair were found not adjacent to each ||
|| other in the input (due to read sorting by location or ||
|| reporting of multi-mapping read pairs). ||
|| ||
|| Pairing up the read pairs. ||
|| ||
|| Total alignments : 85057355 ||
|| Successfully assigned alignments : 15837729 (18.6%) ||
|| Running time : 6.59 minutes ||
RNA-Seq QC report (Qualimap) after the alignment:
RNA-Seq QC report
-----------------------------------
>>>>>>> Input
bam file = /documents/Hisat2_bamOutputs/sample01_RNAseq_cutadapt.sorted.bam
gff file = /documents/annotation_data/GRCh38/gencode.v27.annotation.gtf
counting algorithm = uniquely-mapped-reads
protocol = strand-specific-reverse
>>>>>>> Reads alignment
reads aligned (left/right) = 40,475,469 / 39,127,135
read pairs aligned = 38,373,732
total alignments = 164,740,444
secondary alignments = 85,137,840
non-unique alignments = 56,300,532
aligned to genes = 2,341,222
ambiguous alignments = 143,275
no feature assigned = 10,530,260
not aligned = 3,496,915
>>>>>>> Reads genomic origin
exonic = 2,341,222 (18.19%)
intronic = 9,502,317 (73.82%)
intergenic = 1,027,943 (7.99%)
overlapping exon = 344,632 (2.68%)
>>>>>>> Transcript coverage profile
5' bias = 0.79
3' bias = 0.84
5'-3' bias = 0.92
And this is the result from samtools stats
# Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN raw total sequences: 83099498
SN filtered sequences: 0
SN sequences: 83099498
SN is sorted: 1
SN 1st fragments: 41549807
SN last fragments: 41549691
SN reads mapped: 79602782
SN reads mapped and paired: 77446534 # paired-end technology bit set + both mates mapped
SN reads unmapped: 3496716
SN reads properly paired: 76747534 # proper-pair bit set
SN reads paired: 83099498 # paired-end technology bit set
SN reads duplicated: 0 # PCR or optical duplicate bit set
SN reads MQ0: 496791 # mapped and MQ=0
SN reads QC failed: 0
SN non-primary alignments: 85137861
SN total length: 10934613644 # ignores clipping
SN total first fragment length: 5462252543 # ignores clipping
SN total last fragment length: 5472361101 # ignores clipping
SN bases mapped: 10441500710 # ignores clipping
SN bases mapped (cigar): 10424391662 # more accurate
SN bases trimmed: 0
SN bases duplicated: 0
SN mismatches: 26253838 # from NM fields
SN error rate: 2.518501e-03 # mismatches / bases mapped (cigar)
SN average length: 131
SN average first fragment length: 131
SN average last fragment length: 132
SN maximum length: 147
SN maximum first fragment length: 147
SN maximum last fragment length: 147
SN average quality: 38.4
SN insert size average: 682.5
SN insert size standard deviation: 1667.3
SN inward oriented pairs: 38345988
SN outward oriented pairs: 118451
1) I see that only 18.6% reads were successfully assigned in featureCounts output. Why there are very low in number?
2) Why the RNA-seq QC and samtools results are different?
Any help is appreciated. thanq
I see that higher intronic mapping rate is see with rRNA removal. Here wee used Ribo Zero tool kit. Equal distribution of reads mapping to intronic, exonic and intergenic regions suggests that there is DNA contamination. I got this information Fromm here rna-seq QC tutorial same seen here too [https://link.springer.com/article/10.1186/1471-2164-15-675]
I have data from 300 tumor samples. And for all I see that genomic origin of reads are higher in intronic region.
Have you visually inspected the alignments? Just because a paper said something it may not be applicable to your data. It is certainly possible that the rRNA removal did not work as expected. If that is the case then you should explicitly use the rDNA repeat for human (or something like SortMeRNA) to see if your data has rRNA contamination by doing those alignments.
I have 600 fastq.gz files. Could you please show me one example how to use
SortMeRNA
on .fastq.gz file please. thanqProgram manual for SortMeRNA is detailed and gives necessary examples. Start with a sample or two first. After you understand the usage think of analyzing remaining data.