Hi everybody,
I'm reading the steps of generation of read counts in an RNA-seq study which are as follows:
"We used Casava 1.8.2 (Illumina) to demultiplex and process the raw reads into fastq files.
RNA-seq reads in FASTQ files were aligned to hg19 using Tophat 2.0.0
Reads were assigned to annotated genes and counted using the "summarizeOverlaps" function from the GenomicRanges Bioconductor package in "IntersectionNotEmpty" mode (TODO cite GenomicRanges: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003118 ).
Overlaps were computed to features of type "exon" with a Parent feature has type "mRNA" (excluding "ncRNA" in particular), grouped by gene ID (the Parent property of the mRNA feature to which each exon belongs)."
I can understand the first three steps, but I have a hard time understanding why overlaps were computed to features of type "exon"...
What is the last step?
Aren't the first three steps enough to calculate read counts?
Thanks
I think the last step is just explaining in more detail the kind of summarization done with summarizeOverlaps(). Take a look at the summarizeOverlap vignette in the GenomicAlignments package.
Indeed, the last paragraph seems to discuss which features were selected to perform counting on: overlaps on exons from mRNA but not ncRNA and grouped by gene ID (gene name)
So does it mean the first three steps are enough to count reads?
Yes. But in order to replicate their results you need to consider the last sentence, as it tells you how exactly the counting was done. Different counting strategies may give you different counts.
WikiPedia (Alternative_splicing) to the rescue.