Hi all,
Sorry for the very basic question. Since i mapped my sequencing files to genome i could see more than 40% reads mapping to the intronic region. So i would like to extract the counts of introns. So what i did was, i used featurecounts to extract the gene counts with following command
featureCounts -F GTF -p -t gene -g gene_id -p -P -B -C -s 2 -d 1 -D 25000 -T 6 -a Homo_sapiens.GRCh38.86.gtf -o genecounts.csv file.bam
And the for the exons
featureCounts -F GTF -p -t exon -g gene_id -p -P -B -C -s 2 -d 1 -D 25000 -T 6 -a Homo_sapiens.GRCh38.86.gtf -o exoncounts.csv file.bam
Now i have gene body counts and exon counts so my idea is genecounts - exoncounts = introncounts. To verify this i build intron file by merging overlapping exons of the transcripts for a given gene. Then i extracted the counts.
featureCounts -F SAF -p -P -B -C -s 2 -d 1 -D 25000 -T 6 -a Intron.txt -o introncounts.csv file.bam
When i subtracted the genecounts - exoncounts i'm getting different counts compare to introncounts. Why it is like this? I would assume to have same counts like genecounts - exoncounts = counts. The one possible problem what i could think of is that during the exon count the overlapping exons might have been discarded by featurecounts right? What else could be else the problems are? How can i sort this out
Example
ENSG00000279457 515 - Exoncount
ENSG00000279457 1262 - Genecount
ENSG00000279457 938 - Introncount
It's probably easier to add introns to your GTF using AGAT and then quantify on those introns instead of exons with featureCounts.
Actually the tools is missing certain exons in the middle and calling as a intron. This is the file i constructed as a intron
But when i use AGAT tool
I have attached a IGV screen shot for your reference. Is that normal?
It sounds like the third line is an intergenic region, not intronic
It is intronic, Can i ask you why? From my point of view, there is a small exon-ENSE00001799933 missing in the refseq annotation. There are 2 transcripts for the given gene and overlapping exons were mergerd.
I don't get how you ended with this result so I checked using Ensembl Homo_sapiens.GRCh38.86.gtf file.
The gene is like that:
There are two transcript and what you show for AGAT seems to correspond to the first one, 3 exons => 2 introns
Here the result with AGAT v1.0.0
and (it does not fit in one message)
You guys bring tears in my eyes..Anyway to the point i may not explained properly. for the given gene if transcripts exons overlaps, i merged those exons for the simplification. so ,
The transcript_1-exon-1 can overlap with transcript_2-exon-1, transcript_2-exon-2 and
transcript_1-exon-2 takes transcript_2-exon-3
transcript_1-exon-3 again overlap with transcript_2-exon-5, transcript_2-exon-6
Among these transcript_2-exon-4 is missing so in my file that is considered and eventually became
1 12721 12974 ENSG00000223972 DDX11L1 +
Is that correct? am i wrong?
The first rule of any counting process in bioinformatics is that:
Numbers never add up.
When I do any counting with one method,
samtools view -c
over a range, then I do it with a different tool,bedtools
, then I do withfeaturecounts
, the numbers will be close but never precisely the same. Each tool may apply various correction, makes assumptions etc depending on the input. I would not lose too much sleep over it.Thanks for your suggestion, i wanna ask one more question as well. I divided my RNA-Seq data into three parts as Intron counts, Exon counts and Gene counts. and i made DESeq counts for all of them in a WT and KO situation. when we looked the intron counts , My post-doc is saying that introns are much longer than exons. In most of the situation introns can accommodate more reads in horizontal way but not in vertical way like exons (like a peak). He said that, it may not be significant because the reads are distributed across the intron so he wanted me to normalize the intron by length. In general i assumed introns are another feature of gene and it can be treated like exons, since it is a total RNA seq i can have both. What would be the legitimate way to normalize the introns? can you please suggest something..Thanks
this is a different question and not related to the main question, and it would best be asked separately. RNA-Seq data, in general, would not normally contain introns at all.
also, the statement about accommodating reads "vertically" vs "horizontally" seems not quite the right interpretation of what happens.
Each copy of the transcript, once fragmented, will produce several reads. A longer sequence would produce more reads, but it is also proportionally longer. Hence the length of a sequence has no effect on the coverage of that sequence. Thus there is no such thing as a longer sequence being covered by fewer reads because it is longer.
Introns are covered at a low (normally zero) rate because they are not expressed and not because they are long.
Thanks Albert for your reply. Since you partly answered this question i just wanna ask quick follow-up. RNA-Seq data, in general, would not normally contain introns at all.. Actually it is dependent on the library preparation right. If you do poly-A selection it i possible to get few intronic reads since i'm using total RNA there is a high probability of getting introns as well (correct me if i'm wrong). It's not because of gDNA contamination because i could hardly see any reads on intergenic (The library prepared and sequenced in the one of the best institute in the world). What i'm assuming is that it could be possible that i'm fishing the premRNA vs matureRNA. Around 40% of reads goes to introns.
I am not a biologist by training, and I will admit to quite a bit of missing knowledge as to when (in a time frame) does the splicing occur exactly and how much the relative concentrations of pre-mRNA are relative to mature mRNA. How much of each would be present at any given time?
If the processing time of a pre-mRNA is much shorter than that of mature mRNA, then the former would be present in a vastly lower abundance at any given moment since the mature mRNA would accumulate - so there is the issue of what information remains and is present when the sample is extracted.
In my understanding total RNA typically means ribosomal RNA + mRNA, and that ribosomal RNA will dominate the sample at some 80 to 90%. I have never done it myself, hence I rely on the memory of what I read, but isn't the purpose of selecting polyA to enrich mRNA relative to rRNA and not enrich pre-mRNA to mature mRNA.
I have never seen systematically covered introns in RNA-Seq data. I have seen strong signals and peaks from intronic and noncoding regions, which are puzzling, but I have never seen introns completely filled in. In those cases, I would expect annotation errors.
I am not saying it is impossible - just that it is unexpected - and one should consider other reasons. As far as I know, typical polyadenylated RNA-seq does not have many reads mapping to introns - but I could be off - I have seen many unexpected features of data.
Great reply. These cells are not synchronized cells, which means we have no control over the cell division. But they are homogenous in terms of genotype. The reason why people do poly-A selection mostly for pol-II and mature transcripts. But in the total RNA preparation yes, we do deplete the ribosomes and random priming right.
how much the relative concentrations of pre-mRNA are relative to mature mRNA . It is hard to say right depends on the genes. In general heat shock and stress response tend to have lot of premRNA for the quick action.
Here is a paper I just ran into (published just a few days ago, Nov 11, 2022), that sounds to contain relevant information
Retained introns in long RNA-seq reads are not reliably detected in sample-matched short reads
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02789-6
Wow, the abstract itself scares me. However i will go through the paper.