Dear group,
Given RNA-Seq data for either 1 or more than 1 different samples/conditions, I am interested in finding
A. an exon-intron or intron-exon junction with reads spanning these junctions.
B. Another similar question is identify intron-retention, where reads map exon-intron-exon.
A differs from B, where A coverage ends halfway in intron. There are algorithms such as iREAD for identifying intron-retention similar to B above.
Is there a way, say samtools or other tool where I can find intron-exon or exon-intron read overhang extension.
As an example, image is given for easy understanding. Thanks for your help.
-A
Image showing intron-exon read overhang for top sample, whereas for the rest of samples below no such read overhang or coverage is observed.
IMHO, identifying and estimating the abundance of intron retention using exon-intron reads are unreliable at best and often lead to overestimation. In addition to introns having one or both splice sites alternatively spliced, such an approach is sensitive to parameters like overhang and mismatches used during the alignment step. I reckon a better way is to construct an intron database by using both the reference and any novel event identified from sequencing, and count/quantify reads in the bam against it.
The file answer.bed will contain the junction and the number of reads that map to — overlap with — the optionally-padded junction, by one or more bases.
If you want the actual reads that map to the exon-intron junctions:
In this second example, the file answer.bed will contain the reads that map to the junction, by one or more bases.
Some links to Github Gists with listed awk scripts:
(transcripts2mergedExons.awk)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Dear Alex, thanks for your answer. It is very helpful.
If for e.g. I want to find only those reads spanning exon-exon junction then, can I directly jump from step "Convert transcripts to merged exons" to "Convert these to junctions" skipping an intermediate step "Make an exon-intron list"? Will it make sense?
I have used featureCounts function from the Rsubread package for this express purpose and found it worked perfectly. I recommend it if you work in R because:
it gives you precise control over the extent of overlap you want to require for a read to be counted as overlapping your intron-exon junction
it interoperates with spliced alignment BAM files as produced most excellently and quickly by the STAR aligner
you can profitably pass it acceptor and/or donor coordinates using the SAF format (if you do, you will probably want to replace GENEID with JUNCTIONID)
did you look at:
?
and what about your previous questions. e.g:
Obtaining exon-intron and intron-exon readsMapping and annotating DNA binding regions from ChIP-Seq to nearby gene did you leave a comment or marked them as "accepted" ?Hi Pierre, Thanks for the reply. I will check this:Question: Extracting Intron-Exon Reads from bam files
For you second question, I did not post that question - but that is helpful.
thanks!
Dear Alex, Thank you so much. worked beautifully!! Adrian
IMHO, identifying and estimating the abundance of intron retention using exon-intron reads are unreliable at best and often lead to overestimation. In addition to introns having one or both splice sites alternatively spliced, such an approach is sensitive to parameters like overhang and mismatches used during the alignment step. I reckon a better way is to construct an intron database by using both the reference and any novel event identified from sequencing, and count/quantify reads in the bam against it.