Steps via awk
and BEDOPS.
Build a BED file of exons:
$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gtf.gz \
| gunzip -c - \
| awk '($3=="exon")' - \
| gtf2bed - \
| cut -f1-6 - \
> gencode.v28.annotation.exons.c1t6.bed
Convert transcripts to merged exons:
$ awk -f transcripts2mergedExons.awk gencode.v28.annotation.exons.c1t6.bed > gencode.v28.annotation.mergedExons.c1t6.bed
Make an exon-intron list:
$ awk -f mergedExons2exonIntronList.awk gencode.v28.annotation.mergedExons.c1t6.bed > gencode.v28.annotation.exonsAndIntrons.c1t6.bed
Convert these to junctions:
$ awk -f exonIntronList2JunctionList.awk gencode.v28.annotation.exonsAndIntrons.c1t6.bed > gencode.v28.annotation.exonsIntronJunctions.c1t6.bed
Optionally, pad them, e.g. by 25 nt around the junction (adjust as needed):
$ bedops --everything --range 25 gencode.v28.annotation.exonsIntronJunctions.c1t6.bed > gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed
Map reads to the padded junctions:
$ bedmap --echo --count --delim '\t' gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed <(bam2bed < reads.bam) > answer.bed
The file answer.bed
will contain the junction and the number of reads that map to — overlap with — the optionally-padded junction, by one or more bases.
If you want the actual reads that map to the exon-intron junctions:
$ bedmap --echo-map --delim '\t' gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed <(bam2bed < reads.bam) | sort-bed - > answer.bed
In this second example, the file answer.bed
will contain the reads that map to the junction, by one or more bases.
Some links to Github Gists with listed awk
scripts:
- (
transcripts2mergedExons.awk
)
- (
mergedExons2exonIntronList.awk
)
- (
exonIntronList2JunctionList.awk
)
did you look at:
?
and what about your previous questions. e.g:
Obtaining exon-intron and intron-exon readsMapping and annotating DNA binding regions from ChIP-Seq to nearby gene did you leave a comment or marked them as "accepted" ?Hi Pierre, Thanks for the reply. I will check this:Question: Extracting Intron-Exon Reads from bam files
For you second question, I did not post that question - but that is helpful.
thanks!
Dear Alex, Thank you so much. worked beautifully!! Adrian
IMHO, identifying and estimating the abundance of intron retention using exon-intron reads are unreliable at best and often lead to overestimation. In addition to introns having one or both splice sites alternatively spliced, such an approach is sensitive to parameters like overhang and mismatches used during the alignment step. I reckon a better way is to construct an intron database by using both the reference and any novel event identified from sequencing, and count/quantify reads in the bam against it.