Identify reads that span intron-exon junctions in RNA-Seq

6.9 years ago

oriolebaltimore ▴ 190

Dear group, given RNA-Seq data for one or more samples/conditions, I am interested in finding:

A. Exon-intron or intron-exon junctions with reads spanning them.
B. Intron retention, where reads map across exon-intron-exon. This is a similar but separate question.

A differs from B in that, in A, coverage ends partway into the intron. There are algorithms such as iREAD for identifying intron retention as in B.

Is there a way, with samtools or another tool, to find reads that overhang an intron-exon or exon-intron junction?

An image is given below as an example. Thanks for your help.

-A

Image showing intron-exon read overhang for top sample, whereas for the rest of samples below no such read overhang or coverage is observed.

ADD COMMENT • link updated 6.2 years ago by Malcolm.Cook ★ 1.5k • written 6.9 years ago by oriolebaltimore ▴ 190

Did you look at "Extracting Intron-Exon Reads from bam files"?

ADD REPLY • link 6.9 years ago by Pierre Lindenbaum 166k

And what about your previous questions, e.g. ~~Obtaining exon-intron and intron-exon reads~~ "Mapping and annotating DNA binding regions from ChIP-Seq to nearby gene"? Did you leave a comment or mark them as "accepted"?

ADD REPLY • link 6.9 years ago by Pierre Lindenbaum 166k

Hi Pierre, thanks for the reply. I will check this: Question: Extracting Intron-Exon Reads from bam files

As for your second question, I did not post that question, but it is helpful.

thanks!

ADD REPLY • link 6.9 years ago by oriolebaltimore ▴ 190

Dear Alex, thank you so much. It worked beautifully! Adrian

ADD REPLY • link 6.8 years ago by oriolebaltimore ▴ 190

IMHO, identifying intron retention and estimating its abundance from exon-intron reads is unreliable at best and often leads to overestimation. Besides the problem of introns whose splice sites (one or both) are alternatively spliced, such an approach is sensitive to alignment parameters like overhang and mismatches. I reckon a better way is to construct an intron database from both the reference annotation and any novel events identified from sequencing, and then count/quantify the reads in the BAM against it.
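To make the overhang sensitivity concrete, here is a minimal awk sketch that keeps only reads covering at least a chosen number of bases on both sides of one boundary. The helper name `min_overhang_reads`, the threshold of 8 and the toy coordinates are made up for illustration. It also assumes the reads are ungapped BED intervals, so spliced alignments would have to be split into their aligned blocks first.

```shell
# Hypothetical helper: min_overhang_reads CHR BOUNDARY MIN_OVERHANG < reads.bed
# Prints reads that cover at least MIN_OVERHANG bases on both sides of the
# 0-based boundary coordinate, with the two overhang lengths appended.
min_overhang_reads() {
    awk -v chr="$1" -v p="$2" -v m="$3" 'BEGIN { FS = OFS = "\t" }
    $1 == chr {
        left = p - $2      # bases before the boundary
        right = $3 - p     # bases at or after the boundary
        if (left >= m && right >= m) print $0, left, right
    }'
}

# Toy reads around a made-up exon end at chr1:1000
printf 'chr1\t930\t1005\tread1\nchr1\t960\t1035\tread2\nchr1\t1010\t1085\tread3\n' \
    | min_overhang_reads chr1 1000 8
```

With these toy reads only read2 passes: read1 reaches just 5 bases past the boundary and read3 starts after it. Raising or lowering the threshold changes which reads count, which is the sensitivity described above.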

ADD REPLY • link 6.2 years ago by Eric Lim ★ 2.2k

6.9 years ago

Alex Reynolds 36k

Steps via awk and BEDOPS.

Build a BED file of exons:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.annotation.gtf.gz \
    | gunzip -c - \
    | awk '($3=="exon")' - \
    | gtf2bed - \
    | cut -f1-6 - \
    > gencode.v28.annotation.exons.c1t6.bed

Convert transcripts to merged exons:

$ awk -f transcripts2mergedExons.awk gencode.v28.annotation.exons.c1t6.bed > gencode.v28.annotation.mergedExons.c1t6.bed

Make an exon-intron list:

$ awk -f mergedExons2exonIntronList.awk gencode.v28.annotation.mergedExons.c1t6.bed > gencode.v28.annotation.exonsAndIntrons.c1t6.bed

Convert these to junctions:

$ awk -f exonIntronList2JunctionList.awk gencode.v28.annotation.exonsAndIntrons.c1t6.bed > gencode.v28.annotation.exonsIntronJunctions.c1t6.bed

Optionally, pad them, e.g. by 25 nt around the junction (adjust as needed):

$ bedops --everything --range 25 gencode.v28.annotation.exonsIntronJunctions.c1t6.bed > gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed
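For readers who want to see what the padding does to each interval, here is a rough plain-awk equivalent. It is only a sketch of the arithmetic, not how BEDOPS is implemented. The helper name `pad_bed` and the toy coordinates are made up, and clamping the start at 0 is my assumption about the desired behaviour near chromosome starts.

```shell
# Hypothetical helper: pad_bed PAD < in.bed
# Extends every BED interval by PAD bases on both sides.
pad_bed() {
    awk -v pad="$1" 'BEGIN { FS = OFS = "\t" }
    {
        $2 = $2 - pad
        if ($2 < 0) $2 = 0   # assumption: never go below the chromosome start
        $3 = $3 + pad
        print
    }'
}

# Toy single-base junctions (made-up coordinates)
printf 'chr1\t1000\t1001\tgeneA\t2\t+\nchr1\t10\t11\tgeneB\t2\t-\n' | pad_bed 25
```

A single-base junction at 1000-1001 becomes the 51-base window 975-1026, which is the region reads are later tested against.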

Map reads to the padded junctions:

$ bedmap --echo --count --delim '\t' gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed <(bam2bed < reads.bam) > answer.bed

The file answer.bed will contain each junction and the number of reads that map to (overlap with) the optionally padded junction by one or more bases.
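As a toy illustration of the counting criterion (overlap by one or more bases), the sketch below does the same kind of count in plain awk. It is a naive all-against-all loop meant only to show the overlap test, not a replacement for bedmap on real data. The helper name `count_overlaps` and the toy files are made up.

```shell
# Hypothetical helper: count_overlaps JUNCTIONS.bed READS.bed
# Prints each junction line followed by the number of overlapping reads.
count_overlaps() {
    awk 'BEGIN { FS = OFS = "\t" }
    NR == FNR {                      # first file: remember the junctions
        chr[NR] = $1; start[NR] = $2 + 0; stop[NR] = $3 + 0
        line[NR] = $0; n = NR
        next
    }
    {                                # second file: test every read
        for (i = 1; i <= n; i++) {
            # half-open intervals overlap when each one starts before the other ends
            if ($1 == chr[i] && $2 < stop[i] && $3 > start[i]) count[i]++
        }
    }
    END { for (i = 1; i <= n; i++) print line[i], count[i] + 0 }' "$1" "$2"
}

# Toy data (made-up coordinates)
tmp=$(mktemp -d)
printf 'chr1\t975\t1026\tgeneA\t2\t+\n' > "$tmp/junctions.bed"
printf 'chr1\t900\t980\tread1\nchr1\t1026\t1100\tread2\nchr1\t1000\t1075\tread3\n' > "$tmp/reads.bed"
count_overlaps "$tmp/junctions.bed" "$tmp/reads.bed"
```

Here read1 and read3 overlap the padded junction, while read2 only touches its end coordinate and is not counted, so the junction line gets a count of 2.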

If you want the actual reads that map to the exon-intron junctions:

$ bedmap --echo-map --skip-unmapped --multidelim '\n' gencode.v28.annotation.exonsIntronJunctions.pad25.c1t6.bed <(bam2bed < reads.bam) | sort-bed - > answer.bed

In this second example, the file answer.bed will contain the reads that map to the junction, by one or more bases.

The awk scripts listed above, as posted in GitHub Gists:

(transcripts2mergedExons.awk)

	BEGIN {
		FS="\t";
		old_chr = "chrN";
		old_start = -1;
		old_stop = -1;
		old_id = "*";
		old_score = "*";
		old_strand = "*";
		new_element_flag = 0;
	}
	{
		new_chr = $1;
		new_start = $2;
		new_stop = $3;
		new_id = $4;
		new_score = $5;
		new_strand = $6;

		if (old_id != new_id) {
			if (old_id != "*") {
				old_score = 1; # exon
				print old_chr"\t"old_start"\t"old_stop"\t"old_id"\t"old_score"\t"old_strand;
			}
			new_score = 1; # exon
			print new_chr"\t"new_start"\t"new_stop"\t"new_id"\t"new_score"\t"new_strand;
			old_id = new_id;
			new_element_flag = 1;
		}
		else {
			# find genomic difference between new and old elements
			diff_start = new_start - old_stop;
			if (diff_start > 0) {
				# we have an intron between new and old elements
				if (new_element_flag == 0) {
					# print the current old element
					old_score = 1; # we know it is an exon
					print old_chr"\t"old_start"\t"old_stop"\t"old_id"\t"old_score"\t"old_strand;
				}
				else {
					# we are no longer at the start of a new element
					new_element_flag = 0;
				}
			}
			else {
				# alternate splice? some other transcript of the same exon?
				# preserve the old start coordinate
				if (old_start != -1) {
					new_start = old_start;
				}
				if (new_stop < old_stop) {
					new_stop = old_stop;
				}
			}
		}

		old_chr = new_chr;
		old_start = new_start;
		old_stop = new_stop;
		old_id = new_id;
		old_score = new_score;
		old_strand = new_strand;
	}
	END {
	}

(mergedExons2exonIntronList.awk)

	BEGIN {
		FS="\t";
		old_chr = "chrN";
		old_start = -1;
		old_stop = -1;
		old_id = "*";
		old_score = "*";
		old_strand = "*";
	}
	{
		new_chr = $1;
		new_start = $2;
		new_stop = $3;
		new_id = $4;
		new_score = $5;
		new_strand = $6;

		if (old_id != new_id) {
			new_score = 1; # exon
			print new_chr"\t"new_start"\t"new_stop"\t"new_id"\t"new_score"\t"new_strand;
			old_id = new_id;
		}
		else {
			# we construct and print an intron, then print the exon
			old_score = 0;
			print old_chr"\t"old_stop"\t"new_start"\t"old_id"\t"old_score"\t"old_strand;
			new_score = 1; # exon
			print new_chr"\t"new_start"\t"new_stop"\t"new_id"\t"new_score"\t"new_strand;
		}

		old_chr = new_chr;
		old_start = new_start;
		old_stop = new_stop;
		old_id = new_id;
		old_score = new_score;
		old_strand = new_strand;
	}
	END {
	}

(exonIntronList2JunctionList.awk)

	BEGIN {
		FS="\t";
		old_chr = "chrN";
		old_start = -1;
		old_stop = -1;
		old_id = "*";
		old_score = "*";
		old_strand = "*";
	}
	{
		new_chr = $1;
		new_start = $2;
		new_stop = $3;
		new_id = $4;
		new_score = $5;
		new_strand = $6;

		if (old_id != new_id) {
			# new element
			old_id = new_id;
		}
		else {
			# construct and print junction
			old_score = 2;
			print old_chr"\t"old_stop"\t"(old_stop+1)"\t"old_id"\t"old_score"\t"old_strand;
		}

		old_chr = new_chr;
		old_start = new_start;
		old_stop = new_stop;
		old_id = new_id;
		old_score = new_score;
		old_strand = new_strand;
	}
	END {
	}

Dear Alex, thanks for your answer. It is very helpful. If, for example, I want to find only those reads spanning exon-exon junctions, can I jump directly from the step "Convert transcripts to merged exons" to "Convert these to junctions", skipping the intermediate step "Make an exon-intron list"? Would that make sense?
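One possible angle on this, offered as a sketch and not as part of the answer above: with a spliced aligner, reads that span an exon-exon junction usually carry an `N` operation in their CIGAR string, so they can be picked out of the alignments directly. The helper name `splice_gaps` and the toy SAM records below are made up. On real data the input would be SAM text, for example from `samtools view reads.bam`.

```shell
# Hypothetical helper: reads SAM records on stdin and prints one line per
# splice gap: read name, chromosome, gap start (0-based), gap end.
splice_gaps() {
    awk 'BEGIN { FS = OFS = "\t" }
    /^@/ { next }                        # skip SAM header lines
    {
        pos = $4 - 1                     # SAM POS is 1-based; convert to 0-based
        cigar = $6
        while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
            len = substr(cigar, 1, RLENGTH - 1) + 0
            op = substr(cigar, RLENGTH, 1)
            if (op == "N") print $1, $3, pos, pos + len
            if (op ~ /[MDN=X]/) pos += len   # operations that consume the reference
            cigar = substr(cigar, RLENGTH + 1)
        }
    }'
}

# Toy SAM records: r1 is spliced (50M200N25M), r2 is not (75M)
printf 'r1\t0\tchr1\t101\t60\t50M200N25M\t*\t0\t0\t*\t*\nr2\t0\tchr1\t101\t60\t75M\t*\t0\t0\t*\t*\n' \
    | splice_gaps
```

For the toy input only r1 is reported, with the skipped region at chr1:150-350 in 0-based BED-style coordinates. Whether that region matches an annotated intron would still need to be checked against the exon list.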

ADD REPLY • link 5.1 years ago by ankita0007 • 0

6.2 years ago

Malcolm.Cook ★ 1.5k

I have used the featureCounts function from the Rsubread package for this express purpose and found that it worked perfectly. I recommend it if you work in R because:

- it gives you precise control over how much overlap you require for a read to be counted as overlapping your intron-exon junction
- it works with spliced-alignment BAM files, as produced quickly and very well by the STAR aligner
- you can pass it acceptor and/or donor coordinates in SAF format (if you do, you will probably want to replace GENEID with JUNCTIONID)
- it is quite fast and supports multi-threading
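As a rough sketch of preparing such a SAF file, the awk helper below converts junction intervals in BED form into the five SAF columns. The helper name `bed_to_saf` and the junction ID scheme (gene, chromosome and BED start joined by underscores) are made up for illustration. It assumes that SAF coordinates are 1-based and inclusive, so it is worth checking the output against the Rsubread documentation before relying on it.

```shell
# Hypothetical helper: bed_to_saf < junctions.bed > junctions.saf
bed_to_saf() {
    awk 'BEGIN { FS = OFS = "\t"; print "GeneID", "Chr", "Start", "End", "Strand" }
    {
        id = $4 "_" $1 "_" $2           # made-up junction ID used in place of a gene ID
        print id, $1, $2 + 1, $3, $6    # BED start is 0-based; SAF start assumed 1-based
    }'
}

# Toy padded junction (made-up coordinates)
printf 'chr1\t975\t1026\tgeneA\t2\t+\n' | bed_to_saf
```

The resulting table would then be handed to featureCounts as an external annotation (the `annot.ext` argument with `isGTFAnnotationFile = FALSE`). Arguments such as `minOverlap` and `nthreads` cover the overlap control and multi-threading mentioned in the list.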
