Hi,
I would like to plot RNA-seq reads distance to all 3'UTR in the reference GTF to see where the read end up. I could figure out something but I wonder if there is a better one, because the output file I get in roughly 80Go !! So not handy for plotting...
Here is what I got.
- Get featureCounts output BAM file: I do not filter this BAM file as I precisely want to keep both assigned and unassigned reads.
- Turn reference annotation GTF to BED file (which I have already pre-filtered to get only 3'UTR features), as well as the previous BAM to BED file. Here are tiny examples of both BED files:
reads.bed
chr1 30 40 b1 1 +
chr1 100 150 b4 1 +
chr1 270 350 b2 1 +
chr1 270 300 b3 1 -
chr1 600 700 b5 1 +
chr1 1000 4000 b6 1 -
chr1 1000 1030 b7 1 +
annotation.bed
chr1 10 20 a1 1 +
chr1 300 400 a2 1 +
chr1 350 400 a3 1 -
Then I use
bedtools closest
to get the distance of each read to the closest 3'UTR:bedtools closest -a reads.bed -b annotation.bed -D b -s > distances.bed
Here's a tiny example ofdistances.bed
(distances are in the last column):chr1 30 40 b1 1 + chr1 10 20 a1 1 + 11 chr1 100 150 b4 1 + chr1 10 20 a1 1 + 81 chr1 270 350 b2 1 + chr1 300 400 a2 1 + 0 chr1 270 300 b3 1 - chr1 350 400 a3 1 - 51 chr1 600 700 b5 1 + chr1 300 400 a2 1 + 201 chr1 1000 4000 b6 1 - chr1 350 400 a3 1 - -601 chr1 1000 1030 b7 1 + chr1 300 400 a2 1 + 601
Lastly, I load the
distances.bed
file in R and plot a density plot of all distances, which gives me something like this:
However, this plot have been made with only a sample of the file distances.bed, as the original one is far too big (80Go). I have a total of ~200 millions reads, so this much data points for plotting and lines in distances.bed
.
I am wondering if there is a better way to reach this ? Maybe using some normalized counts (see example here figure S1F)? But then I wouldn't know how to keep both assigned and unassigned reads.
Thanks for your help !
what does
distances.bed
looks like please ?I just edited my post with files example. Let me know if it's clear enough.