Question

Plot assigned / unassigned reads distance to 3'UTR

0

Entering edit mode

4.0 years ago

nlehmann ▴ 150

Hi,

I would like to plot RNA-seq reads distance to all 3'UTR in the reference GTF to see where the read end up. I could figure out something but I wonder if there is a better one, because the output file I get in roughly 80Go !! So not handy for plotting...

Here is what I got.

Get featureCounts output BAM file: I do not filter this BAM file as I precisely want to keep both assigned and unassigned reads.
Turn reference annotation GTF to BED file (which I have already pre-filtered to get only 3'UTR features), as well as the previous BAM to BED file. Here are tiny examples of both BED files:

reads.bed

chr1    30  40  b1  1   +
chr1    100 150 b4  1   +
chr1    270 350 b2  1   +
chr1    270 300 b3  1   -
chr1    600 700 b5  1   +
chr1    1000    4000    b6  1   -
chr1    1000    1030    b7  1   +

annotation.bed

chr1    10  20  a1  1   +
chr1    300 400 a2  1   +
chr1    350 400 a3  1   -

Then I use bedtools closest to get the distance of each read to the closest 3'UTR: bedtools closest -a reads.bed -b annotation.bed -D b -s > distances.bed Here's a tiny example of distances.bed (distances are in the last column):

 chr1   30  40  b1  1   +   chr1    10  20  a1  1   +   11
 chr1   100 150 b4  1   +   chr1    10  20  a1  1   +   81
 chr1   270 350 b2  1   +   chr1    300 400 a2  1   +   0
 chr1   270 300 b3  1   -   chr1    350 400 a3  1   -   51
 chr1   600 700 b5  1   +   chr1    300 400 a2  1   +   201
 chr1   1000    4000    b6  1   -   chr1    350 400 a3  1   -   -601
 chr1   1000    1030    b7  1   +   chr1    300 400 a2  1   +   601

Lastly, I load the distances.bed file in R and plot a density plot of all distances, which gives me something like this:

density plot

However, this plot have been made with only a sample of the file distances.bed, as the original one is far too big (80Go). I have a total of ~200 millions reads, so this much data points for plotting and lines in distances.bed. I am wondering if there is a better way to reach this ? Maybe using some normalized counts (see example here figure S1F)? But then I wouldn't know how to keep both assigned and unassigned reads.

Thanks for your help !

bedtools RNA-seq closest • 1.1k views

ADD COMMENT • link 4.0 years ago by nlehmann ▴ 150

0

Entering edit mode

what does distances.bed looks like please ?

ADD REPLY • link 4.0 years ago by Pierre Lindenbaum 166k

0

Entering edit mode

I just edited my post with files example. Let me know if it's clear enough.

ADD REPLY • link 4.0 years ago by nlehmann ▴ 150