Hello,
I have 2 bed files: one with standard length regions of 400 bp (a.bed) and one with Transcription start sites (TSS) of 1 bp length (b.bed), shown below.
I want to overlay/overlap the two files taking into consideration strandness (same strands) and plot the average expression in tpm (column 5 of b.bed) as y-axis on a single bp resolution of the 400 length regions of a.bed (as x-axis). Anyway I can do that using bedtools and/or R(GRanges/ggplot)? Is some kind of window-based approach mandatory in this case?
So the preferred resulting plot could be something like " Average expression profile vs distance from the midpoint of the 400bp region"
Thanks in advance.
a.bed
chr1 564878 565278 . . -
chr1 567998 568398 . . -
chr1 570359 570759 . . -
chr1 570535 570935 . . +
b.bed
chr1 565702 565703 chr1:565702-565703,- 0.0206969 -
chr1 566022 566023 chr1:566022-566023,- 0.0206969 -
chr1 566754 566755 chr1:566746-566767,+ 0.320802 +
chr1 566783 566784 chr1:566771-566790,- 0.0724392 -
chr1 566893 566894 chr1:566891-566908,+ 0.155227 +
chr1 566925 566926 chr1:566924-566930,- 0.124181 -
chr1 566935 566936 chr1:566935-566936,+ 0.651953 +
chr1 567067 567068 chr1:567067-567068,+ 0.1655753 +
chr1 567583 567584 chr1:567583-567584,- 0.0206969 -
--UPDATE/clarification--
The desired plot would look something like figures 5 and 6 on this link :http://cistrome.org/cr/help.php. An average expression profile from b.bed over the 400 bp regions from a.bed.
Are the region in a.bed never overlapping to each other ?
There are actually overlapping entries inside a.bed. (And there are several overlaps with b.bed, i just show the construct in the example above (answering to your comment in your response below)). But i can subset based on some annotation and have non-overlapping subsets. So you can consider that they don't overlap if that's easier
Ok then let say that a an entry in b is found in two region of a (thus overlapping). For example region of a : a1: 1-400 and a2: 50-450 ; position of b : 100 . So b will be at a relative position of 100 for a1 and 50 for a2 . So should the single position of b (100) be counted twice ?
No, that's what i am referring to as a summarised average. In order for it to be meaningful these cases should be added up and divided by the number of counts observed (or just get the expression value from b.bed once) and then averaged across all the 400bps of the a.bed. But in any case I am interested in subsets of a.bed entries that won't be overlapping to eachother so no need to bother with that.
Thanks, but I was thinking of genome wide and not per chromosome resolution. As I said Average expression (y-axis) vs 400 bp region of the a.bed regions. Like a type of ChIP-seq signal over genomic region plot (e.g. some bp around a TSS). Any solutions?
That's what I proposed. For each region in a.bed perform the average TPM in b.bed.
Maybe I haven't explained it correctly I want to get summarised average expression/expression profile across the region of 400bp. In other words, I want to know where I get expression from b.bed across these 400 bp in a.bed and not on a chromosome basis. To visualise it: Check figure5 and 6 in this link : http://cistrome.org/cr/help.php
Ok that's different ;)