Question

How to assign Active Enhancer to every genes within 10 kb region?

1

Entering edit mode

7.8 years ago

Bioinformatist Newbie ▴ 270

I have a list of Active Enhancer peaks (bed file) and initially I wanted it assign them to closest target gene. I used Bedops's closest-features to do it. Active Enhancers:

chr1    34737   34937   7_Weak_Enhancer 0   .   34737   34937   255,252,4
chr1    35737   35937   7_Weak_Enhancer 0   .   35737   35937   255,252,4
chr1    36137   36337   7_Weak_Enhancer 0   .   36137   36337   255,252,4

Active Enhancer Annotated to Nearest Gene:

chr1    34737   34937   7_Weak_Enhancer 0   .   34737   34937   255,252,4|chr1  24610   44610   NR_026820   FAM138F -
chr1    35737   35937   7_Weak_Enhancer 0   .   35737   35937   255,252,4|chr1  24610   44610   NR_026820   FAM138F -
chr1    36137   36337   7_Weak_Enhancer 0   .   36137   36337   255,252,4|chr1  24610   44610   NR_026820   FAM138F -

Bedops command used was:

closest-features --closest sortFinalEnhancer.txt sorted10kGTF.bed > ActiveEnhancerCatalog.bed

Now I want to assign each Enhancer to all the genes within 20 kb range (not the nearest gene within 20 kb range but all of them which are in 20 kb range). For example, the output I want will look like this (for first enhancer peak):

chr1    34737   34937   7_Weak_Enhancer 0   .   34737   34937   255,252,4|chr1  24610   44610   NR_026820   FAM138F -
chr1    34737   34937   7_Weak_Enhancer 0   .   34737   34937   255,252,4|chr1  24610   44610   NR_026820   FAM138F -
chr1    34737   34937   7_Weak_Enhancer 0   .   34737   34937   255,252,4|chr1  24610   44610   NR_026820   FAM138F -

So the 3 different genes which were mapped 1-1 before (because of nearest gene criteria within 10 kb region) are now mapped against 1 enhancer peak because they all fall with 20 kb range.

Can anyone guide me if there is any Bedops option or any other possible option to do it? Thank you.

Enhancer Bedops Closest • 2.4k views

ADD COMMENT • link updated 7.8 years ago by Alex Reynolds 36k • written 7.8 years ago by Bioinformatist Newbie ▴ 270

score 2 · Answer 1 · 2017-02-24

2

Entering edit mode

7.8 years ago

Alex Reynolds 36k

You would use bedmap instead of closest-features to map (or associate) all genes to enhancers. Add the --range N operand to pull in associations over N bases around the reference elements (enhancers), say a 20 kb window:

$ bedmap --range 10000 --echo --echo-map enhancers.bed genes.bed > answer.bed

The --echo-map operand gives you all the detail about mapped genes. There are other --echo-map-* operations to get a subset of data about mapped genes; see --help or the documentation for more information.

ADD COMMENT • link 7.8 years ago by Alex Reynolds 36k

0

Entering edit mode

Thanks a lot Alex for replying. I have a little problem, I already tried this way as you suggested in your answer in old question but the problem is I get all the overlaps in one line separated by ;. While I want the answer in the format as mentioned above in the question. Currently what I'm getting for my sample input file is: chr1 500 1000 Enh1 3.45 5.67|chr1 1500 2500 NR_9088 ATF2 -;chr1 11000 12000 NR_879 ATF1 +;chr1 19000 21000 NR_876 ATF1 + while what I want is:

chr1    500 1000    Enh1    3.45    5.67|chr1   1500    2500    NR_9088 ATF2    -

chr1    500 1000    Enh1    3.45    5.67|chr1   11000   12000   NR_879  ATF1    +

chr1    500 1000    Enh1    3.45    5.67|chr1   19000   21000   NR_876  ATF1    +

ADD REPLY • link 7.8 years ago by Bioinformatist Newbie ▴ 270

1

Entering edit mode

You can pipe to awk to reformat the output, e.g.:

$ bedmap --range 10000 --echo --echo-map enhancers.bed genes.bed | awk -F'|' '{ reference = $1; mappings = $2; n = split(mappings, m, ";"); for(i = 1; i <= n; i++) { print reference"|"m[i]; } }' > answer.bed

This could blow up the file, if you have a lot of overlaps. I suggest using standard input and output streams where you can, to minimize file sizes or the writing of intermediate files.

ADD REPLY • link 7.8 years ago by Alex Reynolds 36k

0

Entering edit mode

Yes, it works. A little problem left is: I'm trying to use --delim '\t' after the --echo-map but it is not working. can you guide me if I'm doing something wrong here? This is what I am trying: bedmap --range 20000 --echo --echo-map --delim '\t' enhancers.bed genes.bed | awk -F'|' '{ reference = $1; mappings = $2; n = split(mappings, m, ";"); for(i = 1; i <= n; i++) { print reference"|"m[i]; } }'

ADD REPLY • link 7.8 years ago by Bioinformatist Newbie ▴ 270

0

Entering edit mode

I can do it by using sed -i 's/|/\t/g' answer.bed but I was wondering why the --delim "\t" is not working !!

ADD REPLY • link 7.8 years ago by Bioinformatist Newbie ▴ 270

1

Entering edit mode

If you switch the --delim character from the default pipe character (|) to a tab character (\t) then you will no longer be able to use | in awk as a field separator between the reference element (reported by --echo) and the mapped elements (reported by --echo-map).

You could use sed at the very end, but you probably want to keep that | character around until then, so that you have a delimiter between reference and mapped elements that is something other than a tab character. This is because the delimiter between fields inside a reference element and inside one or more mapped elements is also a tab character, and so that makes separation more difficult.

ADD REPLY • link 7.8 years ago by Alex Reynolds 36k