Enhancers With Binding Sites For Common Transcription Factors
11.8 years ago
Diana ▴ 930

Hi all,

I have a file with enhancers in the 1st column and, in the 2nd column, the name of a transcription factor (TF) for which the enhancer has a binding site. I want to find out which enhancers have binding sites for common transcription factors. I made a heatmap in R, but my data set is so large that it is impossible to read off the number of TFs shared by a group of enhancers. How can I accomplish this in R? My data looks like this:

Enhancer           TF
Gene1_Enhancer1    Arid3a
Gene1_Enhancer1    Hoxa4
Gene1_Enhnacer1    Ascl2
Gene1_Enhancer1    EBP
Gene1_Enhancer2    ETS1
Gene2_Enhancer1    ETS1
Gene2_Enhancer1    EBP
Gene2_Enhancer1    Arid3a
Gene2_Enhancer1    Hoxa4
Gene3_Enhancer1    Arid3a
Gene3_Enhancer1    Hoxa4
Gene3_Enhancer1    EBP
Gene3_Enhancer2    Hoxa7
Gene4_Enhancer1    Hoxa4
Gene4_Enhancer1    EBP
Gene4_Enhancer1    Arid3a

Is there a way to write my output to a text file like this, so that each group contains one or more enhancers from all 4 genes?

Group                                       Common TFs
Gene1_Enhancer1, Gene2_Enhancer1,           Arid3a, EBP, Hoxa4
Gene3_Enhancer1, Gene4_Enhancer1

Thanks a lot!!!

r • 2.9k views
11.8 years ago

Use aggregate:

read in the data

Enhancer           TF
Gene1_Enhancer1    Arid3a
Gene1_Enhancer1    Hoxa4
Gene1_Enhnacer1    Ascl2
Gene1_Enhancer1    EBP
Gene1_Enhancer2    ETS1
Gene2_Enhancer1    ETS1
Gene2_Enhancer1    EBP
Gene2_Enhancer1    Arid3a
Gene2_Enhancer1    Hoxa4
Gene3_Enhancer1    Arid3a
Gene3_Enhancer1    Hoxa4
Gene3_Enhancer1    EBP
Gene3_Enhancer2    Hoxa7
Gene4_Enhancer1    Hoxa4
Gene4_Enhancer1    EBP
Gene4_Enhancer1    Arid3a

I stored it in dat, so

aggregate(dat,by=list(dat$TF),paste) gives you

Group.1                                                           Enhancer                             TF
1  Arid3a Gene1_Enhancer1, Gene2_Enhancer1, Gene3_Enhancer1, Gene4_Enhancer1 Arid3a, Arid3a, Arid3a, Arid3a
2   Ascl2                                                    Gene1_Enhnacer1                          Ascl2
3     EBP Gene1_Enhancer1, Gene2_Enhancer1, Gene3_Enhancer1, Gene4_Enhancer1             EBP, EBP, EBP, EBP
4    ETS1                                   Gene1_Enhancer2, Gene2_Enhancer1                     ETS1, ETS1
5   Hoxa4 Gene1_Enhancer1, Gene2_Enhancer1, Gene3_Enhancer1, Gene4_Enhancer1     Hoxa4, Hoxa4, Hoxa4, Hoxa4
6   Hoxa7                                                    Gene3_Enhancer2                          Hoxa7

and the other way around, aggregate(dat,by=list(dat$Enhancer),paste) gives you

         Group.1                                                           Enhancer                       TF
1 Gene1_Enhancer1                  Gene1_Enhancer1, Gene1_Enhancer1, Gene1_Enhancer1       Arid3a, Hoxa4, EBP
2 Gene1_Enhancer2                                                    Gene1_Enhancer2                     ETS1
3 Gene1_Enhnacer1                                                    Gene1_Enhnacer1                    Ascl2
4 Gene2_Enhancer1 Gene2_Enhancer1, Gene2_Enhancer1, Gene2_Enhancer1, Gene2_Enhancer1 ETS1, EBP, Arid3a, Hoxa4
5 Gene3_Enhancer1                  Gene3_Enhancer1, Gene3_Enhancer1, Gene3_Enhancer1       Arid3a, Hoxa4, EBP
6 Gene3_Enhancer2                                                    Gene3_Enhancer2                    Hoxa7
7 Gene4_Enhancer1                  Gene4_Enhancer1, Gene4_Enhancer1, Gene4_Enhancer1       Hoxa4, EBP, Arid3a

Edit: I was wondering why the 1st category, Gene1_Enhancer1, was not merged with the 3rd category, which is supposed to be the same, and found a typo in the initial data you posted: Enhnacer should be Enhancer. Correct it to avoid mismatches.

Cheers


Thanks a lot. I tried this. It works well and finds the TFs common to all enhancers. I'm sorry, I probably didn't make my question clear. I have many enhancers for each gene, say 45 per gene. I want to find groups of TFs that are present in groups of enhancers from all genes. For example, apart from the group in the example above, there may be another group of enhancers within this huge set that shares entirely different TFs. The enhancers in that group are still similar to each other and therefore interesting to me. So I want all of these different groups of enhancers with common TFs, in addition to the TFs common to the entire set of enhancers, which this function gives me. Is there any way to use this function for that? Thanks a lot!
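One way to read this request is: for every set of TFs shared by at least two enhancers, list all enhancers that carry that whole set. A minimal sketch of that idea follows, in Python rather than with aggregate. The function name shared_tf_groups and the min_tfs and min_enhancers thresholds are made up for illustration, and it assumes the data has already been read into (enhancer, TF) pairs with the Enhnacer typo corrected. It compares every pair of enhancers, so the cost grows quadratically with the number of enhancers.

```python
from collections import defaultdict
from itertools import combinations


def shared_tf_groups(pairs, min_tfs=2, min_enhancers=2):
    """Return {frozenset of shared TFs: sorted list of enhancers carrying all of them}.

    pairs: iterable of (enhancer, tf) tuples.
    min_tfs / min_enhancers: illustrative thresholds, not from the thread.
    """
    # Collect the TF set of each enhancer.
    tfs_by_enhancer = defaultdict(set)
    for enhancer, tf in pairs:
        tfs_by_enhancer[enhancer].add(tf)

    # Candidate TF sets are the intersections of every pair of enhancers.
    candidates = set()
    for a, b in combinations(tfs_by_enhancer, 2):
        shared = frozenset(tfs_by_enhancer[a] & tfs_by_enhancer[b])
        if len(shared) >= min_tfs:
            candidates.add(shared)

    # For each candidate, gather every enhancer that contains the whole set.
    groups = {}
    for shared in candidates:
        members = sorted(
            e for e, tfs in tfs_by_enhancer.items() if shared <= tfs
        )
        if len(members) >= min_enhancers:
            groups[shared] = members
    return groups
```

On the example data this yields a single group: Arid3a, EBP and Hoxa4 shared by Gene1_Enhancer1, Gene2_Enhancer1, Gene3_Enhancer1 and Gene4_Enhancer1. Requiring one enhancer per gene would need an extra filter on the gene prefix of each name.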

11.8 years ago
brentp 24k

You can use a disjoint set for this. There's a nice implementation in Python here: http://code.activestate.com/recipes/387776/

Essentially, you'll do

import sys
g = Grouper()  # class from the recipe linked above
for line in open(sys.argv[1]):
    enhancer, tf = line.split()  # 2nd column is the TF name
    g.join(enhancer, tf)

The result will not be exactly what you want, but it is a fairly efficient structure, so you can check whether two names ended up in the same group:

g.joined('EBP', 'Hoxa7')

You can also iterate over the Grouper object to get all the joined objects:

for group in g:
    # extract shared TFs here
    ...

To do that, it will probably be easiest if you have a way to distinguish an enhancer name from a TF name based solely on the string, since Grouper loses that distinction.
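The disjoint-set idea can be sketched without the recipe. The DisjointSet class below is a small hypothetical stand-in written for illustration, not the recipe's Grouper code, and components is an invented helper name. Each name is tagged as ("enhancer", name) or ("tf", name) before joining, which is one possible way to keep the distinction that would otherwise be lost.

```python
from collections import defaultdict


class DisjointSet:
    """Minimal union-find; a stand-in for the recipe's Grouper, not its code."""

    def __init__(self):
        self.parent = {}

    def find(self, item):
        # Register unseen items as their own root.
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def join(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def groups(self):
        members = defaultdict(set)
        for item in list(self.parent):
            members[self.find(item)].add(item)
        return list(members.values())


def components(pairs):
    """Return sorted (enhancers, tfs) tuples, one per connected group."""
    ds = DisjointSet()
    for enhancer, tf in pairs:
        # Tag each name so enhancers and TFs can be separated afterwards.
        ds.join(("enhancer", enhancer), ("tf", tf))
    result = []
    for group in ds.groups():
        enhancers = sorted(n for kind, n in group if kind == "enhancer")
        tfs = sorted(n for kind, n in group if kind == "tf")
        result.append((enhancers, tfs))
    return sorted(result)
```

Note that a group here is a connected component: two enhancers land together if any chain of shared TFs links them, so the TFs listed for a group are not necessarily present in every enhancer of that group. On the example data, ETS1 alone is enough to pull Gene1_Enhancer2 into the same group as Gene2_Enhancer1.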
