Question

More than 300,000 peaks for a viral transcription factor - what could this mean?


9.8 years ago

ChIP-Tease ▴ 30

Hello everybody,

I'm a PhD student and I'm working with a viral transcription factor. My task is to find out where the transcription factor binds in the human genome and what it does there. The idea was that it would bind near promoters and activate human genes, since this is its job in the viral genome.

What I see after ChIP-Seq is that it seems to bind more or less everywhere. According to MACS2 with default settings, there are more than 300,000 peaks, which are also visible in IGV (see the picture at the end of the post).

When I analyse the peaks with MEME/DREME, I find binding motifs in up to 90 % of the peaks, and these match the binding sites in the virus (binding of the transcription factor at these sites was confirmed by EMSAs, i.e. electrophoretic mobility shift assays).

Checking in IGV, I cannot see an exclusive association with transcriptional start sites, exons, or anything else.

The RNA-Seq revealed that, upon expression of the viral transcription factor, cellular genes are upregulated (this is also its job in the viral genome) as well as downregulated (a new feature). After 6 h, the strongest upregulation is about 10-fold and the strongest downregulation is to 5 % of the original expression. Only about 10 genes are strongly upregulated and about 10 strongly downregulated, which was kind of strange, since the complete cell is reprogrammed.

My question now is whether anybody knows of other transcription factors with about 300,000 binding sites, and whether people just pick the sites that best fit their concepts. And maybe somebody has an idea what this transcription factor could be doing with that many binding sites? I have some ideas but don't want to write them down right now, so as not to push the discussion in a particular direction. I'd be happy for any input!

By the way, does anybody know how I can load RefSeq genes into IGV and see the gene names instead of the NM numbers?

Thanks a lot, Alex

About the picture:

1st row: ChIP input
2nd row: ChIP transcription factor
3rd row: MACS2 peaks
4th row: RNA-Seq 0 h
5th row: RNA-Seq 3 h
6th row: RNA-Seq 6 h
7th row: RefSeqGenes
8th row: knownGeneTSS (transcriptional start sites)

Example_for_Experts

ChIP-Seq RNA-Seq • 4.6k views

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 9.8 years ago by ChIP-Tease ▴ 30


Can you share what virus this is?

ADD REPLY • link 9.8 years ago by pld 5.1k


Hello,

It is Epstein-Barr virus.

ADD REPLY • link 9.8 years ago by ChIP-Tease ▴ 30


Hello ChIP-Tease!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=60948

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY • link 9.8 years ago by Pierre Lindenbaum 165k


Hey Pierre,

thanks a lot for the hint. I thought it might help, since these are independent forums. But you are certainly right: a lot of people will be in both forums. I will add the link to the top of the post in the other forum! Sorry, I'm completely new to forums. Have a nice day!

ADD REPLY • link 9.8 years ago by ChIP-Tease ▴ 30


You said "expressed the TF": was this in transfected cells, or were they infected with virus? Do you have a control in your RNA-Seq experiments (e.g. cells transfected with a non-TF EBV protein)? It would be important to know whether the DEGs you've found are a response to the TF specifically and not a response to transfection.

It looks like the TF is binding the genome indiscriminately and may be capable of exerting some regulatory effect on the cells. Do you have peaks in any regions corresponding to the DEGs you've identified? You could validate the binding of the TF to those genes with EMSA.

I guess the final check would be to see if these effects hold up in EBV infected cells.

ADD REPLY • link 8.2 years ago by pld 5.1k

Ram · Answer 1 · 2015-07-02

Hi,

Just a couple of thoughts...

re: 300,000 binding sites --> peaks don't necessarily equal functional binding sites; they may also be an indication of generally open chromatin/active transcription (see Teytelman et al., 2013 or Jain et al., 2015) or, perhaps, your antibody is cross-reacting with (an)other factor(s). The fact that you find the DNA motif described for the factor suggests that this is a fairly widespread (perhaps very short?) motif - but maybe the factor really is binding everywhere, and its true function is revealed only when it happens to hit a promoter.

re: other factors with that many binding sites --> I think similar numbers have been reported for other TFs, e.g. CTCF and Myc, if I remember correctly - which doesn't mean all of these sites are functional, either (see above)

re: gene display in IGV --> most likely, IGV is displaying whatever information is in the 4th column, so just make sure your BED file contains whatever name/symbol/anything you'd like to see in the 4th column
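A minimal sketch of the idea above, assuming you already have a lookup from RefSeq IDs to gene symbols: it rewrites the 4th (name) column of BED lines so that a viewer reading that column shows symbols. The function name `relabel_bed_names`, the example rows, and the mapping are hypothetical and only for illustration.

```python
def relabel_bed_names(bed_lines, id_to_symbol):
    """Replace the 4th BED column (name) with a gene symbol when one is known."""
    relabeled = []
    for line in bed_lines:
        line = line.rstrip("\n")
        # Pass empty lines and header lines through unchanged.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            relabeled.append(line)
            continue
        fields = line.split("\t")
        if len(fields) >= 4:
            # Keep the original ID if no symbol is available for it.
            fields[3] = id_to_symbol.get(fields[3], fields[3])
        relabeled.append("\t".join(fields))
    return relabeled


# Hypothetical rows and mapping, made up for illustration only.
example_bed = [
    "chr1\t1000\t5000\tNM_000001\t0\t+",
    "chr2\t200\t900\tNM_000002\t0\t-",
]
example_symbols = {"NM_000001": "GENE_A"}

for row in relabel_bed_names(example_bed, example_symbols):
    print(row)
```

IDs without a known symbol are left as they are, so no feature loses its label.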

Ram · Answer 2 · 2015-07-03


9.8 years ago

Friederike 9.0k

Is your factor the only one described to bind to this DNA motif? Or is it one of the well-known promoter motifs (e.g. INR or E-box-like)?

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 9.8 years ago by Friederike 9.0k

Ram · Answer 3 · 2017-02-10


8.2 years ago

chuthanhlan92 • 0

Dear Mr. Alex,

I am also working on transcription factors using ChIP-seq. I used SeqMonk (a Java-based program) to call peaks on all chromosomes, and, as in your case, it gave me a very large number of peaks. I do not know how to select the expected peaks in the promoters targeted by my protein, and the program did not let me extract the underlying sequences so that I could work out the motif (maybe I can use the MEME Suite tools). I am still at a very basic level and have just started learning to write code in R, so I hope you can suggest some ideas for solving this. Thank you so much for your concern.

Best regards,
Thanh Lan Chu

ADD COMMENT • link updated 2.3 years ago by Ram 45k • written 8.2 years ago by chuthanhlan92 • 0


Dear Thanh Lan Chu,

The first thing I would do is check whether you can also see the peaks identified by SeqMonk in a data visualization program such as IGV (Integrative Genomics Viewer, which was used for the picture posted in this thread), the UCSC browser, or something similar. If you can really see the identified peaks, you can trust them more than before.

The peak-calling programs I used (MACS2, Peakzilla, HOMER) gave me the peak regions from which I obtained the DNA sequences. These sequences are required to go on with motif identification. So either you will have to figure out how to extract these sequences from SeqMonk, or you should use a peak caller that runs on the command line (MACS2, Peakzilla, HOMER). If you know how to use the command line, you can do this on your own; otherwise you can ask a bioinformatician for help or use Galaxy (https://usegalaxy.org/; many programs are preinstalled there and can be used for analysis with almost no bioinformatics skills).
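To illustrate what "getting the sequences for the peaks" means, here is a minimal sketch that assumes the genome is already in memory as a dictionary of chromosome name to sequence and that peaks use BED-style coordinates (0-based start, end not included). The function name `peaks_to_fasta` and the toy genome are hypothetical; for a real genome, an existing tool such as bedtools getfasta is the more practical route.

```python
def peaks_to_fasta(peaks, genome):
    """Return FASTA text with one record per peak.

    peaks  : iterable of (chrom, start, end), BED-style coordinates
    genome : dict mapping chromosome name to its sequence string
    """
    records = []
    for chrom, start, end in peaks:
        # Python slicing matches BED semantics: start included, end excluded.
        sequence = genome[chrom][start:end]
        records.append(f">{chrom}:{start}-{end}\n{sequence}")
    return "\n".join(records)


# Toy data, made up for illustration only.
toy_genome = {"chrT": "ACGTACGTAC"}
toy_peaks = [("chrT", 2, 6)]

print(peaks_to_fasta(toy_peaks, toy_genome))
```

The resulting FASTA text is the kind of input that motif finders such as MEME or DREME expect.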

You can then go on to motif identification with the MEME Suite. I know three ways of finding motifs with it. They all have drawbacks for large numbers of peaks.

1. MEME itself: If you try to find the motif within 100,000 sequences, it will probably take a few years, since it gets extremely slow for large numbers of sequences.
2. DREME: It will find your motif even if you feed it several hundred thousand peak sequences. It gives very clear motifs, which look nice, but you might miss motif variants.
3. MEME on the top peaks: Sort the peaks by significance, take the top 10,000, and use MEME to search for the motif. This will give you the motif of the strongest peaks, but you might miss the overall picture.
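The "sort and take the top peaks" option can be sketched as follows. This assumes MACS2 narrowPeak lines, in which the 9th column (index 8) holds the -log10 q-value; the function name `top_peaks` and the example rows are hypothetical, and you may prefer to rank by a different column such as the signal value.

```python
def top_peaks(narrowpeak_lines, n=10000, score_column=8):
    """Return the n peak lines with the highest value in score_column (0-based).

    Assumption: tab-separated narrowPeak lines where column index 8 is the
    -log10 q-value, so larger values mean more significant peaks.
    """
    rows = [line.rstrip("\n").split("\t") for line in narrowpeak_lines if line.strip()]
    rows.sort(key=lambda fields: float(fields[score_column]), reverse=True)
    return ["\t".join(fields) for fields in rows[:n]]


# Made-up example rows for illustration only.
example_peaks = [
    "chr1\t10\t20\tpeak_a\t50\t.\t5.0\t8.0\t3.0\t5",
    "chr1\t30\t40\tpeak_b\t90\t.\t9.0\t20.0\t12.5\t4",
    "chr2\t10\t20\tpeak_c\t70\t.\t7.0\t12.0\t6.1\t6",
]

for line in top_peaks(example_peaks, n=2):
    print(line)
```

The selected peaks would then be converted to sequences and passed to MEME.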

MEME or DREME will give you the most likely motifs, which does not automatically mean that your protein binds them. You would need to do, for example, EMSAs (electrophoretic mobility shift assays) to find out whether your protein really binds the identified sequence.

If you need to find peaks located within annotated promoters, you can use bedtools intersect (http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html), which is once again a command-line tool; it checks for overlaps between two sets of genomic regions (e.g. promoter regions and peak regions). I guess there will also be an alternative to this program on Galaxy.
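To make the overlap idea concrete, here is a minimal pure-Python sketch. It is not how bedtools is implemented; it only shows the logic of keeping each peak that overlaps at least one promoter, roughly comparable to what `bedtools intersect -u` reports. The function names and the example intervals are hypothetical, and coordinates are assumed to be BED-style (0-based start, end not included).

```python
def intervals_overlap(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def peaks_in_promoters(peaks, promoters):
    """Return the peaks that overlap at least one promoter on the same chromosome.

    peaks, promoters : lists of (chrom, start, end) tuples
    """
    # Group promoters by chromosome so each peak is compared only to its own chromosome.
    promoters_by_chrom = {}
    for chrom, start, end in promoters:
        promoters_by_chrom.setdefault(chrom, []).append((start, end))

    hits = []
    for chrom, start, end in peaks:
        candidates = promoters_by_chrom.get(chrom, [])
        if any(intervals_overlap(start, end, p_start, p_end) for p_start, p_end in candidates):
            hits.append((chrom, start, end))
    return hits


# Made-up intervals for illustration only.
example_peaks = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
example_promoters = [("chr1", 150, 300)]

print(peaks_in_promoters(example_peaks, example_promoters))
```

This simple loop compares every peak with every promoter on its chromosome, which is fine for a sketch but slow for hundreds of thousands of peaks; dedicated tools use sorted or indexed intervals instead.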

Before you try programming something yourself in R, check whether someone else has already done the job (R packages, command-line tools, Galaxy, ...).

People might argue that your antibody pulls down sequences unspecifically. If you can, try to do the ChIP in a cell line without this protein (knocked out by CRISPR/Cas9 or knocked down by RNAi) to prove that the antibody pulls down only your protein.

Hope this helps,

Best regards,
Alex

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 8.2 years ago by ChIP-Tease ▴ 30


Dear Mr. Alex,

Thank you for your comprehensive support. Using Galaxy tools, I was finally able to find various transcription factor candidates from my extracted genomic DNA. However, the resulting list is still too long to verify by wet-lab experiments. Right now I am stuck, because I do not know how to narrow down this list, even though my professor suggested that I should start by looking for metabolic pathways. Could you please give me some suggestions for narrowing it down using bioinformatics alone? Thank you so much for your time and your concern.

Best regards,
Thanh Lan

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 8.1 years ago by chuthanhlan92 • 0


Dear Thanh Lan,

Sorry for the late response, but I wasn't in the lab for the last few weeks.

Normally the output should be ranked by an E-value, which estimates how often a motif at least this good would be expected to appear by chance.

The smaller the value, the less likely it is that the motif is there by chance, meaning it is probably there for some reason.

So you could start with the motif with the smallest E-value, check whether it makes sense in some way, and look for more evidence. Searching for metabolic pathways may also provide evidence.
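A minimal sketch of this prioritisation, assuming the motif results have already been read into (motif, E-value) pairs. The function name `rank_motifs`, the example motifs, and the cutoff of 0.05 are hypothetical choices for illustration, not values taken from any tool.

```python
def rank_motifs(motifs, max_evalue=0.05):
    """Keep motifs with an E-value at or below max_evalue, smallest first.

    motifs     : list of (motif, evalue) tuples
    max_evalue : assumed cutoff; adjust it to your own data
    """
    kept = [motif for motif in motifs if motif[1] <= max_evalue]
    return sorted(kept, key=lambda motif: motif[1])


# Made-up motifs and E-values for illustration only.
example_motifs = [
    ("ACGTGA", 1.2e-5),
    ("TTTTAA", 3.4),
    ("GGCGCC", 8.0e-40),
]

for motif, evalue in rank_motifs(example_motifs):
    print(motif, evalue)
```

The first entries of the ranked list are the candidates to check against the literature or to test experimentally first.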

Hope that helps a bit, Alex

ADD REPLY • link updated 2.3 years ago by Ram 45k • written 8.0 years ago by ChIP-Tease ▴ 30