More than 300,000 peaks for a viral transcription factor - what could this mean?
9.4 years ago
ChIP-Tease ▴ 30

Hello everybody,

I'm a PhD student and I'm working with a viral transcription factor. My task is to find out where the transcription factor binds in the human genome and what it does there. The idea was that it would bind near promoters and activate human genes, since this is its job in the viral genome.

What I can see after ChIP-Seq is that it seems to bind more or less everywhere. With MACS2 default settings there are more than 300,000 peaks, and they are also visible in IGV (see the picture at the end of the post).

When I analyse the peaks with MEME/DREME, I find binding motifs in up to 90 % of the peaks. These motifs match the binding sites in the virus, where binding of the transcription factor was confirmed by EMSAs (electrophoretic mobility shift assays).

Looking in IGV, I cannot see an exclusive association with transcriptional start sites, exons, or any other feature.

RNA-Seq revealed that upon expression of the viral transcription factor, cellular genes are upregulated (as expected from its job in the viral genome) as well as downregulated (a new feature). After 6 h the strongest upregulation is about 10-fold, and the strongest downregulation is to 5 % of the original expression. Only about 10 genes are strongly upregulated and about 10 strongly downregulated, which was rather strange, since the complete cell is reprogrammed.

My question is whether anybody knows of other transcription factors with about 300,000 binding sites, and whether people then just pick the sites that best fit their concepts. And maybe somebody has an idea what this transcription factor could be doing with that many binding sites? I have some ideas but don't want to write them down right now, so as not to push the discussion in a particular direction. I'd be happy for any input!

By the way, does anybody know how I can load RefSeq genes into IGV and see the gene names instead of the NM numbers?

Thanks a lot, Alex

About the picture:

  • 1st row: ChIP input
  • 2nd row: ChIP transcription factor
  • 3rd row: MACS2 peaks
  • 4th row: RNA-Seq 0 h
  • 5th row: RNA-Seq 3 h
  • 6th row: RNA-Seq 6 h
  • 7th row: RefSeqGenes
  • 8th row: knownGeneTSS (transcriptional start sites)

Example_for_Experts

ChIP-Seq RNA-Seq • 4.2k views
ADD COMMENT

Can you share what virus this is?

ADD REPLY

Hello,

It is Epstein-Barr virus.

ADD REPLY

Hello ChIP-Tease!

It appears that your post has been cross-posted to another site: http://seqanswers.com/forums/showthread.php?t=60948

This is typically not recommended as it runs the risk of annoying people in both communities.

ADD REPLY

Hey Pierre,

Thanks a lot for the hint. I thought it might help since these are independent forums, but you are surely right: a lot of people will be in both forums. I will add the link to the top of the post in the other forum! Sorry, I'm completely new to forums. Have a nice day!

ADD REPLY

You said you "expressed the TF": was this in transfected cells, or were the cells infected with the virus? Do you have a control in your RNA-Seq experiments (e.g. cells transfected with a non-TF EBV protein)? It would be important to know whether the DEGs you've found respond to the TF specifically and not to the transfection itself.

It looks like the TF is binding the genome indiscriminately and may be capable of exerting some regulatory effect on the cells. Do you have peaks in any regions corresponding to the DEGs you've identified? You could validate the binding of the TF to those genes with EMSA.

I guess the final check would be to see if these effects hold up in EBV infected cells.

ADD REPLY
9.4 years ago

Hi,

Just a couple of thoughts...

re: 300,000 binding sites --> peaks don't necessarily equal functional binding sites. They may also indicate generally open chromatin or active transcription (see Teytelman et al., 2013 or Jain et al., 2015), or perhaps your antibody is cross-reacting with one or more other factors. The fact that you find the DNA motif described for the factor suggests that this is a fairly widespread (perhaps very short?) motif. But maybe the factor really is binding everywhere, and its true function is revealed only when it happens to hit a promoter.

re: other factors with that many binding sites --> I think similar numbers have been reported for other TFs, e.g. CTCF and myc, if I remember correctly. That doesn't mean all of these sites are functional, either (see above).

re: gene display in IGV --> most likely, IGV is displaying whatever information is in the 4th column, so just make sure the 4th column of your BED file contains the name or symbol you'd like to see.
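
As a rough sketch of the 4th-column suggestion: the snippet below rewrites the name column of BED lines from RefSeq accessions to gene symbols. The function name `rename_bed_features`, the toy coordinates, and the accession-to-symbol dictionary are all illustrative assumptions; in practice the mapping would come from an annotation table you trust.

```python
def rename_bed_features(bed_lines, id_to_symbol):
    """Replace the 4th BED column (name) with a gene symbol where one is known."""
    renamed = []
    for line in bed_lines:
        line = line.rstrip("\n")
        # Pass blank lines, comments and track/browser headers through unchanged.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            renamed.append(line)
            continue
        fields = line.split("\t")
        if len(fields) >= 4:
            # Drop a version suffix such as ".6" before looking up the accession.
            accession = fields[3].split(".")[0]
            fields[3] = id_to_symbol.get(accession, fields[3])
        renamed.append("\t".join(fields))
    return renamed


# Toy input: coordinates are made up, the mapping is a hand-written example.
example_bed = [
    "chr17\t100\t200\tNM_000546.6\t0\t-",
    "chr8\t300\t400\tNM_002467\t0\t+",
]
example_map = {"NM_000546": "TP53", "NM_002467": "MYC"}
for row in rename_bed_features(example_bed, example_map):
    print(row)
```

Accessions that are missing from the mapping are left as they are, so the output stays a valid BED file either way.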

ADD COMMENT

Hey Frederike,

Thanks a lot for your thoughts and the papers!

I would rather exclude cross-reactivity of the antibody, since the binding motif shows up in most of the peaks (up to 90 %). Nevertheless, this might also be caused by the motif's frequent occurrence: if you split the human genome into windows of 300 bases and search for the motif, you find it in nearly 50 % of them. The motif is 7 bases long.
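
To put the 50 % figure in context, here is a back-of-the-envelope estimate of how often a 7-base motif would turn up in a 300-base window by chance. It is only a sketch under strong assumptions that are not from this thread: equally frequent, independent bases, both strands searched, and overlapping positions treated as independent. The function name and the `matching_kmers` parameter (how many distinct 7-mers count as a motif match) are hypothetical.

```python
def prob_window_has_motif(window_len, motif_len, matching_kmers=1, both_strands=True):
    """Chance that a random window contains at least one motif match.

    Assumes independent, equally frequent bases and treats overlapping
    positions as independent, so this is only a rough estimate.
    """
    p_site = matching_kmers / 4 ** motif_len          # match probability per position
    positions = window_len - motif_len + 1            # possible start positions
    trials = positions * (2 if both_strands else 1)   # search both strands if requested
    return 1.0 - (1.0 - p_site) ** trials


exact = prob_window_has_motif(300, 7)
degenerate = prob_window_has_motif(300, 7, matching_kmers=16)
print(f"fully specified 7-mer: {exact:.1%}")
print(f"7-mer with 16 matching variants: {degenerate:.1%}")
```

Under these assumptions a fully specified 7-mer would be expected in only a few percent of 300-base windows, so a frequency near 50 % would point to a motif that tolerates several sequence variants, or to a base composition that favours it.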

Ok, I'll check if I can find some CTCF and myc peak papers, thank you.

Thanks a lot for your input again!

ADD REPLY
9.4 years ago

Is your factor the only one described to bind to this DNA motif? Or is it one of the well known promoter motifs (e.g. INR or E-box-like?)

ADD COMMENT
7.8 years ago

Dear Mr.Alex,

I am also working on transcription factors using ChIP-Seq. I used SeqMonk (a Java-based program) to call peaks on all chromosomes. As a result, it gave me tons of peaks, as in your case. I do not know how to select the expected peaks for my target protein's promoter, and the program did not allow me to extract the underlying read sequences so that I could work out the motif (maybe I can use the MEME Suite tools). I am still at a very basic level and am just starting to learn how to write code in R. Hence, I hope that you can suggest some ideas to solve this. Thank you so much for your help.

Best regards,
Thanh Lan Chu

ADD COMMENT

Dear Thanh Lan Chu,

The first thing I would do is check whether the peaks identified by SeqMonk are also visible in a data visualization program such as IGV (Integrative Genomics Viewer, which was used for the picture posted in this thread), the UCSC browser, or something similar. If you can really see the identified peaks, you can trust them more than before.

The peak calling programs I used (MACS2, Peakzilla, HOMER) gave me the DNA sequences of the called peaks, which are required for motif identification. So either you figure out how to extract these sequences from SeqMonk, or you use a peak caller that runs on the command line (MACS2, Peakzilla, HOMER). If you know how to use the command line, you can do this on your own; otherwise you can ask a bioinformatician for help or use Galaxy (https://usegalaxy.org/ ; many programs are preinstalled there and can be used for bioinformatic analysis with almost no bioinformatics skills).
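
For illustration, once peak coordinates and a genome sequence are available, turning peaks into sequences for a motif search is mostly a matter of slicing. The sketch below assumes BED-style coordinates (0-based start, end-exclusive) and a genome already loaded as a dictionary from chromosome name to sequence string; the function name `peaks_to_fasta` and the toy data are made up. For a real genome, a dedicated tool such as `bedtools getfasta` is the more practical route.

```python
def peaks_to_fasta(peaks, genome):
    """Return FASTA text for (chrom, start, end) peaks.

    Coordinates follow the BED convention: 0-based start, end-exclusive.
    """
    return "".join(
        f">{chrom}:{start}-{end}\n{genome[chrom][start:end]}\n"
        for chrom, start, end in peaks
    )


# Toy genome and peaks, purely for demonstration.
toy_genome = {"chrT": "ACGTACGTACGGATTACA"}
toy_peaks = [("chrT", 2, 6), ("chrT", 10, 17)]
print(peaks_to_fasta(toy_peaks, toy_genome), end="")
```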

You can then continue with the MEME Suite for motif identification. I know three ways of identifying motifs with the MEME Suite. They all have drawbacks for large numbers of peaks.

  1. The MEME tool itself. If you try to find the motif within 100,000 sequences, it will probably take a few years, since MEME gets extremely slow for large numbers of sequences.
  2. DREME: It will find your motif even if you feed it several hundred thousand peak sequences. It gives very clear motifs, which look nice, but you might miss motif variants.
  3. Sort the peaks by significance, take the top 10,000, and use MEME to search for the motif. This will give you the motif of the strongest peaks, but you might miss the overall picture.
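
A minimal sketch of the third option, assuming the peaks are in MACS2's narrowPeak format (tab-separated, with signal value, -log10 p-value and -log10 q-value in columns 7 to 9). The function name `top_peaks` and the choice of the q-value column as the ranking score are assumptions for illustration; ranking by signal value or p-value would work the same way.

```python
def top_peaks(narrowpeak_lines, n=10000, score_column=8):
    """Return the n highest-scoring peaks as lists of fields.

    score_column is 0-based; in narrowPeak, 6 is the signal value,
    7 the -log10 p-value and 8 the -log10 q-value.
    """
    rows = [
        line.rstrip("\n").split("\t")
        for line in narrowpeak_lines
        if line.strip()
    ]
    rows.sort(key=lambda fields: float(fields[score_column]), reverse=True)
    return rows[:n]


# Toy narrowPeak lines with made-up values.
toy_lines = [
    "chr1\t100\t300\tpeak_1\t50\t.\t4.2\t8.1\t5.0\t90",
    "chr1\t900\t1200\tpeak_2\t200\t.\t12.5\t30.2\t25.7\t140",
    "chr2\t50\t260\tpeak_3\t80\t.\t6.0\t11.4\t9.3\t100",
]
for fields in top_peaks(toy_lines, n=2):
    print("\t".join(fields))
```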

MEME or DREME will give you the most likely motifs, which does not automatically mean that your protein binds them. You would need to do, for example, EMSAs (electrophoretic mobility shift assays) to find out whether your protein really binds the identified sequence.

If you need to find peaks that are located within annotated promoters, you can use bedtools intersect (http://bedtools.readthedocs.io/en/latest/content/tools/intersect.html), which is once again a command line tool and checks for overlaps between defined regions (e.g. promoter regions and peak regions). I guess this program, or an alternative to it, will also be available on Galaxy.
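
To show what such an overlap check does conceptually, here is a small pure-Python sketch. It is not how bedtools is implemented and it compares every peak with every gene, so it is only suitable for small inputs. The promoter definition (1,000 bases upstream and 100 bases downstream of the TSS), the function names, and the toy data are assumptions for illustration.

```python
def promoter_window(tss, strand, upstream=1000, downstream=100):
    """Return (start, end) of a promoter window around a TSS, strand-aware."""
    if strand == "+":
        return max(0, tss - upstream), tss + downstream
    return max(0, tss - downstream), tss + upstream


def peaks_in_promoters(peaks, genes, upstream=1000, downstream=100):
    """Pair each peak with every gene whose promoter window it overlaps.

    peaks: iterable of (chrom, start, end), 0-based and end-exclusive.
    genes: iterable of (chrom, tss, strand, name).
    """
    hits = []
    for chrom, tss, strand, name in genes:
        p_start, p_end = promoter_window(tss, strand, upstream, downstream)
        for peak in peaks:
            # Two half-open intervals overlap if each starts before the other ends.
            if peak[0] == chrom and peak[1] < p_end and peak[2] > p_start:
                hits.append((peak, name))
    return hits


# Toy data with made-up coordinates and gene names.
toy_genes = [("chr1", 5000, "+", "GENE_A"), ("chr1", 20000, "-", "GENE_B")]
toy_peaks = [("chr1", 4500, 4700), ("chr1", 9000, 9200), ("chr1", 20500, 20800)]
for peak, gene in peaks_in_promoters(toy_peaks, toy_genes):
    print(gene, peak)
```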

Before you try programming something yourself in R, check whether someone else has already done the job (R packages, command line tools, Galaxy, ...).

People might argue that your antibody pulls down sequences nonspecifically. If you can, try to do the ChIP in a cell line without this protein (knocked out by CRISPR/Cas9 or knocked down by RNAi) to prove that the antibody pulls down only your protein.

Hope this helps,

Best regards,
Alex

ADD REPLY

Dear Mr.Alex,

Thank you for your comprehensive support. I could finally identify various transcription factor candidates from my extracted genomic DNA using Galaxy tools. However, the resulting list is still too long to verify by wet-lab experiments. Right now I am stuck, because I do not know how to narrow down the list, even though my professor suggested that I should start by looking for metabolic pathways. Could you please give me some suggestions for how to proceed using bioinformatics alone? Thank you so much for your time and your concern.

Best regards,
Thanh Lan

ADD REPLY

Dear Thanh Lan,

Sorry for the late response, but I wasn't in the lab for the last few weeks.

Normally the output should be ranked by an E-value, which estimates how many motifs of at least this quality you would expect to find by chance.

The smaller the value, the less likely it is that the motif is there by chance, meaning it is probably there for some reason.

So you could start with the motif with the smallest E-value, check whether it makes sense in some way, and look for more evidence. A search for metabolic pathways may also provide evidence.

Hope that helps a bit, Alex

ADD REPLY
