Question

Open Question For Bioinformaticians .... Need Some Ideas

2

12.3 years ago

Dataminer ★ 2.8k

Hi!

Okay this is an open question for getting some ideas.

Imagine you have ChIP-seq data on some disease, say Type 2 diabetes (strictly for the sake of example), and the data come from 3 different labs, which means three different datasets on 3 different cell types.

(The ChIP-seq data measure binding of a transcription factor X in the diseased state.)

As a bioinformatician, what would you first aim to get out of this data? I mean in terms of your goal(s).

Of course you will look for the regions that are conserved across the cell lines and the regions that are unique to each one.

But what would your generic plan be to get the most out of a dataset like this?

All suggestions are welcome.

Thank you

chip-seq • 3.6k views

ADD COMMENT • link updated 4.5 years ago by Biostar 20 • written 12.3 years ago by Dataminer ★ 2.8k

6

If somebody came to me with a dataset, or 3 datasets, on which they had performed ChIP (so they did not only collect DNA but also knew what antibody to order so what transcription factor they were interested in) without a clear goal... I would first of all doubt their sanity. They must have an idea what they want to know before they start those experiments. Maybe I get your question all wrong, but for me it doesn't make sense. There must already be a goal.

ADD REPLY • link 12.3 years ago by Chris Evelo 10k

2

Hi Chris, the experiments were performed in three different labs with three different goals. This resulted in genome-wide DNA binding data for the same TF in three different cell lines. Now, if such data is publicly available, what would you do with it?

ADD REPLY • link 12.3 years ago by Dataminer ★ 2.8k

0

I agree with Chris, it is still insane. Think about the questions FIRST, then look for the publicly available data that might answer these questions. Not the other way around. If you don't have ideas, read. What is your expertise? Biology? Computer science? However, it is unlikely that you will think of a truly interesting question out of nowhere. Find a project that already has clear questions first. If you are lucky, interesting questions will arise. And then you might want to use already available data to answer them.

ADD REPLY • link 12.3 years ago by Stefano Berri 4.4k

score 3 · Answer 1 · 2012-08-23

Basic things you'll need to do before you do any more in depth analysis:

What known or predicted annotations are the TFs binding to?
How were the 3 experiments conducted? What kind of normalization methods will you need to employ?
Are the 3 cell lines from the same diabetic? What are the possible confounding factors that might give you false positive binding?
Do you have a normal, non-diabetic sample to compare to?
Gene ontology terms for the genes the TF binds to. Pathways that the genes belong to (KEGG).
What is already known biologically about the TF? Are you seeing it reflected in your data?
What is known biologically about these 3 cell types in diabetics? Are the TFs contributing to these observations?
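
For the GO/KEGG point above, the usual over-representation question is whether the genes bound by the TF contain more members of a term or pathway than expected by chance. Here is a minimal sketch of a one-sided hypergeometric test in plain Python. The function name and all the counts are made up for illustration, and a real analysis would also need a sensible gene universe and multiple-testing correction across terms.

```python
from math import comb


def enrichment_p_value(n_universe, n_annotated, n_targets, n_overlap):
    """One-sided hypergeometric p-value.

    Probability of seeing at least `n_overlap` annotated genes when
    `n_targets` genes are drawn without replacement from a universe of
    `n_universe` genes, of which `n_annotated` carry the annotation.
    """
    total = comb(n_universe, n_targets)
    upper = min(n_annotated, n_targets)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_targets - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 20000 genes in the universe, 150 in the pathway,
# 400 genes near TF peaks, 12 of those in the pathway.
# By chance alone about 400 * 150 / 20000 = 3 would be expected.
p = enrichment_p_value(20000, 150, 400, 12)
print(f"p = {p:.3g}")
```

This only covers a single term. Dedicated GO/KEGG enrichment tools run the same kind of test over all terms and adjust the p-values.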

score 3 · Answer 2 · 2012-08-23

I agree with Dk's great answer. Start with understanding the dataset(s) and ALWAYS try to get the raw data.

Just to add to Chris' comments, one must start with a set of hypotheses to test. Even if the data are meant for hypothesis generation, one must still have a direction. In your question, one hypothesis is that there are binding sites conserved across labs/experiments. Test that. Another hypothesis you propose is that there are regions that are unique to labs/experiments. Test that. One could hypothesize that TF motifs are enriched in the conserved sites. Test that. One could hypothesize that the conserved sites are close to genes that are involved in glucose metabolism. Test that. The point is that one needs to understand the biology "to get the most out from a dataset like this".
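
As a rough illustration of the first two hypotheses (sites shared across labs versus sites unique to one lab), here is a minimal sketch in plain Python. It assumes peaks are BED-style `(chrom, start, end)` tuples with half-open coordinates, and it treats any overlap of at least 1 bp as "the same site". The lab names, coordinates, and function names are invented for the example, and the pairwise comparison is only practical for small peak lists.

```python
def overlaps(a, b):
    """True if two half-open (chrom, start, end) intervals share at least 1 bp."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def classify_peaks(peaks_by_lab):
    """For each lab, split peaks into those seen in every other lab
    ("shared") and those seen in no other lab ("unique").
    Peaks seen in some but not all other labs fall into neither list."""
    result = {}
    for lab, peaks in peaks_by_lab.items():
        others = [v for k, v in peaks_by_lab.items() if k != lab]
        shared, unique = [], []
        for peak in peaks:
            hits = [any(overlaps(peak, q) for q in other) for other in others]
            if all(hits):
                shared.append(peak)
            elif not any(hits):
                unique.append(peak)
        result[lab] = {"shared": shared, "unique": unique}
    return result


# Toy peak calls from three hypothetical labs.
peaks_by_lab = {
    "lab_a": [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 150)],
    "lab_b": [("chr1", 150, 250), ("chr1", 900, 1000)],
    "lab_c": [("chr1", 180, 220), ("chr2", 100, 200)],
}

result = classify_peaks(peaks_by_lab)
for lab, groups in result.items():
    print(lab, "shared:", groups["shared"], "unique:", groups["unique"])
```

With real peak files an interval tool or a sorted sweep would be faster, and since each lab processed its data differently, the overlap criterion itself is worth questioning before reading anything into the shared set.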

score 2 · Answer 3 · 2012-08-23

The phrase "ChIP sequencing data on some disease say Type 2 diabetes" doesn't make sense to me. It depends on what the data is about:

Are you pulling down a protein/factor in different conditions and cell lines and then trying to compare the enrichment?
Are you trying to compare the enrichment of marks (H3K4, H3K27, H3K36, H3K27ac) under normal and diseased conditions?

If it's about a protein being pulled down in three different labs on 3 different cell types, I would look at:

How similar the profiles are [in terms of where the protein binds: promoters, polyA sites, gene bodies, intergenic regions]
How different the profiles are
Whether they are enriched at different motifs
The number of peaks in each dataset and what kind of genes they bind [GO analysis]
Their intersections/overlap with the marks [H3K4, H3K27 etc.]
If a knockout cell line is available, going a little deeper by asking whether the target genes of this protein are expressed as well [combining ChIP-Seq and RNA-Seq]
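
To make the first point concrete, here is a minimal sketch in plain Python that labels each peak as promoter, gene body, or intergenic and counts the labels, so the three datasets can be compared by where they bind. Everything here is an assumption for illustration: genes are `(chrom, start, end, strand)` tuples in half-open coordinates, a "promoter" is anything with its peak midpoint within 1000 bp of the TSS, and the gene and peak coordinates are made up.

```python
from collections import Counter


def classify_peak(peak, genes, promoter_window=1000):
    """Label a (chrom, start, end) peak by the position of its midpoint."""
    chrom, start, end = peak
    mid = (start + end) // 2
    label = "intergenic"
    for g_chrom, g_start, g_end, strand in genes:
        if g_chrom != chrom:
            continue
        # TSS is the first base for + strand genes, the last base for - strand.
        tss = g_start if strand == "+" else g_end - 1
        if abs(mid - tss) <= promoter_window:
            return "promoter"
        if g_start <= mid < g_end:
            label = "gene_body"
    return label


def binding_profile(peaks, genes):
    """Count peaks per category for one dataset."""
    return Counter(classify_peak(p, genes) for p in peaks)


# Toy annotation and peaks (hypothetical coordinates).
genes = [("chr1", 10000, 20000, "+"), ("chr1", 50000, 60000, "-")]
peaks = [
    ("chr1", 9800, 10200),   # midpoint on the + strand TSS
    ("chr1", 15000, 15200),  # inside the first gene
    ("chr1", 30000, 30200),  # between the genes
    ("chr1", 59500, 60500),  # midpoint near the - strand TSS
]

profile = binding_profile(peaks, genes)
print(dict(profile))
```

Running this once per lab gives three small count tables that can be compared directly. Peak annotation packages do the same thing with proper gene models and more categories.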

These are the points that come to mind for now.

Also, check out the R package DiffBind, which, as the name says, does differential binding analysis of ChIP-Seq peak data.

Cheers