I am new to this field; I come from computer science. I have taken many peak files from the ENCODE ChIP-seq experiment matrix and annotated them. I chose the promoters, upstream of each gene within distance (5000, 1000) in the file, and now I have a file with gene names and entrez_ids. Is every line of this file a binding site of the specific TF?
Transcription factors can also bind downstream of promoter regions, for example in introns. If you selected genes irrespective of where the protein was bound and then extracted only the promoter regions, most likely many of the sequences you extracted will not contain a binding site for the protein.
Every line of a peak file is supposed to be a binding event of that protein (or of whatever molecule was targeted by the antibody). The main problem with ChIP-seq peaks is that the technology is not very precise about the exact location of the binding event. So you can say that the specific protein is bound within that interval, but I wouldn't be so sure about the specific location (you need validation). You can also check ChIP-exo, which is very similar to ChIP-seq but localizes binding more precisely.
As in any experiment in science, there is also a chance of false positives. You might want to keep that in mind as well.
Thanks a lot for your immediate answer, you are very helpful! Can I ask you something else, if you happen to know: what validation can I do on the annotated ChIP-seq peak files I have, and what kind of experiments would be of biological significance? Having found the promoters bound by a specific TF in a specific cell line, for example the colon cancer cell line CACO2, how can I evaluate these findings?
You can knock down the protein and see how many of the genes it was bound to are affected by the knockdown. You can also try to identify the cognate site for the protein and test whether the binding is direct with assays such as EMSA.
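The comparison between bound genes and knockdown-affected genes can also be done on a computer if a suitable expression dataset already exists. Below is a minimal Python sketch, assuming you already have a set of bound genes and a set of genes that changed after knockdown. All gene names and numbers are made up for illustration, and `hypergeom_overlap_pvalue` is a hypothetical helper, not a function from any package. It asks how likely an overlap at least this large would be by chance (a one-sided hypergeometric test).

```python
from math import comb

def hypergeom_overlap_pvalue(n_universe, n_bound, n_changed, n_both):
    """P(overlap >= n_both) if n_changed genes were drawn at random from
    a universe of n_universe genes, n_bound of which are bound by the TF."""
    upper = min(n_bound, n_changed)
    total = comb(n_universe, n_changed)
    tail = sum(
        comb(n_bound, k) * comb(n_universe - n_bound, n_changed - k)
        for k in range(n_both, upper + 1)
    )
    return tail / total

# Hypothetical gene sets, for illustration only.
universe = {f"GENE{i}" for i in range(1, 21)}            # 20 genes assayed
bound = {"GENE1", "GENE2", "GENE3", "GENE4", "GENE5"}    # genes with a peak
changed = {"GENE1", "GENE2", "GENE3", "GENE10"}          # genes changed on knockdown

both = bound & changed
p_value = hypergeom_overlap_pvalue(
    len(universe), len(bound), len(changed), len(both)
)
print(f"{len(both)} shared genes, one-sided p = {p_value:.4f}")
```

The choice of universe matters: it should be the set of genes that could have appeared in both lists (for example, genes that were both annotated and measured), not simply every gene in the genome.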
Thanks again for your reply, Satya, but I can't knock down a protein; my validation has to be strictly programmatic, as I am not in a real lab. I created an app in R that, after annotating and filtering the peaks, takes the promoters near the TSS from files I download programmatically from the ENCODE ChIP-seq experiment matrix. I need a validation for what I am doing but don't know what that would be...
In general, each line in a file of ChIP-seq peaks represents a region in which the signal is enriched in the ChIPped sample with respect to the input sample (making the general assumption that the experiment included input samples and that peaks were called taking advantage of this).
It's not correct to say that each peak is a transcription factor binding site. Even with 100% specificity, where you could assume that each peak contains a TFBS, part of each peak would not be part of the binding site (a typical TFBS is much shorter than a typical ChIP-seq peak; see also this post for some more insight: https://www.biostars.org/p/163205/). Besides this, a portion of the peaks will most likely be nonspecific, even if you filtered on FDR and used the input to assess enrichment.
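To make the size difference concrete, here is a minimal Python sketch that scans a made-up peak sequence for a made-up consensus motif and reports how much of the peak the match covers. The sequence, the motif, and the helper name `find_motif_hits` are all hypothetical, and the toy peak is far shorter than a real one. A real analysis would more likely use a position weight matrix and a dedicated motif scanner than an exact consensus match.

```python
import re

# IUPAC nucleotide codes translated to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}

def find_motif_hits(sequence, consensus):
    """Return (start, end) half-open positions of every consensus match."""
    pattern = "".join(IUPAC[base] for base in consensus)
    # The lookahead lets overlapping matches be reported as well.
    return [
        (m.start(), m.start() + len(consensus))
        for m in re.finditer(f"(?=({pattern}))", sequence)
    ]

# Hypothetical 87 bp "peak" with one 7 bp site embedded in the middle.
peak_sequence = "ACGT" * 10 + "TGACTCA" + "GGCC" * 10
consensus = "TGASTCA"  # made-up consensus, S = C or G

hits = find_motif_hits(peak_sequence, consensus)
covered = sum(end - start for start, end in hits)
print(hits)
print(f"{covered} of {len(peak_sequence)} bp fall inside a motif match")
```

Even in this small example most of the interval is not binding site, which is why a line in a peak file is better read as "a region containing signal" than as "a binding site".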
Please note that if you don't give further information, the average reader will have no clear idea about the functions you mention (annPeaks and addGeneIds: are they functions of a package? A piece of software? A suite?). Adding detail to your question will improve the specificity of the answers.
Thanks a lot, from my heart, for your time! I am grateful. I am lost and I have limited time to finish this.
I used R and wrote a script that downloads, according to a user's selections, and analyzes peak files from ChIP-seq experiments in the ENCODE ChIP-seq experiment matrix. I am new to this field and I wanted to find only the upregulated and downregulated genes, so as to use them as input to a visualization tool a colleague of mine created, Minepath.org, which shows relations between genes such as: Gene A inhibits Gene B (A:->B) or Gene A triggers/regulates Gene B (A->B).
Now I know this can't be done with ChIP-seq data, only with RNA-seq.
I have to figure out what I can do with these peaks in order to conduct some computational analysis and have some serious results to present in my master's thesis. Some of the results I get from my app.
Thanks for your reply, @morovatunc and @Satya, but I can't knock down a protein; my validation has to be strictly programmatic, as I am not in a real lab. I created an app in R that, after annotating and filtering the peaks, takes the promoters near the TSS from files I download programmatically from the ENCODE ChIP-seq experiment matrix. I need a validation for what I am doing but don't know what that would be...
Hi atsalaki, I am a bit confused by what you want to achieve in general.
First off, it's not clear what you want to validate, as there are multiple aspects you may want to consider. Sticking to what you can do without producing new data, there are already several things you could do (depending on what your question is). For example, to technically validate the peaks you have, you could search for another ChIP-seq dataset on the same factor in the same (or similar) conditions and check the peak overlap. To validate the specificity of the experiment in relation to the transcription factor that was immunoprecipitated, you could derive transcription factor binding sites (TFBS) from your peaks and compare them with known TFBS. To validate the targets, you could search the literature for support. If relevant, you could compare your signal to other ChIP-seq experiments, on either other transcription factors or histone modifications. Also, if you expect any relationship with expression, you could compare with expression data, and so on.
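As an illustration of the first suggestion (peak overlap with a second dataset), here is a minimal Python sketch. It assumes peaks are given as (chromosome, start, end) tuples with BED-style half-open coordinates. The peak lists are made up and the function names are hypothetical. For real files, an established interval tool (for example bedtools, or the GenomicRanges package in R) would be a better choice than this simple loop.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]

def fraction_overlapping(query_peaks, reference_peaks):
    """Fraction of query peaks that overlap at least one reference peak."""
    if not query_peaks:
        return 0.0
    # Group reference peaks by chromosome to avoid pointless comparisons.
    by_chrom = {}
    for peak in reference_peaks:
        by_chrom.setdefault(peak[0], []).append(peak)
    hits = sum(
        1
        for q in query_peaks
        if any(overlaps(q, r) for r in by_chrom.get(q[0], []))
    )
    return hits / len(query_peaks)

# Hypothetical peak sets, for illustration only.
my_peaks = [
    ("chr1", 100, 300),
    ("chr1", 1000, 1200),
    ("chr2", 500, 700),
    ("chr3", 50, 150),
]
other_peaks = [
    ("chr1", 250, 400),
    ("chr2", 690, 900),
    ("chr3", 150, 300),  # touches the chr3 peak above but shares no base
]

print(fraction_overlapping(my_peaks, other_peaks))
print(fraction_overlapping(other_peaks, my_peaks))
```

The two fractions differ because the peak sets have different sizes, so it is worth reporting the overlap in both directions.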
More generally, it's not clear why you are restricting yourself to peaks falling within 'promoter regions'. As morovatunc suggested, in principle each peak represents a binding event for the transcription factor that has been immunoprecipitated. Considering only promoter-associated peaks would perhaps make sense if you knew in advance that this specific TF is promoter-associated, so that you would expect whatever falls outside promoter regions to be a likely false positive. But even then, you would be excluding all the potentially misannotated promoters, for example.
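One way to see how much is being discarded is to count how many peaks fall inside the promoter window and how many do not. The Python sketch below assumes the window is 5000 bp upstream and 1000 bp downstream of the TSS, which is only one possible reading of the "(5000,1000)" in the question. Gene names, coordinates, and function names are made up for illustration.

```python
def in_promoter(peak_start, peak_end, tss, strand,
                upstream=5000, downstream=1000):
    """True if the half-open peak [peak_start, peak_end) overlaps the
    promoter window around the TSS, taking the gene strand into account."""
    if strand == "+":
        win_start, win_end = tss - upstream, tss + downstream
    else:
        # On the minus strand, upstream means higher coordinates.
        win_start, win_end = tss - downstream, tss + upstream
    return peak_start < win_end and win_start < peak_end

def promoter_genes(peak, genes):
    """Names of genes whose promoter window overlaps the given peak."""
    chrom, start, end = peak
    return [
        name
        for name, (g_chrom, tss, strand) in genes.items()
        if g_chrom == chrom and in_promoter(start, end, tss, strand)
    ]

# Hypothetical annotation: gene name -> (chromosome, TSS, strand).
genes = {
    "GENE_A": ("chr1", 10000, "+"),
    "GENE_B": ("chr1", 50000, "-"),
}
# Hypothetical peaks as (chromosome, start, end).
peaks = [
    ("chr1", 6000, 6200),    # upstream of GENE_A
    ("chr1", 12000, 12200),  # 2 kb downstream of the GENE_A TSS
    ("chr1", 54000, 54300),  # upstream of GENE_B (minus strand)
    ("chr1", 48000, 48200),  # 1.8 kb downstream of the GENE_B TSS
]

assigned = {peak: promoter_genes(peak, genes) for peak in peaks}
n_promoter = sum(1 for hit in assigned.values() if hit)
print(assigned)
print(f"{n_promoter} promoter peaks, {len(peaks) - n_promoter} other peaks")
```

If a large share of the peaks lands in the second group, that is a hint that restricting the analysis to promoters removes a substantial part of the binding signal.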
Thanks a lot for your answer. One more question: the annotated peak files (I used the annPeaks and addGeneIds functions) of a ChIP-seq experiment peak file (e.g. one created from MCF-7 (cell type) with CTCF (TF)), what do they represent? Each line is a TF binding site, am I right?