I'm having trouble tracking down a chromatin database. I'm trying to get a list of gene names so I can check the efficiency of a wet-lab experiment that a post-doc in our laboratory designed for chromatin enrichment.
ChromDB appears to have been abandoned, as the last time it was updated was 2015. Furthermore, the link to the senior programmer is defunct, the PI is emeritus and I get a 404 error trying to download a FASTA file.
3CDB, the Chromosome Conformation Capture Database, run by Beijing Institute of Genomics (BGI), was updated in January 2016, but I get an error when I try to download the chromosome FASTA. I also get a bounced e-mail when I try to contact them.
cisRED, a chromatin motif database, is also giving me a 503 error.
I found 36 chromatin regulators from Chromatin Regulator Cistrome and a dozen more from the Chromatin Motif Database. I also found ~90 histones and histone variants from HIstome.
Surely there has to be a more comprehensive database somewhere; I just don't know where to find it.
Your help would be greatly appreciated.
EDIT TO INCLUDE MORE INFORMATION/BACKGROUND
A post-doc, along with a collaborator, has developed a wet-lab enrichment for chromatin using LC-MS. I'm trying to compare these chromatome pull-downs to whole-proteome experiments to see what proportion of all identified proteins are nuclear, and so how well these assays have worked.
First, I had the idea of exploiting microscopy, specifically immunohistochemistry, to annotate all proteins with their known subcellular localization. I used data from The Human Protein Atlas to annotate both the chromatome samples and whole cell samples to see how efficient the assay was at getting nuclear proteins and also to check if there was any evidence that the chromatin preparation also contains other fractions (mitochondria, ER, lysosomes, etc.). For each sample and each cell compartment, I counted the proteins with spectral counts >0 (i.e. the number of proteins present) and divided by the total number of proteins in that sample to get the relative proportions; I used proportions because the chromatome assays were 1D-shotgun with 1 fraction and the whole cell assays were 2D-shotgun with 50 fractions (higher depth).
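To make the presence-based metric concrete, here is a minimal sketch of how I understand the calculation. The function name, the gene names and the toy numbers are all made up for illustration; the localization table stands in for whatever is exported from The Human Protein Atlas.

```python
from collections import Counter

def compartment_proportions(spectral_counts, localization):
    """Fraction of detected proteins (spectral count > 0) annotated to each compartment.

    spectral_counts: dict mapping gene symbol -> spectral count in one sample
    localization:    dict mapping gene symbol -> list of compartments (hypothetical table)
    """
    detected = [gene for gene, count in spectral_counts.items() if count > 0]
    if not detected:
        return {}
    tally = Counter()
    for gene in detected:
        # Genes missing from the annotation table are grouped under "unannotated".
        for compartment in localization.get(gene, ["unannotated"]):
            tally[compartment] += 1
    # Dividing by the number of detected proteins makes samples of different depth comparable.
    return {compartment: n / len(detected) for compartment, n in tally.items()}

# Toy data (placeholder gene names, not real results).
counts = {"GENE_A": 12, "GENE_B": 0, "GENE_C": 3, "GENE_D": 1}
loc = {
    "GENE_A": ["nucleus"],
    "GENE_B": ["nucleus"],
    "GENE_C": ["nucleus", "mitochondria"],
}
props = compartment_proportions(counts, loc)
for compartment, fraction in sorted(props.items()):
    print(f"{compartment}: {fraction:.2f}")
```

Because a protein with several annotated compartments is counted once per compartment, the fractions can add up to more than 1.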
Here is a figure of the relative proportions: http://tinypic.com/r/2gtaqsy/9. The whole cell experiments are listed as .core (e.g. HAP1_P5242.core compared to all other HAP1 chromatin pull-downs). Based on the figure, it appears there is at best a minimal enrichment of nuclear proteins in the chromatome assays compared to the whole cell proteome. This is not what we saw for selected proteins by western blot. This means either that none of the chromatome sample preps for mass-spec worked (unlikely) or that the metric I used was not ideal. I thought that perhaps this is because I was simply counting the presence of proteins and not taking the actual abundance (spectral counts) into account.
Therefore, I tried to come up with a better scoring system; I have attached a "back-of-the-envelope" calculation for my proposed scoring system http://tinypic.com/r/2vnkpvt/9. Briefly, to come up with a pseudocount I 1) weighted spectral counts by the quality of annotation and 2) penalized proteins which have evidence of being present in multiple cell compartments (i.e. gave a higher weight to those only found in the nucleus). I took the sum of these pseudocounts for a subcellular location, divided it by the total sum of all pseudocounts for every protein in the sample, and got a rather disappointing result http://tinypic.com/r/105r7s2/9. (The proportions add up to more than 100% because some proteins are found in multiple subcellular compartments.)
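For readers who cannot open the image, a rough sketch of this kind of scoring is below. It is not the exact calculation from the figure: the reliability labels, the weight values and the choice to divide by the number of compartments are assumptions made only to show the idea.

```python
# Assumed weights per annotation-quality label; illustrative values only.
RELIABILITY_WEIGHT = {"enhanced": 1.0, "supported": 0.75, "approved": 0.5, "uncertain": 0.25}

def pseudocount(spectral_count, reliability, n_compartments):
    """Spectral count weighted by annotation quality, penalized for multi-localization."""
    return spectral_count * RELIABILITY_WEIGHT[reliability] / n_compartments

def compartment_scores(proteins):
    """proteins: list of dicts with keys 'count', 'reliability', 'compartments'.

    Returns, per compartment, the summed pseudocounts divided by the sum of
    all per-protein pseudocounts in the sample.
    """
    total = 0.0
    per_compartment = {}
    for protein in proteins:
        compartments = protein["compartments"]
        if not compartments:
            continue  # nothing to score without a localization
        pc = pseudocount(protein["count"], protein["reliability"], len(compartments))
        total += pc
        for compartment in compartments:
            per_compartment[compartment] = per_compartment.get(compartment, 0.0) + pc
    if total == 0:
        return {}
    return {c: s / total for c, s in per_compartment.items()}

# Toy sample (made-up numbers).
sample = [
    {"count": 10, "reliability": "enhanced", "compartments": ["nucleus"]},
    {"count": 8, "reliability": "approved", "compartments": ["nucleus", "cytosol"]},
]
scores = compartment_scores(sample)
print(scores)
```

In this sketch a multi-localizing protein adds its pseudocount to every compartment it is annotated with but only once to the denominator, which is why the per-compartment proportions can add up to more than 100%.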
The post-doc did a nuclear versus cytosolic fractionation of various cell lines (using RCC1 as a nuclear control and tubulin as the cytoplasmic control) and she gave me a list of gene names from the nuclear fractionation. I calculated how many of these were found relative to the total number of proteins in each sample to get this figure http://tinypic.com/r/v5hki0/9. The x-axis shows the percentage. It looks a bit better (at least for the K562 and HAP1 samples), but I would still have expected the chromatome pull-downs to perform better.
RCC1 is not always bound to chromatin but rather stays in the nucleus "floating" and associates with chromatin only when needed (during mitosis, I believe). Perhaps I'm not seeing a clear enrichment of nuclear proteins in the chromatome pull-downs because it is not a nuclear extraction but a chromatin extraction (the insoluble part). So ideally what I would like is a database of common chromatin-associated proteins (readers, writers, erasers) to re-run this analysis. I would just need a list of official gene symbols (like the ones required in DAVID, step 2 "select identifier"). These could include predictive chromatin features such as histones (both canonical and variant histones), regulators such as the SWI/SNF chromatin remodeler complex and associated proteins, TFs, etc. Regarding your question about "some sort of chromatin feature (which one?)", I suppose it would be useful to group chromatin features into different categories according to their known function in transcriptional regulation. We are interested in finding de novo features in the nucleus, in addition to known chromatin features, so I would like to cast a wide net.
When you say finding gene names is easy, I guess I just don't really know where to look; there is a wide range of ENCODE or ROADMAP datasets measured by different sequencing techniques (ChIP-seq, CAGE, RNA-PET, RNA-Seq, ATAC-Seq, ChIA-PET, Hi-C, etc.) and it's a bit overwhelming to work out where to find exactly what I need.
My previous approaches have been somewhat less than elegant, and perhaps someone can suggest a more sophisticated technique/analysis for assessing the efficacy of these chromatome pull-downs and ultimately for developing a cutoff that will distinguish real from background binding.
Gene names according to what? Finding gene names is easy, but, given what you've written, you presumably want these annotated by some sort of chromatin feature (which one?).
I edited the post to describe in more detail what I was trying to achieve. This should give you (and other biostars users) a better understanding of my approach and potentially suggest alternatives. Thank you very much.
Thanks, that's much clearer now!
Concerning the problem with gene names/symbols/identifiers, work with one reference genome annotation; don't try to mix and match. I would suggest using Ensembl, as a lot of info is already integrated there. To do things properly you would probably need to remap all identified MS peptides to the chosen reference proteome, because the set of identified proteins depends on the protein set the peptides are mapped to. All MS experiments are plagued by non-specific binders. There are lists of known "contaminants", for example in the CRAPome database, that you can start with. You could assess your chromatin fraction with a typical GO enrichment analysis looking at terms related to chromatin. Another approach to assess purity is to look at contaminants from other organelles such as mitochondria. You could also take a few well-characterized complexes and check how many of the subunits you find. For ideas, have a look at previous similar studies such as this one which looked at the composition of mitotic chromosomes.
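As a rough illustration of the last suggestion (checking how many subunits of a well-characterized complex are recovered), something along these lines would do. The function name is made up, and the subunit list is a partial, illustrative set of SWI/SNF members rather than a curated definition of the complex; a real analysis should take complex membership from a curated resource.

```python
def complex_coverage(detected_symbols, complexes):
    """For each complex, report which subunits were detected and the fraction recovered.

    detected_symbols: iterable of gene symbols identified in the sample
    complexes:        dict mapping complex name -> list of subunit gene symbols
    """
    detected = {symbol.upper() for symbol in detected_symbols}
    report = {}
    for name, subunits in complexes.items():
        members = {symbol.upper() for symbol in subunits}
        found = sorted(members & detected)
        fraction = len(found) / len(members) if members else 0.0
        report[name] = (found, fraction)
    return report

# Partial, illustrative subunit list (an assumption, not a complete complex definition).
complexes = {
    "SWI/SNF (partial)": ["SMARCA4", "SMARCB1", "SMARCC1", "SMARCC2", "ARID1A"],
}
# Toy list of detected proteins.
detected = ["SMARCA4", "SMARCC1", "TUBB", "RCC1"]

report = complex_coverage(detected, complexes)
for name, (found, fraction) in report.items():
    print(f"{name}: {len(found)} subunits found ({fraction:.0%}): {', '.join(found)}")
```

Comparing this coverage between the chromatome samples and the whole cell samples gives a simple sanity check that does not depend on a single marker protein.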
1) The Human Protein Atlas contains both Ensembl IDs and gene names; however, the data set provided to me by our mass-spec group has Entrez_ID and Gene_symbol. This is why I was comparing the Official_Gene_Symbol (which is what DAVID calls them, e.g. SMARCA4) from my dataset to The Human Protein Atlas to get its subcellular location (e.g. nucleoplasm). Alternatively, one could use BioMart to convert Entrez IDs to Ensembl IDs (or vice versa); either way it's the same gene, so I'm failing to understand your point when you say "there is a lot of info already integrated in there [Ensembl]". Could you please elaborate?
2) Essentially I am mapping to reference proteome(s). Assays that enriched for the "chromatome" are being compared to a core proteome (i.e. total lysate).
3) Figure 3 from my post (http://tinypic.com/r/105r7s2/9) was assessing the purity by looking for proteins from other organelles like mitochondria and ER, which mainly contain specifically located proteins, whereas the plasma membrane and nuclear substructures have multilocalizing proteins (Thul et al., 2017 Science). Note: I previously called multilocalizing proteins *promiscuous* for lack of a better term.
4) Thank you for your suggestion about assessing with a GO enrichment analysis. If the chromatome assay has more genes mapping to GO terms associated with the nucleus than the core proteome, then I could say the assays have worked. I'm just wondering how I would normalize for the total number of proteins, though. The core proteome reference was done at a greater depth and so has many more proteins: the core reference has n=9231, compared to the chromatome assays, which range from n=500 to n=2000.
5) The CRAPome database contains lists of proteins identified from negative control experiments using affinity purification followed by mass spec (AP-MS). Since my data is LC-MS I wouldn't be able to use workflow 3; however, I believe what you suggest would be workflow 2. I would select the filter for nuclear fraction (none for cell type because they don't have the HAP1, K562, MDS or A673 cell lines; also, I wouldn't be able to select any affinity approach filters because my experiment was not an affinity purification). So now I have a data matrix that shows the average spectral count and number of experiments. What to do next is not very clear from the tutorial, so I read the paper. From my understanding, I think I'm supposed to filter out all proteins from my chromatome assays whose spectral count is less than or equal to the average spectral count from CRAPome. Is this correct?
As this reply is already a bit long I'll stop it here and read the study you provided; thanks for providing primary literature in your answer.
Thank you very much for taking the time to inform someone who is new to mass spectrometry; I really appreciate it!
P.S. For anyone interested I've found a more comprehensive database (than Chromatin Regulator Cistrome, Chromatin Motif Database, and HIstome) of proteins involved in epigenetic regulation: EpiFactors, although it was last updated in 2015.
Ensembl annotates genomic sequences with more than just genes/transcripts/proteins. For example, genes are annotated with GO terms, so you could easily retrieve genes annotated with chromatin-related terms; proteins are also annotated with domains, so you could look for chromatin-binding domains. Enrichment analysis takes care of the sizes of the gene sets since it analyses a contingency table. As for the CRAPome data, I didn't use it to set a threshold but more as a guide for identifying non-specific binders, to complement the negative controls I had. I don't remember the paper, but my view is that since those are negative control experiments, anything they pull down should be sticking non-specifically, with the more abundant being the worst "offenders". The threshold my collaborators have used in the past (for AP-MS) is to remove any protein identified by fewer than two peptides and with a Mascot score below 30. If you have replicates, you could also consider as "noise" any protein that is not found in the majority of the replicates.
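To show why set size is handled by the contingency table, here is a minimal sketch of a one-sided over-representation test (the hypergeometric tail, which is what a one-sided Fisher's exact test computes). It assumes the chromatome sample is treated as a draw from a background such as the core proteome; the function name and the example numbers are made up.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """Probability of seeing at least k annotated proteins in a sample by chance.

    N: proteins in the background (e.g. the core proteome)
    K: background proteins annotated with the term (e.g. a chromatin GO term)
    n: proteins in the sample (e.g. one chromatome pull-down)
    k: sample proteins annotated with the term
    """
    upper = min(n, K)
    denominator = comb(N, n)
    # Sum the hypergeometric probabilities for k, k+1, ..., min(n, K).
    numerator = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return numerator / denominator

# Toy numbers: background of 10 proteins, 4 annotated; sample of 3, of which 2 annotated.
p = enrichment_p_value(k=2, n=3, K=4, N=10)
print(f"p = {p:.4f}")
```

Both the sample size n and the background size N enter the calculation directly, so a shallow sample and a deep reference can be compared without a separate normalization step. In practice you would also correct for testing many GO terms.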