Question

News: Example of how bioinformaticians can publish in Scientific Reports (Nature Publishing Group) using publicly available NGS data

12

7.8 years ago

David Langenberger 11k

Changes of bivalent chromatin coincide with increased expression of developmental genes in cancer

Abstract

Bivalent (poised or paused) chromatin comprises activating and repressing histone modifications at the same location. This combination of epigenetic marks at promoter or enhancer regions keeps genes expressed at low levels but poised for rapid activation. Typically, DNA at bivalent promoters is only lowly methylated in normal cells, but frequently shows elevated methylation levels in cancer samples. Here, we developed a universal classifier built from chromatin data that can identify cancer samples solely from hypermethylation of bivalent chromatin. Tested on over 7,000 DNA methylation data sets from several cancer types, it reaches an AUC of 0.92. Although higher levels of DNA methylation are often associated with transcriptional silencing, counter-intuitive positive statistical dependencies between DNA methylation and expression levels have been recently reported for two cancer types. Here, we re-analyze combined expression and DNA methylation data sets, comprising over 5,000 samples, and demonstrate that the conjunction of hypermethylation of bivalent chromatin and up-regulation of the corresponding genes is a general phenomenon in cancer. This up-regulation affects many developmental genes and transcription factors, including dozens of homeobox genes and other genes implicated in cancer. Thus, we reason that the disturbance of bivalent chromatin may be intimately linked to tumorigenesis.

read complete publication: http://www.nature.com/articles/srep37393
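For readers unfamiliar with the AUC figure quoted in the abstract, here is a minimal sketch of how an AUC can be computed from a single per-sample score. It is not the authors' classifier: the scores, the labels, and the idea of using a mean beta value over bivalent-promoter probes as the score are all hypothetical and only illustrate the metric.

```python
def auc_from_scores(scores, labels):
    """AUC as the probability that a randomly chosen positive sample
    scores higher than a randomly chosen negative one (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical data: one methylation score per sample (assumption: mean beta
# value over bivalent-promoter probes), label 1 = cancer, 0 = control.
scores = [0.12, 0.15, 0.30, 0.41, 0.38, 0.10, 0.22, 0.45]
labels = [0, 0, 1, 1, 1, 0, 1, 0]

auc = auc_from_scores(scores, labels)
print(auc)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which is the scale on which the reported 0.92 should be read.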

publication NGS • 5.4k views

ADD COMMENT • link updated 18 months ago by Ram 44k • written 7.8 years ago by David Langenberger 11k

4

This is an article in Scientific Reports, which is NPG's equivalent to PLOS One. Although I care more about the quality of the work than where it is published, I wouldn't refer to all journals published by NPG as "Nature".

ADD REPLY • link 7.8 years ago by Lars Juhl Jensen 11k

4

Combining gene mutation with gene expression data improves outcome prediction in myelodysplastic syndromes

Check out this article as well. It was done entirely with publicly available microarray data and published in Nature Communications. I highly recommend looking at the reproducible code in their supplement. Beautiful R code on regression models and plotting, compiled with knitr!

ADD REPLY • link 7.8 years ago by poisonAlien ★ 3.2k

2

Sorry... What is the purpose of this post? I mean, interesting paper but... Why post it on Biostars?

ADD REPLY • link 7.8 years ago by dariober 15k

2

It nicely shows how one can use publicly available data, which were created to answer completely different questions, for a completely new analysis. And these results can be published in Nature. I think this is good news for bioinformaticians, who pretty often think they can only work with wet labs and expensive sequencing runs.

ADD REPLY • link 7.8 years ago by David Langenberger 11k

2

Fair enough, but I think ENCODE, 1000 Genomes, BLUEPRINT, etc. were produced and made public in part with the idea of enabling other researchers to mine these data. There are a lot of papers using these data, so I'm not sure this paper is special in this respect.

ADD REPLY • link 7.8 years ago by dariober 15k

5

I did not claim that it is special in this respect. It is just an example. I can delete the post if that makes you feel better. I am not in the mood for this discussion, sorry. It is a new year and I do not want to spam anyone.

I know the people who wrote it and they were proud of the fact that they could publish it that high without expensive experiments. So I thought it might be worth sharing this experience.

ADD REPLY • link 7.8 years ago by David Langenberger 11k

0

Sorry... I was just trying to understand...

ADD REPLY • link 7.8 years ago by dariober 15k

1

You don't have to be sorry. I got your point and changed the title. I just don't want to make a mountain out of a molehill.

I like discussions, but sometimes it is just not worth it. ;)

ADD REPLY • link 7.8 years ago by David Langenberger 11k

1

I had the same question as dariober, but your answer makes sense. Perhaps including that in the top post would clarify quite a bit.

ADD REPLY • link 7.8 years ago by WouterDeCoster 47k

1

Well, good point. I changed the title.

ADD REPLY • link 7.8 years ago by David Langenberger 11k

score 3 · Answer 1 · 2017-01-03

3

7.8 years ago

John 13k

While data-driven science is obviously a nice prospect for people who live and breathe biological data, there are some serious issues that I think need to be addressed before it can be accepted in quite the same way that traditional hypothesis-driven research can be. There can be over 1 million parameter tweaks in a given pipeline, all of which generate a different answer, each with some probability of being true or false. While it would take an individual a significant amount of time to do 1 million different actual experiments, p-hacking/result-hacking can be done programmatically overnight. I'm not saying that this paper or any other does indeed use such sneaky techniques - I'm just saying that a researcher who needs to determine how reliable the findings of a paper are couldn't possibly know.

Data-driven research has yet to provide a reliable way to feel confident about the conclusions it is coming to. Between very small but highly significant changes, and publications that only document 10% of the computational work done, I find myself seeing, accepting, but never really believing, the conclusions of such papers. To my own loss, most likely. Hopefully pure in silico research will find itself being replicated by others, ideally with different computational tools but still arriving at the same answers. I hope that sort of thing becomes the norm.
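To put a rough number on the concern about trying many pipeline variants, here is a small sketch of how quickly the chance of at least one false positive grows with the number of analyses tried. It assumes every analysis is an independent test of a true null hypothesis at a threshold of 0.05, which is a simplification: variants of one pipeline run on the same data are correlated, so real numbers will differ. The function names are made up for illustration.

```python
import random


def family_wise_error(n_tests, alpha=0.05):
    """Probability of at least one p < alpha among n_tests independent
    tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulate_any_hit(n_tests, n_repeats=2000, alpha=0.05, seed=0):
    """Monte Carlo check of the formula above: under the null, p-values are
    uniform on [0, 1], so draw n_tests of them and look at the smallest."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_repeats):
        if min(rng.random() for _ in range(n_tests)) < alpha:
            hits += 1
    return hits / n_repeats


for n in (1, 10, 100):
    print(n, round(family_wise_error(n), 3), simulate_any_hit(n))
```

Under these assumptions, ten independent attempts already give roughly a 40% chance of at least one nominally significant result, which is why reporting every analysis that was tried matters.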

ADD COMMENT • link 7.8 years ago by John 13k

3

Right. Without source code this paper is certainly no example of reproducible research, even though, being both in silico and based on public data, it was well suited to be an exemplar.

ADD REPLY • link 7.8 years ago by Jeremy Leipzig 22k

1

When should we expect to see a scientific report (or a real paper) from you? Is the thesis finally done? Since it is New Year thought I should ask :)

ADD REPLY • link 7.8 years ago by GenoMax 146k

0

If you had asked me 6 months ago if I'd have it done by New Year's I'd have said yes, absolutely - but alas I'm probably still a few weeks away. I've been very unwell the past two months (and I think it shows, I've been very inactive on Biostars lately), so everything has dragged out a bit longer than I'd hoped.

ADD REPLY • link 7.8 years ago by John 13k

0

Sorry to hear that. Hope you feel better soon.

ADD REPLY • link 7.8 years ago by GenoMax 146k

1

This point is well taken. I completely agree. But how boring would a bioinformatician's life be without heuristics, statistics and black boxes. :)

ADD REPLY • link 7.8 years ago by David Langenberger 11k

0

Heheh, well it would certainly be less interesting :) Although sometimes I feel like I'm a researcher of black boxes rather than biological data -_-;

ADD REPLY • link 7.8 years ago by John 13k

1

I agree completely that results should be considered provisional until replicated/validated, but the same is true of hypothesis-driven research. And many of the issues that you raise (cherry-picking data, inadequate documentation of methods) are not limited to data-driven science. While p-hacking may be easier, I can assure you that bench scientists (of which I am a member) are every bit as capable of manipulating data to get the results they want. Note that I am not so cynical as to believe that this behavior is the norm (for either data or bench science) but, as always, caveat emptor.

ADD REPLY • link 7.8 years ago by harold.smith.tarheel ★ 5.0k

1

You're right, of course; however, I still think that sort of trick is much harder to pull off on the wetlab side of things. The reagents are expensive, and the work laborious. I think we wetlab scientists often try multiple initial experiments, and if they don't come back with promising results, that path isn't pursued further unless there is really strong prior evidence that something interesting can be found.

Conversely, in silico, if you don't like the results DESeq gives you, there's always Cufflinks. I think that low time/resource cost to just try an experiment another way is the problem - and you're very right that it'll probably become more of a problem for wetlab stuff as more experiments become automated and are cheaper to perform. Hm.

ADD REPLY • link 7.8 years ago by John 13k

score 1 · Answer 2 · 2017-01-02

1

7.8 years ago

Sinji ★ 3.2k

This is interesting. I wonder if they set out to test this hypothesis and, if so, what made them interested in pursuing it? Or if they were simply mining data and happened upon this discovery.

ADD COMMENT • link 7.8 years ago by Sinji ★ 3.2k

1

They coincidentally saw this behaviour in lymphoma and then tested it in the other cancer types.

ADD REPLY • link 7.8 years ago by David Langenberger 11k

score 0 · Answer 3 · 2017-01-03

0

7.8 years ago

Lluís R. ★ 1.2k

Interesting example! Thanks for sharing!

I am not very familiar with methylation experiments, and I read the methods section with interest, but I couldn't find any mention of the normalizations applied (which makes me wonder whether I need more background to understand how they reach those conclusions, or whether I don't know how to read articles). Shouldn't each study be normalized before being compared? Aren't there batch/study effects?

Maybe that would be a question on its own, but as you know the authors, maybe you are familiar with the analysis.

ADD COMMENT • link 7.8 years ago by Lluís R. ★ 1.2k

2

I think this is a yes-and-no question. You are right, normally you would normalize all HM450k data together to make them comparable with each other. But in my experience, the beta values are already normalized in a sense, i.e. bounded in [0,1], so normalizing against other arrays does not change a lot, as long as they have been normalized within their own group. In this study we used an intra-array normalization (i.e. methylation relative to the average beta value of the same array) for the cancer-control classification. Thus, the data were normalized. For the expression/methylation relations, we only made comparisons within single studies, so we used the published, normalized data. As we found all cancers to behave similarly (rather than grouping by sequencing center, etc.), we concluded that the findings are not batch effects. This also shows that the cancer effect is stronger than any possible batch effect.
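As a rough illustration of what an intra-array normalization like the one described above could look like, here is a minimal sketch. It assumes "relative to the average" means subtracting the array's own mean beta value; a ratio would be an equally plausible reading, and the post does not say which was used. The probe names and values are hypothetical.

```python
from statistics import mean


def normalize_within_array(betas):
    """Express each probe's beta value relative to the mean beta value of
    the same array (assumption: difference from the mean, not a ratio).
    Missing measurements (None) are ignored for the mean and kept as None."""
    observed = [b for b in betas.values() if b is not None]
    if not observed:
        raise ValueError("array has no measured beta values")
    array_mean = mean(observed)
    return {
        probe: (b - array_mean if b is not None else None)
        for probe, b in betas.items()
    }


# Hypothetical array: probe name -> beta value in [0, 1]
sample = {"probe_a": 0.2, "probe_b": 0.4, "probe_c": 0.9, "probe_d": None}
normalized = normalize_within_array(sample)
print(normalized)
```

Because each array is centred on its own average, a global shift affecting a whole array (one possible kind of batch effect) cancels out, while probes that are unusually methylated relative to the rest of the same array still stand out.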