Question

Publications with poor sequencing data

4

9.2 years ago

jotan ★ 1.3k

I frequently download datasets (mostly ChIP) from published papers for use in my own work. Just as frequently, I find datasets of exceptionally poor quality: for example, 2 million ChIP-seq reads across the mouse genome, with very low read quality and coverage, poor peak calling, etc. These data have been used in "high impact" publications to justify conclusions with wide-ranging implications.
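
To put the 2 million read example in perspective, here is a back-of-the-envelope sketch of the mean genome-wide depth. The 50 bp read length and the roughly 2.7 Gb mouse genome size are assumptions for illustration, not figures taken from any of the datasets discussed here, and ChIP-seq reads are of course expected to pile up in peaks rather than spread evenly.

```python
def mean_coverage(n_reads: int, read_length_bp: int, genome_size_bp: float) -> float:
    """Average per-base depth if reads were spread uniformly over the genome."""
    return n_reads * read_length_bp / genome_size_bp


N_READS = 2_000_000        # from the example in the post
READ_LENGTH_BP = 50        # assumption: typical short single-end read
MOUSE_GENOME_BP = 2.7e9    # assumption: approximate mouse genome size

coverage = mean_coverage(N_READS, READ_LENGTH_BP, MOUSE_GENOME_BP)
reads_per_mb = N_READS / (MOUSE_GENOME_BP / 1e6)

print(f"mean depth: {coverage:.3f}x")
print(f"reads per Mb: {reads_per_mb:.0f}")
```

Under these assumptions the average depth is well below 0.1x, which leaves very little signal to separate real enrichment from background.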

1. Has anyone else ever encountered these problems?
2. Do peer reviewers ever check next-gen sequencing results?
3. What steps can I take to try to correct the literature?

Any advice or comments would be appreciated.

next-gen • 2.4k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by jotan ★ 1.3k

1

It would be interesting to collect some of these publications (the higher the "impact factor" the better) and write a paper on this phenomenon.

A related post on how a lot of the data deposited for the Ebola project does not contain data for the virus: How to find the mapping percentages for data deposited in the Zaire ebolavirus bioproject from the 2014 outbreak

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Istvan Albert 101k

1

Also to take into account: there's a lot of Mycoplasma contamination going on

ADD REPLY • link 9.2 years ago by Israel Barrantes ▴ 790

0

I guess it depends on several things. The sequencing data may not be the best, but if it is only one of dozens of experiments that all agree, and the paper makes a good argument and a relevant discovery, it can still be published well.

Like this paper for example: http://www.nature.com/nature/journal/v523/n7559/full/nature14452.html

They pooled the samples and ran the DE analysis using monoclates and low sequencing coverage. Maybe they just wanted to save a lot of money, and this data was enough for their purpose.

I guess it is a case by case issue.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by tiago211287 ★ 1.5k

0

Sure, but what if it's one single experiment with no other findings to support that particular conclusion? And the sequencing data is not only not the super best, but well and truly terrible?

ADD REPLY • link 9.2 years ago by jotan ★ 1.3k

1

It'll depend on the reviewers and the story then. If you can make an interesting story then the quality and reliability of your data doesn't matter much (have a look at papers in Nature and Science, many are great, many are complete crap but have a good (and likely wrong) story).

ADD REPLY • link 9.2 years ago by Devon Ryan 104k

0

I guess it is unlikely to get into a good journal.

ADD REPLY • link 9.2 years ago by tiago211287 ★ 1.5k

0

These are in very good journals. If the reviewers aren't checking, no one knows.

ADD REPLY • link 9.2 years ago by jotan ★ 1.3k

Ram · Answer 1 · 2015-09-27

7

9.2 years ago

matted 7.8k

I don't have an answer to questions 2 and 3, but for 1, an interesting paper was published last year:

"Large-Scale Quality Analysis of Published ChIP-seq Data" G3 (2014), available here.

They have a very interesting supplementary figure plotting ChIP-seq data quality against journal impact factor.

Clearly there are many confounding factors at play, but it's an interesting analysis.

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by matted 7.8k

2

that's pretty interesting/amusing - one logical and cynical explanation could be that poor data may "support" more outlandish statements

ADD REPLY • link 9.2 years ago by Istvan Albert 101k

1

That's funny, sad and true. Now that I think about it, it is mostly the top-notch journals where I find the worst data. Can we please start using PubPeer to mark out questionable datasets?

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by jotan ★ 1.3k

Ram · Answer 2 · 2015-09-27

3

9.2 years ago

Devon Ryan 104k

1. That sounds completely normal. Especially the early papers had crap quality, though it's not unusual to run into that still. Most pure wet-lab folks seem to think that anything involving NGS is magic, so the normal standards get ignored half the time.
2. Typically not. Similarly, most reviewers don't really check the methods section. This sort of thing is only likely to get checked if a real bioinformatician reviews a paper (and frequently not even then).
3. This is what things like PubPeer and PubMed Commons are for.

ADD COMMENT • link 9.2 years ago by Devon Ryan 104k

0

I've tried posting a few comments on PubPeer, but there does not seem to be any interest and I got no response. The PubPeer community seems more geared towards image manipulation. It's just as easy to manipulate sequencing data and just as easy to check for manipulation, but I guess the bioinformatics community is not yet policing itself. It would be helpful for everyone if we could take this a little more seriously.

ADD REPLY • link 9.2 years ago by jotan ★ 1.3k

0

If you're expecting a reply from the authors, that won't happen anywhere. The best you can do is warn others.

Regarding self-policing, this is a general issue with peer review, and yes, it would be nice if reviewers spent less time trying to get their papers cited and more time looking at the raw data... but the incentives aren't really there at the moment.

ADD REPLY • link 9.2 years ago by Devon Ryan 104k

0

Yeah, you're absolutely right. I don't know what I was expecting. I guess I was hoping there would be a better solution. It's disheartening to realise how much of the literature just cannot be trusted.

ADD REPLY • link 9.2 years ago by jotan ★ 1.3k

0

Agreed. On a related note, I still have on my "to do" list to finish writing a paper describing how many of the published hippocampus RNAseq/microarray studies that have been done aren't to be trusted (sample contamination). There are a variety of issues all around...always be wary of results, even for things you did yourself.

ADD REPLY • link 9.2 years ago by Devon Ryan 104k

1

Good advice. I'm not trying to say I'm perfect either. When handling lots of large datasets, it's easy for errors to creep in. There is a difference between inadvertent errors and using bad quality data though. It seems like anyone with a computer running Bowtie can call themselves a ChIP-seq expert and real bioinformaticians are too thin on the ground to QC this.

I am not a real bioinformatician. My training is all informal. I always try to get someone with more experience to double-check my work, but even this is hard.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by jotan ★ 1.3k

0

No arguments from me and I'm with you 100% that we, as a community, should start making mention of this sort of thing more (e.g., with PubPeer).

ADD REPLY • link 9.2 years ago by Devon Ryan 104k