Hi all,
I tried to reproduce the results of a published paper and was unable to. I have now dived deeper into their NGS data (Illumina HiSeq, held on SRA) and see that the data was clearly 'cleaned' **. I am also now thinking the data might have been outright fabricated.
I'm wondering if others have encountered this and 1) how to verify that the data is fake (from a technical, forensic standpoint), and 2) if the data is fake, how to handle the situation.
**I believe the data was cleaned because 100% of reads pass cutadapt, even though 70% of reads contain adapters and get trimmed. I find this situation to be impossible (but please correct me if I'm wrong!).
TIA!
While I think this is an interesting case, I've found cleaned 'raw' data on SRA before. It happens: bioinformaticians receive the raw data, they run a standard cleaning pipeline and pass it on to the lab people, 6 months pass, it's time to upload to SRA, the raw data is long gone, and only the first-pass cleaned data survived on someone's external hard drive and goes to SRA. I've also received raw reads from sequencing providers with basic trimming applied, i.e., no adapters etc.
Yeah, some of the Illumina machines pre-trim adapters (not sure if this is still the standard). I've also definitely seen quality-trimmed data on GEO. But I'm not sure 'cleaned' is equivalent to fabricated.
My first question would be how strong your background in such analysis is. A claim of fabrication is very serious, so be 100% sure you can back it up and make sure it's not due to a potentially flawed analysis or interpretation. With the given information it is hard to comment in more detail.
Data being cleaned could simply mean that the raw fastq files were accidentally lost and only the trimmed ones were retained and uploaded. Not good, obviously, but not fabrication.
Very strong. PhD+several years working in the field.
To expand further, the data is RNA-seq. Some samples are antibody enriched and some are 'control' ribodepleted RNA. Based on picard 'CollectRnaSeqMetrics', the data looks like someone did shotgun sequencing of a genome and then spiked in exactly the data they wanted to get the peaks of the antibody enrichment. The supposedly directional reads have 40% 'wrong orientation', and intergenic reads are #1, followed by intronic at #2 and finally transcripts. rRNA content is unbelievably low (<0.01%). These alone would be hallmarks of DNA contamination, but because the RNA reads are exactly what the paper would want to see, it's very hard to believe the data isn't faked.
I wrote here to ask the community if there is perhaps an algorithm developed to check the reads for a pattern, or for something else that is non-random, that could be an absolute smoking gun for the data being fabricated.
I'd say post on pubpeer -- it's the best forum for this sort of discussion.
As for what additional analysis I recommend: I'd say look at splice junctions. All RNA-seq data should have a good amount of spliced transcripts, even in nucleus extracts. I'd say also look at gene body coverage. Usually, you'll see either something that's uniform/central, 3' biased, or 5' biased. Also, plot counts vs. transcript length.
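To make the splice-junction check concrete, here is a minimal sketch in Python. It assumes you have alignments from a splice-aware aligner available as SAM text (for example, the output of `samtools view`), and it treats any CIGAR containing an `N` operation as a spliced read. The function name `spliced_fraction` is my own, not from any tool.

```python
import re

# Matches one CIGAR operation, e.g. "76M" or "1200N".
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def spliced_fraction(sam_lines):
    """Fraction of mapped primary alignments whose CIGAR contains an N op.

    sam_lines: any iterable of SAM-format text lines (header lines allowed).
    """
    mapped = 0
    spliced = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # not a valid alignment record
        flag = int(fields[1])
        cigar = fields[5]
        # Skip unmapped (0x4), secondary (0x100) and supplementary (0x800).
        if flag & 0x4 or flag & 0x100 or flag & 0x800 or cigar == "*":
            continue
        mapped += 1
        if any(op == "N" for _, op in CIGAR_OP.findall(cigar)):
            spliced += 1
    return spliced / mapped if mapped else 0.0
```

A value near zero would be what you'd expect from genomic DNA rather than RNA, though what counts as "a good amount" depends on read length and library type, so compare against known-good RNA-seq rather than a fixed cutoff.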
You can compare these things to dozens of other RNA-seq datasets from different assay types, different tissue/cell types, different experimental conditions, different library strategies, etc. If you look at 100 datasets (and your sampling is good), and that one dataset looks strikingly different from the remaining 99 datasets, that means there's an anomaly. You can then also try this for a few WGS datasets and see if that dataset looks similar to those.
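One way to put a number on "strikingly different" is a robust z-score per metric (median and MAD instead of mean and standard deviation, so the suspect dataset doesn't distort its own baseline). This is only a sketch: the function names, the 3.5 threshold, and the example metric are my assumptions, not an established forensic test.

```python
from statistics import median


def robust_z_scores(values):
    """Robust z-scores using the median and the scaled MAD."""
    values = list(values)
    med = median(values)
    mad = median([abs(v - med) for v in values])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    scores = []
    for v in values:
        if scale == 0:
            # All reference values identical: anything different is extreme.
            scores.append(0.0 if v == med else float("inf"))
        else:
            scores.append((v - med) / scale)
    return scores


def flag_outliers(metric_by_dataset, threshold=3.5):
    """Return dataset names whose |robust z| exceeds the threshold."""
    names = list(metric_by_dataset)
    scores = robust_z_scores(metric_by_dataset[n] for n in names)
    return [n for n, z in zip(names, scores) if abs(z) > threshold]
```

For example, feeding it the spliced-read fraction of each dataset (keys are dataset names, values are the fractions) returns the names that sit far from the bulk. An outlier here says "anomalous", not "fabricated".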
The "smoking gun" would be to have other labs repeat the same experiment and realize they can't reproduce the study's findings.
I'm a bit against pubpeer - have you ever posted there?
My gripe with them is that the posts are heavily moderated and EDITED to the point that the original post may lose its intention. Moderation is fine, but the editing part is wrong. IMO the right approach to this is to bounce a post back and forth between the moderators and the author of a post until something is agreed on. But what happens is that a post is edited and posted without the original author's approval.
I don't follow your argument for distinguishing between genomic contamination and fraud. A data set being bad in terms of genomic contamination does not mean it's bad in other ways.
The other possibility is that the RNAseq library is very high quality, but there was a mix up with indexes at the sequencing facility.
In terms of a smoking gun, I think it would be almost impossible to distinguish between a well-done fraud and a messed-up experiment. You could look at the fragment size distribution; that is something someone might forget to vary if they were faking data, but a good fraud could easily fake that if they remembered.
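For the fragment size idea, a rough sketch in Python, assuming paired-end alignments as SAM text and using the TLEN column as a proxy for fragment size. Be aware that for spliced RNA-seq alignments TLEN spans introns, so this is only indicative. The function names are hypothetical.

```python
from statistics import median


def insert_sizes(sam_lines):
    """Collect positive TLEN values from properly paired primary alignments."""
    sizes = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # not a valid alignment record
        flag = int(fields[1])
        tlen = int(fields[8])
        if flag & 0x900:
            continue  # secondary (0x100) or supplementary (0x800)
        # 0x2 = properly paired; tlen > 0 counts each pair only once.
        if flag & 0x2 and tlen > 0:
            sizes.append(tlen)
    return sizes


def summarize(sizes):
    """Small per-sample summary; returns None if there is nothing to summarize."""
    if not sizes:
        return None
    return {"n": len(sizes), "median": median(sizes),
            "min": min(sizes), "max": max(sizes)}
```

Running this per sample and comparing the summaries shows whether independently prepared libraries have suspiciously identical distributions, which real library preps rarely do.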
If it were standard RNAseq, you could check for the dispersion in read counts, but 1) this would, again, be a sign of sloppy faking; a good faker could easily fix that, and 2) who knows what mixing pull-down RNAseq with genomic contamination would do to the count distribution.
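For what it's worth, a dispersion check along these lines can be sketched as below. It is a method-of-moments estimate assuming the usual negative binomial relationship var = mean + alpha * mean^2, and it assumes the replicate counts for a gene are already normalized for library size. The function name is made up for illustration.

```python
from statistics import mean, variance


def dispersion_summary(counts):
    """Per-gene summary across replicates.

    counts: normalized read counts for one gene, one value per replicate.
    Returns None when the mean is zero (nothing to estimate).
    """
    if len(counts) < 2:
        raise ValueError("need at least two replicates")
    m = mean(counts)
    v = variance(counts)  # sample variance
    if m == 0:
        return None
    return {
        "mean": m,
        # About 1 for pure Poisson noise, typically above 1 for real replicates.
        "var_over_mean": v / m,
        # Method-of-moments NB dispersion, clipped at zero.
        "alpha": max(0.0, (v - m) / (m * m)),
    }
```

Replicates whose variance sits well below the mean across many genes are more similar than counting noise alone should allow, which is the kind of thing sloppy copying or rescaling might produce. As said above, none of this would catch a careful faker.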
Can you post an IGV screenshot of what this looks like?
Working on it - it's actually a bit difficult. Low coverage of ribosomal proteins. Low coverage of housekeeping genes ...I'm trying to find an area that visually depicts what I'm describing but I'm actually having trouble because of the issues mentioned above - it's like someone combined 0.001x genome coverage with an RNA spike
Can you clarify what you mean by "100% of reads pass cutadapt, even though 70% of reads contain adapters and get trimmed"? Did you set a minimum post-trimming size threshold? What was your cutadapt command line?
The header of one output (granted only 57.6% here), despite passing -m 20:
Relevant parts of the cutadapt command:
In which case, I'd definitely look at the distribution of read lengths, post trimming, and see if there is a discontinuity in the distribution.
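Something like the following would do for the read-length check. It's a minimal sketch that assumes a plain, unwrapped four-line-per-record FASTQ (decompress first, or pass in lines from `gzip.open(path, "rt")`), and the function names are my own.

```python
from collections import Counter


def length_histogram(fastq_lines):
    """Count reads per sequence length from 4-line FASTQ records."""
    hist = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 1:  # second line of each record is the sequence
            hist[len(line.strip())] += 1
    return hist


def fraction_below(hist, min_len):
    """Fraction of reads strictly shorter than min_len."""
    total = sum(hist.values())
    if total == 0:
        return 0.0
    short = sum(n for length, n in hist.items() if length < min_len)
    return short / total
```

Run it on the trimmed output and print `sorted(hist.items())`. With a large share of reads being trimmed, you'd normally expect a tail of lengths running down toward zero, some of which a minimum length filter would discard. If the histogram instead stops dead at one length and nothing falls below your threshold, that is consistent with the uploaded reads having already been length-filtered before they got to you.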