This plot was generated from an SRA sequence file submitted with a published study. The SRA file was downloaded from the NCBI SRA database and converted to a FASTQ file using the fastq-dump utility, without any further sequence processing. Is it possible that the authors submitted processed data to NCBI?
Thanks for your reply! As far as I know, one should submit raw sequence data (coming directly from the sequencer), not processed data, to NCBI. Am I right? Do you have any information on this? Do I have to inform NCBI?
Yes, one should provide raw data to SRA, but this is far from the first case where someone didn't do that. I would suggest contacting whoever is listed as the submitter for the dataset first. Hopefully they still have the raw data...
Thanks! Please see my reply to Petr. Can this outcome be due to a difference in the sequencing platform? GEO accession number and SRA accession numbers are also mentioned in my response.
I have never seen such a thing: a Phred score of 40 for every call in every read, which implies 1 error per 10,000 calls (99.99% accuracy). This is way too good! As far as I remember, Illumina promised about 75% of calls above Phred 30 across their platforms. The best raw data I have ever seen had just over 90% (maybe 91 or 92%) of calls above Phred 30.
If it was preprocessed or simulated, could you please tell us the purpose of the experiment and the study it was used for? Maybe it makes sense to analyze these reads that way.
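For reference, the arithmetic behind those numbers follows from the standard Phred definition, Q = -10 * log10(P), where P is the probability that a base call is wrong. A minimal sketch (the function name is only for illustration):

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)

# Print the error probability and implied accuracy for a few common thresholds.
for q in (20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: error probability {p:g}, accuracy {100 * (1 - p):.2f}%")
```

So Q30 corresponds to 1 error in 1,000 calls and Q40 to 1 error in 10,000 calls.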
My first guess would be that the data was generated/processed as FASTA, then given artificial quality scores. But my first guess is often just oversimplifying stuff.
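One way to test that guess would be to tally the distinct quality values in the FASTQ file: if every base carries the same score, the qualities were almost certainly not produced by the base caller. A rough sketch, assuming plain 4-line FASTQ records (no wrapped sequence lines) and the Phred+33 encoding; the two toy reads below are made up for illustration and are not taken from the runs discussed here:

```python
import io
from collections import Counter

def quality_score_counts(fastq_lines, offset=33):
    """Tally Phred scores across all reads of a 4-line-per-record FASTQ stream."""
    counts = Counter()
    for i, line in enumerate(fastq_lines):
        if i % 4 == 3:  # the 4th line of each record is the quality string
            counts.update(ord(ch) - offset for ch in line.strip())
    return counts

# Toy input (hypothetical reads): 'I' encodes Phred 40 under Phred+33.
demo = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n")
counts = quality_score_counts(demo)
print(counts)
if len(counts) == 1:
    print("Every base has the same quality score; it looks artificial.")
```

For a real file, the same function could be given an open file handle, for example `open("reads.fastq")` (a placeholder path), instead of the `StringIO` object.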
Nope! This data was neither preprocessed nor simulated. I downloaded this data from NCBI-GEO/SRA database. From the published paper, which reads "Data accession: all the raw and processed data can be accessed under GSE86214 (https://www.ncbi.nlm.nih.gov/geo/).", the data should be raw, and it should not resemble some simulated data.
Here are a few SRA accession numbers yielding such plots:
SRR5099289 - RNA Immunoprecipitation followed by RNAseq (sequenced on HiSeq 4000),
SRR5099278 - regular RNAseq (sequenced on HiSeq 4000), and
SRR5099284 - regular RNAseq (sequenced on HiSeq 4000).
SRR5099272 (sequenced on HiSeq 2000) belongs to the same BioProject submitted by the authors, and it does not produce such a plot.
I am not sure whether this has to do with the Illumina sequencing platform.
It looks incredibly likely that they preprocessed the data before upload.