Project number: PRJNA505380
Example run accession: SRR8244780
Issue: The library layout declared for the run is inconsistent with the data files.
According to the library layout recorded in both ENA and SRA, the runs in BioProject PRJNA505380 should be paired-end data. But some of them have only a single FASTQ file, with no "_1" or "_2" suffix to indicate paired-end data.
I took a closer look at the example data using the following command under Ubuntu 18:
grep @SRR8244780.1000510 SRR8244780.fastq
Some of the output:
@SRR8244780.10005100 10005100/2
@SRR8244780.10005101 10005101/2
@SRR8244780.10005102 10005102/2
@SRR8244780.10005103 10005103/2
@SRR8244780.10005104 10005104/2
@SRR8244780.10005105 10005105/2
@SRR8244780.10005106 10005106/2
@SRR8244780.10005107 10005107/2
@SRR8244780.10005108 10005108/2
@SRR8244780.10005109 10005109/2
@SRR8244780.100051087 100051087/1
@SRR8244780.100051088 100051088/1
@SRR8244780.100051089 100051089/1
@SRR8244780.100051090 100051090/1
@SRR8244780.100051091 100051091/1
@SRR8244780.100051092 100051092/1
@SRR8244780.100051093 100051093/1
@SRR8244780.100051094 100051094/1
@SRR8244780.100051095 100051095/1
@SRR8244780.100051096 100051096/1
@SRR8244780.100051097 100051097/1
Because I can't see the original ID of each read, I can only assume that all the IDs I grepped are unique read IDs. There are no duplicated read IDs like this:
@SRR123456789.123 123/1
@SRR123456789.123 123/2
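The same check can be scripted instead of read off grep output. Below is a minimal Python sketch, not a tool I actually ran for this post. The function name summarize_headers is hypothetical, and it assumes strict four-line FASTQ records whose header lines end in /1 or /2, as in the output above.

```python
from collections import Counter


def summarize_headers(lines):
    """Count /1 and /2 tags and duplicated read IDs in FASTQ lines.

    Assumption: strict 4-line records, so lines 0, 4, 8, ... are headers.
    """
    tags = Counter()
    ids = Counter()
    for i, line in enumerate(lines):
        if i % 4 != 0:
            continue  # skip sequence, '+', and quality lines
        header = line.strip()
        if not header:
            continue  # tolerate a trailing blank line
        read_id = header.split()[0]  # e.g. "@SRR8244780.10005100"
        ids[read_id] += 1
        if header.endswith("/1"):
            tags["/1"] += 1
        elif header.endswith("/2"):
            tags["/2"] += 1
        else:
            tags["untagged"] += 1
    # Number of read IDs that occur more than once
    duplicated = sum(1 for n in ids.values() if n > 1)
    return dict(tags), duplicated
```

Called with an open file handle, for example summarize_headers(open("SRR8244780.fastq")), it returns the tag counts and the number of read IDs seen more than once. It keeps every ID in memory, so on a run of this size it is probably better tried on a subset first.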
Generally, if you have a single-end read with an Illumina identifier, it should look like this:
grep HWI-ST337R:419:C1NFJACXX:2:1101:13942:2686 a.fastq
Output:
@SRRxxxx.xxxx HWI-ST337R:419:C1NFJACXX:2:1101:13942:2686/1
For a single-end FASTQ, you should get only one read ID and no /2 tag (if I'm correct).
Clearly the read IDs in my case carry both /1 and /2 tags. What confuses me is that there are no duplicated IDs, yet the same FASTQ contains paired-end tags. I'm not sure whether this file is a concatenated FASTQ or an interleaved FASTQ. Someone previously used the tree command to separate an interleaved FASTQ into two files. I tried that too, but it was very time-consuming, so I didn't finish it.
My question is: how should I deal with this kind of data? Can I just treat it as single-end data? Or can these data not be used for downstream analysis?
I used fastp for quality assessment of this data, treating it as a single-end FASTQ. The numbers of reads and bases are equal to those reported in ENA.
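For anyone who wants to cross-check the read and base counts against ENA without fastp, here is a minimal Python sketch. It is not how fastp computes its numbers. The function name count_reads_and_bases is hypothetical, and it assumes strict four-line FASTQ records.

```python
def count_reads_and_bases(lines):
    """Return (reads, bases) for FASTQ lines laid out as strict 4-line records."""
    reads = 0
    bases = 0
    for i, line in enumerate(lines):
        if i % 4 == 1:  # the second line of each record is the sequence
            reads += 1
            bases += len(line.strip())
    return reads, bases
```

The two numbers can then be compared with the read count and base count shown for the run in ENA.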
Thank you.
BTW, this is not the first time I have met this problem. I don't understand why ENA and SRA both allow submitters to mistakenly upload this kind of data as "paired-end" without simply checking how many files were uploaded. Not to mention that NCBI SRA does not allow concatenated raw FASTQ files to be uploaded.
I didn't have time to test it yesterday, so I did it today. Unfortunately, the file turned out not to be interleaved.
I checked the SRA Run Browser. Apparently, the run is labeled as "paired-end" but does not contain any paired reads. It is more like a single-end FASTQ.
Another example was discussed here: SRR2969254
The same error and same pattern.
I also checked correctly labeled paired-end and single-end FASTQ files to make sure that I didn't misunderstand or misread anything.
For paired-end reads, it looks like:
In conclusion, if you have the same issue, the first thing to do is to check whether the file is interleaved with
reformat.sh input.fq vint
from BBMap. If it is, you can pass it directly to bwa mem with the -p option (bwa v0.7.11 or higher). Otherwise, it should be treated as single-end reads.
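To make it concrete what "interleaved" means here, below is a minimal Python sketch of such a check. It is not BBMap's implementation and may be stricter or looser than vint. The names split_tag and looks_interleaved are hypothetical, and the sketch assumes strict four-line records with /1 and /2 suffixes on the header lines.

```python
def split_tag(header):
    """Split a header into (name, tag); tag is "1", "2", or None if absent."""
    header = header.strip()
    if header.endswith("/1") or header.endswith("/2"):
        return header[:-2], header[-1]
    return header, None


def looks_interleaved(lines):
    """True if headers arrive as adjacent (/1, /2) pairs with identical names."""
    pending = None  # the /1 header still waiting for its /2 mate
    seen = 0
    for i, line in enumerate(lines):
        if i % 4 != 0 or not line.strip():
            continue  # only header lines matter
        seen += 1
        current = split_tag(line)
        if pending is None:
            if current[1] != "1":
                return False  # a pair must start with the /1 read
            pending = current
        else:
            if current != (pending[0], "2"):
                return False  # the mate must have the same name and a /2 tag
            pending = None
    # Interleaved only if at least one pair was seen and no read is left unpaired
    return seen > 0 and pending is None
```

A file like SRR8244780.fastq, where every header has a different ID, fails at the first pair, which matches the result I got.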
In the meantime, I've sent a message to the ENA service, and they may answer this question as well. I'll update if there is anything new.
The original read IDs are quite different from those in correctly labeled paired-end or single-end FASTQ files: each one gives only a number, such as SRR8244780.1 1, followed by a /2 tag, unlike this:
HWI-ST458R:229:C43DDACXX:5:1101:1421:1920 forward
I don't know the cause and have no time to dig into this data entry error. My personal guess is that the mismatched label somehow discarded the original read IDs and left only a numeric label. This kind of mistake should be avoided. It can cause unnecessary misunderstanding and waste plenty of time! Case closed.