Looking at the older ENCODE RNA-seq FASTQs, it seems that a staff member has pasted the two paired-end FASTQ files together into one and removed all of the pairing identifiers. I inferred this from the FastQC per-base quality chart, which shows low quality from about base 35 to about base 76, then a sudden jump to high quality, then a drop again towards the end. Also, these runs are from 2009, when 152-base single-end reads were not possible on the Genome Analyzer II. There is a second sign of concatenation: the file has 80 million records, and the maximum was about 20 million reads per lane in those days, which implies that several lanes were concatenated as well. Hopefully, the biases are the same across lanes.
For example, consider:
$ zcat GM12878/wgEncodeCshlLongRnaSeqGm12878CellTotalFastqRep1.fastq.gz | head -n 4
@TUPAC:1:1:5:710#0/1
GTGGCGTTCAGCCACNCGAGATTGAGCAATNACNGGTCTGTGATNCNCTTAGATGTCCGGGGCTGCACGAGCGCCAAAAGACGGGGCGGTGTGTACAAAGGGCAGGGACTTAATCAACGCAAGCTTATGACTCGCCATTCATNNNANNNTCN
+TUPAC:1:1:5:710#0/1
a`a_\V\__\Q\aaZDZ`V`^```\^\ZZ[DPYDZM\``^V]]VDYD[__[O]`a^VJTUWY`ZWZBBBBBBBBBBabbbaabbaa``aW_\\WT__]OTZ\[QLWWBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
Does anyone have a script that can reconstitute the two FASTQs as they would originally have been? It would help with trimming the ends, which are of awful quality. I think all that is required is to split every second line (the sequence and quality lines) down the middle, and to change the #0/1 to #0/2 for the second FASTQ.
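The split described above could be sketched per record as follows. This is a minimal sketch, not a tested recovery of the original files: it assumes every record holds exactly two mates of equal length, that the second half is mate 2 exactly as it came off the sequencer (no reverse-complementing), and that identifiers end in /1. The function name `split_record` is made up for illustration.

```python
def split_record(header, seq, plus, qual):
    """Split one concatenated FASTQ record into (mate1, mate2).

    Each mate is a (header, sequence, plus line, quality) tuple.
    Assumption: both mates have the same length, so the split point
    is the exact middle of the sequence and quality strings.
    """
    if len(seq) != len(qual):
        raise ValueError(f"sequence/quality length mismatch in {header}")
    if len(seq) % 2:
        raise ValueError(f"odd read length in {header}, cannot halve")
    half = len(seq) // 2

    def as_mate2(line):
        # Turn a trailing /1 into /2; leave other lines unchanged.
        return line[:-1] + "2" if line.endswith("/1") else line

    mate1 = (header, seq[:half], plus, qual[:half])
    mate2 = (as_mate2(header), seq[half:], as_mate2(plus), qual[half:])
    return mate1, mate2
```

Applied to the record shown above, mate 1 would keep the identifier @TUPAC:1:1:5:710#0/1 and mate 2 would get @TUPAC:1:1:5:710#0/2, each with one half of the sequence and quality strings. Raising on odd or mismatched lengths is a deliberate choice here, so that a record that does not fit the assumption is noticed rather than silently mis-split.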
There are no newer datasets of whole cell, total RNA for the ENCODE Tier 1 cell lines, so I would like to work with these files.
@JC: be cautious: if there are run-time warnings of any kind, they will end up in read_2.fq.
True, but that can be avoided by writing to separate files, with minor code changes.
You can use file handles.
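To illustrate the file-handle suggestion: the sketch below writes each mate to its own explicitly opened handle and sends warnings to a separate log handle (stderr by default), so diagnostics cannot leak into either output file. It is an independent sketch, not the script under discussion; the function name `split_stream` and its parameters are hypothetical, and it makes the same equal-length, /1-suffix assumptions as the question.

```python
import sys
from itertools import islice


def split_stream(src, out1, out2, log=sys.stderr):
    """Split concatenated FASTQ records from src into out1 and out2.

    src, out1, out2 and log are text-mode file handles.
    Returns the number of records written to each output.
    """
    written = 0
    while True:
        # A FASTQ record is four lines; read them as one group.
        record = [line.rstrip("\n") for line in islice(src, 4)]
        if not record:
            break
        if len(record) < 4:
            print(f"warning: truncated record after {written} reads", file=log)
            break
        header, seq, plus, qual = record
        if len(seq) != len(qual) or len(seq) % 2:
            # Warnings go to the log handle, never to the FASTQ outputs.
            print(f"warning: skipping {header}: cannot halve read", file=log)
            continue
        half = len(seq) // 2
        # Assumption: identifiers end in /1; mate 2 gets /2 instead.
        header2 = header[:-1] + "2" if header.endswith("/1") else header
        plus2 = plus[:-1] + "2" if plus.endswith("/1") else plus
        out1.write(f"{header}\n{seq[:half]}\n{plus}\n{qual[:half]}\n")
        out2.write(f"{header2}\n{seq[half:]}\n{plus2}\n{qual[half:]}\n")
        written += 1
    return written
```

In practice the input handle could come from `gzip.open(path, "rt")` and the outputs from `open("read_1.fq", "w")` and `open("read_2.fq", "w")`; read_1.fq is an assumed name chosen to match the read_2.fq mentioned above.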