Hello all,
Hopefully this hasn't been answered before; I tried googling and did not see the answer, but I apologize if this is a duplication. Essentially, I am interested in a script to replace a part of the header of a fastq file with the contents of a different fastq file. Further details below.
My sequence center has supplied me reads from a lane; these reads were demultiplex according to the read's i7 index, which is present in the header of the forward and reverse reads. I also have an i5 index, which was read separately into a corresponding file per read, since it differs per read and is made up of degenerate bases (it is used to detect clones in identically sized reads); this is the way the sequencing center has supplied me data and I would rather not ask them to do things differently on their end as they are already accommodating a few special requests for our project. As such, I have received three fastq files per lane:
Read 1 (Paired read), example (quality scores trimmed for space):
@K00162:210:HLN3HBBXX:2:1101:1326:1332 1:N:0:AGGAACCT
CGTCTACGTATCGGTGCGTACGGCGGCACATCCATTATGATTCGCTTTAATCTCGGTGGTGT
TCGGGCCAGCGGCTTTTTGAAAATGCAGCGGTTTGTGATCCGTTTCCTGTACTACAGGTCG
TCCGAATTTGTAAGAATGAAACTGTGGA
Read 2 (I5 index read for the paired reads)
@K00162:210:HLN3HBBXX:2:1101:1326:1332 2:N:0:AGGAACCT
CTTTAATTG
Read 3 (Paired read)
@K00162:210:HLN3HBBXX:2:1101:1326:1332 3:N:0:AGGAACCT
TATCGTTGTAGAGTTGCCTTGAATTTACTCCTGAGACCAAACATTCTCCTGCGGCTATAAACT
AAGACCAGAAAACCAGAAAGAAGCCCAACCCGCTACGCCCGCACAGCGAGTAATCAAACA
AAAATAAGATGAGTCGTAAACAGCTCCA
My current workflow concatenates the I5 to the read and then later trims it after using the concatenated read for the purposes of decloning; however, a superior solution (in my mind) would be to replace the I7 read in the header of both Read 1 and Read 3 with the I5 read from inside Read 2. I have a few reasons for this, from the software I use being written to expect clone filtering from an index position, to still being able to declone reads if I lose a pair during other quality filtering.
As such, I was curious if anyone knows of an existing script that would enable me to replace the I7 index with my I5 in my 'read 2' file. A minor added complication is that the I7 index occasionally begins with a low quality base call (N) instead of A.
EDIT: Included information that I5 is a set of degenerate bases as per comments below (thanks genomax!)
I have to ask "why" do this sort of thing?
In spite of what you feel it would be heck lot easier for your sequencing folks to re-run
bcl2fastq
and generate the right demultiplexed reads.I may be wrong, but I think it has to do with the other samples multiplexed on the lane unrelated to my project, as well as the fact my I5 reads contain degenerate bases, so demultiplexing off of it would create an enormous number of files (when I just want to know what the index reads as, not demultiplex by it). I'm not familiar with if there is a setting in bcl2fastq that would enable them to record the I5 index and the I7 and only demultiplex by the I7.
edit: and I believe one of my libraries is old enough that the bcl may no longer be available.
That is new and relevant information. I agree that depending on how many unique reads you expect demultiplexing with those can get unwieldy.
This is going to require some sort of custom script unless one of the
awk
gurus/@Pierre have a solution they can think of.