Replacing Illumina index in header of reads
1
0
Entering edit mode
7.0 years ago

Hello all,

Hopefully this hasn't been answered before; I tried googling and did not see the answer, but I apologize if this is a duplication. Essentially, I am interested in a script to replace a part of the header of a fastq file with the contents of a different fastq file. Further details below.

My sequence center has supplied me reads from a lane; these reads were demultiplex according to the read's i7 index, which is present in the header of the forward and reverse reads. I also have an i5 index, which was read separately into a corresponding file per read, since it differs per read and is made up of degenerate bases (it is used to detect clones in identically sized reads); this is the way the sequencing center has supplied me data and I would rather not ask them to do things differently on their end as they are already accommodating a few special requests for our project. As such, I have received three fastq files per lane:

Read 1 (Paired read), example (quality scores trimmed for space):

@K00162:210:HLN3HBBXX:2:1101:1326:1332 1:N:0:AGGAACCT  
CGTCTACGTATCGGTGCGTACGGCGGCACATCCATTATGATTCGCTTTAATCTCGGTGGTGT
TCGGGCCAGCGGCTTTTTGAAAATGCAGCGGTTTGTGATCCGTTTCCTGTACTACAGGTCG
TCCGAATTTGTAAGAATGAAACTGTGGA

Read 2 (I5 index read for the paired reads)

@K00162:210:HLN3HBBXX:2:1101:1326:1332 2:N:0:AGGAACCT 
CTTTAATTG

Read 3 (Paired read)

@K00162:210:HLN3HBBXX:2:1101:1326:1332 3:N:0:AGGAACCT  
TATCGTTGTAGAGTTGCCTTGAATTTACTCCTGAGACCAAACATTCTCCTGCGGCTATAAACT
AAGACCAGAAAACCAGAAAGAAGCCCAACCCGCTACGCCCGCACAGCGAGTAATCAAACA
AAAATAAGATGAGTCGTAAACAGCTCCA

My current workflow concatenates the I5 to the read and then later trims it after using the concatenated read for the purposes of decloning; however, a superior solution (in my mind) would be to replace the I7 read in the header of both Read 1 and Read 3 with the I5 read from inside Read 2. I have a few reasons for this, from the software I use being written to expect clone filtering from an index position, to still being able to declone reads if I lose a pair during other quality filtering.

As such, I was curious if anyone knows of an existing script that would enable me to replace the I7 index with my I5 in my 'read 2' file. A minor added complication is that the I7 index occasionally begins with a low quality base call (N) instead of A.

EDIT: Included information that I5 is a set of degenerate bases as per comments below (thanks genomax!)

next-gen-sequencing paired-end • 3.5k views
ADD COMMENT
0
Entering edit mode

I have to ask "why" do this sort of thing?

In spite of what you feel it would be heck lot easier for your sequencing folks to re-run bcl2fastq and generate the right demultiplexed reads.

ADD REPLY
0
Entering edit mode

I may be wrong, but I think it has to do with the other samples multiplexed on the lane unrelated to my project, as well as the fact my I5 reads contain degenerate bases, so demultiplexing off of it would create an enormous number of files (when I just want to know what the index reads as, not demultiplex by it). I'm not familiar with if there is a setting in bcl2fastq that would enable them to record the I5 index and the I7 and only demultiplex by the I7.

edit: and I believe one of my libraries is old enough that the bcl may no longer be available.

ADD REPLY
0
Entering edit mode

the fact my I5 reads contain degenerate bases

That is new and relevant information. I agree that depending on how many unique reads you expect demultiplexing with those can get unwieldy.

a superior solution (in my mind) would be to replace the I7 read in the header of both Read 1 and Read 3 with the I5 read from inside Read 2

This is going to require some sort of custom script unless one of the awk gurus/@Pierre have a solution they can think of.

ADD REPLY
0
Entering edit mode
2.7 years ago

Probably a little late, but since I recently had a similar problem and your post just came up in Google, I take the liberty to still add an answer:

paste "read1.fq" "index5.fq" "read2.fq" | paste - - - - | awk -F "\t" 'BEGIN{OFS="\n"}; match($1,/@[A-Z0-9+: ]+:/) {print substr( $1, RSTART, RLENGTH )$5,$4,"+",$10 > "integrated1.fq"}; match($3,/@[A-Z0-9+: ]+:/) {print substr( $3, RSTART, RLENGTH )$5,$6,"+",$12 > "integrated2.fq"}

Explanation:

awk processes files line by line, so unless one works with arrays to temporarily store the information, it is advantageous to have all needed information in one long line. The paste commands enable exactly this.

paste "read1.fq" "index5.fq" "read2.fq" 

All three files are read in parallel, such that the first lines of file 2 and 3 are appended to line 1 of file 1. Technically this is now one file with lines that comprise the respective line contents of each of the files together. All three headers, followed by all three sequences etc..

The second paste command

paste  - - - -

concatenates four lines each into one long line, because a FastQ entry consists of four lines each. In case your files do not contain quality scores, this step would thus need to be adapted.

Thus, we now have a long line consisting of 12 formerly singular lines from three files: HeaderRead1 HeaderRead2 HeaderRead3 SequenceRead1 SequenceRead2 SequenceRead3 + + + QualityScoresRead1 QualityScoresRead2 QualityScoresRead3

Each of the fields (formerly lines) can in awk be accessed by their numerical index: HeaderRead1 would be $1, SequenceRead1 would correspond to $4 etc. But before we print selected fields, the command needs to be initialized correctly:

awk -F "\t" 'BEGIN{OFS="\n"}; 

This calls awk and specifies Tab as the separator of the input fields. In contrast, we want OFS, the output field separator, to be newline, such that our entries are nicely split across four lines each again as they have been before.

match($1,/@[A-Z0-9+: ]+:/) {print substr( $1, RSTART, RLENGTH )$5,$4,"+",$10 > "integrated1.fq"}

That part is where "the magic" happens. On field $1, a greedy REGEX /@[A-Z0-9+: ]+:/ is matched that will catch all of the header up to the barcode, e.g. @K00162:210:HLN3HBBXX:2:1101:1326:1332 3:N:0:. The match operation automatically sets the internal variables RSTART and RLENGTH, which are used to cut out the relevant part of the header: substr( $1, RSTART, RLENGTH ). To this the read number "/1" is added as well as $5, which contains SequenceRead2 respectively your index. The comma separates fields and thus instantiates a new line, to which SequenceRead1 / $4 is written. In another new line a + is written and then the quality scores for Read1. All of this is piped into a file called integrated1.fq, which is your output file with the I5 index in the header.

The rest of the command is comparable, just different field numbers to output the correct information for Read3 respectively Read2.

I hope this will be still be useful - if not for the original poster than at least for people who happen to find this post still via Google ;-)

ADD COMMENT
0
Entering edit mode

Can you add an example of what the change looks like? While you provided a detailed explanation it would be good to see what the actual change will look like for headers.

ADD REPLY
0
Entering edit mode

Sure. The resulting headers would be @K00162:210:HLN3HBBXX:2:1101:1326:1332 1:N:0:CTTTAATTG and @K00162:210:HLN3HBBXX:2:1101:1326:1332 3:N:0:CTTTAATTG when using the regex /@[A-Z0-9+: ]+:/.

If the embedding should look like @K00162:210:HLN3HBBXX:2:1101:1326:1332 CTTTAATTG and @K00162:210:HLN3HBBXX:2:1101:1326:1332 CTTTAATTG the regex /@[A-Z0-9:]+/ will just select the first part of the header up to the gap.

Many other variations are possible of course. A substr( $1, RSTART, RLENGTH )":UMI_"$5 would produce @K00162:210:HLN3HBBXX:2:1101:1326:1332 1:N:0:UMI_CTTTAATTG respectively @K00162:210:HLN3HBBXX:2:1101:1326:1332 UMI_CTTTAATTG

ADD REPLY

Login before adding your answer.

Traffic: 2779 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6