I recently received a collection of paired-end FASTQ files (WES) from our collaborators, and I am following the GATK Best Practices workflow. I have completed the alignment, sorting and indexing steps and now have a set of BAM files. On closer inspection, however, I found that the BAM files lack the RG (read group) tag that assigns each read to a read group. I have found several resources online that discuss this issue and explain how to add the information manually. All I have, though, is the FASTQ files, so I want to derive the read groups from the header information myself. This is what the headers look like:
Sample 1 - Read 1 - first 3 headers:
@NB501115:23:H3MJFBGX2:1:11101:3645:1046 1:N:0:CCGTGAGA
@NB501115:23:H3MJFBGX2:1:11101:16971:1046 1:N:0:CCGTGAGA
@NB501115:23:H3MJFBGX2:1:11101:7432:1048 1:N:0:CCGTGAGA
Sample 1 - Read 2 - first 3 headers:
@NB501115:23:H3MJFBGX2:1:11101:3645:1046 2:N:0:CCGTGAGA
@NB501115:23:H3MJFBGX2:1:11101:16971:1046 2:N:0:CCGTGAGA
@NB501115:23:H3MJFBGX2:1:11101:7432:1048 2:N:0:CCGTGAGA
I did some digging and found that this is the typical header format produced by Illumina's CASAVA 1.8. The components break down as follows:
NB501115 - the unique instrument name
23 - run id
H3MJFBGX2 - flowcell id
1 - flowcell lane
11101 - tile number within the flowcell lane
3645 - x-coordinate of the cluster within the tile
1046 - y-coordinate of the cluster within the tile
1 - the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
N - Y if the read is filtered (did not pass), N otherwise
0 - 0 when none of the control bits are on, otherwise it is an even number
CCGTGAGA - index sequence
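To make the breakdown above concrete, here is a minimal Python sketch that splits one such header into named fields. The function name `parse_casava_header` and the dictionary keys are my own illustrative choices, not part of any Illumina or GATK tool, and the sketch assumes the header has exactly the eleven fields listed above.

```python
def parse_casava_header(header):
    """Split a CASAVA 1.8-style FASTQ header into named fields.

    Hypothetical helper: assumes seven colon-separated fields before
    the space and four after it, as in the headers shown above.
    """
    # The read name and the comment are separated by a single space.
    name, comment = header.lstrip("@").strip().split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = comment.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "read": int(read),                 # 1 or 2 for paired-end data
        "is_filtered": filtered == "Y",    # Y means the read did not pass
        "control_bits": int(control),
        "index": index,
    }


fields = parse_casava_header(
    "@NB501115:23:H3MJFBGX2:1:11101:3645:1046 1:N:0:CCGTGAGA"
)
print(fields["flowcell"], fields["lane"], fields["index"])
# prints: H3MJFBGX2 1 CCGTGAGA
```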
I am now following this solution to extract read group information from the FASTQ headers. The problem is that I cannot figure out what values to use for the SM, ID and PU tags. The parts of the read name that vary from read to read come after the flowcell lane, separated by colons: the tile number and the x- and y-coordinates of the cluster. Should I use those to construct the ID tag? The SM information can be extracted from the file names. I am also not sure whether PU is mandatory.
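For illustration only, here is a sketch of one convention I have seen described in GATK's read group documentation, where the read group is built from the flowcell, lane and sample barcode. A read group describes a set of reads rather than a single read, so the tile and x/y coordinates are not used here. Everything in the sketch is an assumption rather than a confirmed answer: the function name `build_read_group` is hypothetical, the sample name is prefixed to ID so that IDs stay unique if several samples shared a lane, and LB defaults to the sample name on the assumption of one library per sample. As far as I can tell, the SAM specification requires only ID within an `@RG` line, but individual tools may insist on more fields.

```python
def build_read_group(header, sample, library=None, platform="ILLUMINA"):
    """Build an @RG string from one CASAVA 1.8-style FASTQ header.

    Hypothetical helper. The tag layout below is an assumed convention:
      ID = sample.flowcell.lane
      PU = flowcell.lane.barcode
    """
    name, comment = header.lstrip("@").strip().split(" ", 1)
    name_fields = name.split(":")
    flowcell, lane = name_fields[2], name_fields[3]
    barcode = comment.split(":")[3]

    tags = {
        "ID": f"{sample}.{flowcell}.{lane}",
        "PU": f"{flowcell}.{lane}.{barcode}",
        "SM": sample,
        "PL": platform,
        "LB": library or sample,  # assumption: one library per sample
    }
    # Emit a literal backslash-t between tags, the form that
    # `bwa mem -R` accepts on the command line.
    return "@RG" + "".join(f"\\t{key}:{value}" for key, value in tags.items())


rg_line = build_read_group(
    "@NB501115:23:H3MJFBGX2:1:11101:3645:1046 1:N:0:CCGTGAGA",
    sample="Sample1",  # in practice taken from the file name
)
print(rg_line)
```

The resulting string could then be passed to the aligner at mapping time, or the individual values could be given to a tool that adds read groups to an existing BAM file.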