Confusion regarding manual inclusion of read group information from fastq files
3.0 years ago
Gene_MMP8 ▴ 240

I have recently received a collection of paired-end FASTQ files (WES) from our collaborators. I am following the GATK best practices workflow. I have completed the alignment, sorting, and indexing steps and generated a set of BAM files. However, on closer inspection, I found that the BAM files do not have the RG (read group) tag that assigns each read to a read group. I have found several resources online that discuss this issue and how to add the information manually. All I have is the FASTQ files, so I want to use the header information to assign the read groups myself. This is what the headers look like:

Sample 1 - Read 1 - First 3 headers :

@NB501115:23:H3MJFBGX2:1:11101:3645:1046 1:N:0:CCGTGAGA  
@NB501115:23:H3MJFBGX2:1:11101:16971:1046 1:N:0:CCGTGAGA
@NB501115:23:H3MJFBGX2:1:11101:7432:1048 1:N:0:CCGTGAGA  

Sample 1 - Read 2 - First 3 headers :

@NB501115:23:H3MJFBGX2:1:11101:3645:1046 2:N:0:CCGTGAGA
@NB501115:23:H3MJFBGX2:1:11101:16971:1046 2:N:0:CCGTGAGA
@NB501115:23:H3MJFBGX2:1:11101:7432:1048 2:N:0:CCGTGAGA

I did some digging and found out that this header is a typical output from Illumina's Casava 1.8 and this is the breakdown of the components of the header.

NB501115 - the unique instrument name  
23 - run id  
H3MJFBGX2 - flowcell id  
1 - flowcell lane  
11101 - tile number within the flowcell lane  
3645 - x-coordinate of the cluster within the tile  
1046 - y-coordinate of the cluster within the tile  
1 - the member of a pair, 1 or 2 (paired-end or mate-pair reads only)  
N - Y if the read is filtered (did not pass), N otherwise  
0 - 0 when none of the control bits are on, otherwise it is an even number  
CCGTGAGA - index sequence  
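
The breakdown above can be sketched in code. This is a minimal illustration, assuming every header follows the Casava 1.8 layout shown in this post (seven colon-separated fields, a space, then four colon-separated fields); the class and function names are hypothetical and not part of any library.

```python
from typing import NamedTuple


class FastqHeader(NamedTuple):
    """Fields of a Casava 1.8 style FASTQ header (hypothetical helper type)."""
    instrument: str
    run_id: str
    flowcell: str
    lane: int
    tile: int
    x: int
    y: int
    read: int            # 1 or 2 for paired-end data
    is_filtered: bool    # True when the flag is "Y" (read did not pass)
    control_bits: int
    index: str


def parse_casava_header(line: str) -> FastqHeader:
    """Split one header line into its named components.

    Assumes the layout shown in the post; raises ValueError otherwise.
    """
    # Drop the leading "@" and surrounding whitespace, then split on the
    # single space between the read name and the comment part.
    name, comment = line.strip().lstrip("@").split(" ", 1)
    name_fields = name.split(":")
    comment_fields = comment.strip().split(":")
    if len(name_fields) != 7 or len(comment_fields) != 4:
        raise ValueError(f"unexpected header layout: {line!r}")
    instrument, run_id, flowcell, lane, tile, x, y = name_fields
    read, filtered, control, index = comment_fields
    return FastqHeader(
        instrument, run_id, flowcell, int(lane), int(tile), int(x), int(y),
        int(read), filtered == "Y", int(control), index,
    )


header = parse_casava_header("@NB501115:23:H3MJFBGX2:1:11101:3645:1046 1:N:0:CCGTGAGA")
print(header.flowcell, header.lane, header.index)
```

Only the instrument, run id, flowcell id, lane, and index are shared by all reads from the same lane of a run; tile and x/y coordinates differ from read to read.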

I am now following this solution to extract read group information from the fastq headers. The problem is that I am unable to figure out what the SM, ID, and PU tags should be. The parts of the read name that differ between reads come after the flowcell lane, separated by colons: tile number, x-coordinate of the cluster, and y-coordinate of the cluster. Should I use those to construct the ID tag? SM information can be extracted from the file names. I am not sure if PU is mandatory.
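
For illustration, here is a minimal sketch of one way to build a read group string from the first header of a FASTQ file. It assumes a commonly used convention (not confirmed anywhere in this post) in which ID is `flowcell.lane` and PU is `flowcell.lane.index`, because those fields are constant across all reads from one lane, whereas tile and x/y coordinates change per read. The function name, the sample name `sample1`, and the choice of `PL:ILLUMINA` are assumptions made for the example.

```python
from typing import Optional


def read_group_from_header(header_line: str, sample: str,
                           library: Optional[str] = None) -> str:
    """Build an @RG string from one Casava 1.8 FASTQ header (hypothetical helper).

    Assumed convention: ID = flowcell.lane, PU = flowcell.lane.index.
    Fields are joined with a literal backslash-t, the form that
    `bwa mem -R` accepts on the command line.
    """
    name, comment = header_line.strip().lstrip("@").split(" ", 1)
    name_fields = name.split(":")
    flowcell, lane = name_fields[2], name_fields[3]
    index = comment.strip().split(":")[3]

    tags = [
        f"ID:{flowcell}.{lane}",          # constant for every read in the lane
        f"SM:{sample}",                   # taken from the file name by the caller
        "PL:ILLUMINA",                    # assumption: Illumina platform
        f"PU:{flowcell}.{lane}.{index}",  # platform unit
    ]
    if library is not None:
        tags.append(f"LB:{library}")
    return "@RG\\t" + "\\t".join(tags)


rg = read_group_from_header(
    "@NB501115:23:H3MJFBGX2:1:11101:3645:1046 1:N:0:CCGTGAGA", sample="sample1"
)
print(rg)
```

The resulting string could then be passed at alignment time, for example `bwa mem -R "$RG" ref.fa R1.fastq.gz R2.fastq.gz`, where the file names are placeholders.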
