Question

Obtaining read group information from the Fastq file

0

8.7 years ago

sktbanerjee1 ▴ 30

Hello everyone, I am new to computational biology and am working with a few paired-end FASTQ files, with the aim of prioritizing genomic variants, but I am finding it very hard to understand how to get the read group information from the FASTQ header. Here are two FASTQ headers from a paired-end sample (whole-exome sequencing, Illumina):

@SN963:294:C847FACXX:1:1106:1077:2087 1:N:0:AGGCAGAA (file name: DYP26_blood_S3_L001_R1_001.fastq)
@SN963:294:C847FACXX:1:1106:1077:2087 2:N:0:AGGCAGAA (file name: DYP26_blood_S3_L001_R2_001.fastq)

It would be really great if anyone could explain how to obtain the read group information.

fastq readgroup • 15k views

ADD COMMENT • link updated 4.5 years ago by GenoMax 154k • written 8.7 years ago by sktbanerjee1 ▴ 30

0

Actually, after QC I aligned the reads using BWA-MEM. Now I want to call variants using GATK HaplotypeCaller, but before that I need to recalibrate the base quality scores using GATK BQSR. When I try to do that, I get the error "ERROR: ReadGroup information in the BAM header is not present". I think I need the read group information to resolve this issue. If you can tell me how to obtain read group information for this purpose, it would be really helpful.

ADD REPLY • link 8.7 years ago by sktbanerjee1 ▴ 30

0

I see. So that is a different issue from the one you posted in the original question.

Take a look at this thread for solutions that use Picard to add read group information to your BAM files: "GATK, SAM file doesn't have any read groups defined in the header"

ADD REPLY • link 8.7 years ago by GenoMax 154k

0

Thanks for the help. I was wondering whether including the read group information in the bwa mem step would fix this. If yes, how do I find out the read group information?

ADD REPLY • link 8.7 years ago by sktbanerjee1 ▴ 30

0

It would. But you can also add that information to the existing BAM files. Ask the people you are analyzing the data for to get the relevant details you need to include in the read groups. If no real information is available, you could use dummy fields, as indicated in the thread above.
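
For anyone who wants to set the read group at the alignment step, here is a minimal Python sketch that builds an @RG line from the first FASTQ header. It assumes the ID is "flowcell.lane" and the PU is "flowcell.lane.barcode", which is a commonly used convention rather than something the FASTQ file dictates. The function name `read_group_from_header`, the sample name `DYP26_blood` (guessed from the file name) and the library name `lib1` are placeholders; the real sample and library names should come from whoever prepared the libraries.

```python
def read_group_from_header(header: str, sample: str, library: str,
                           platform: str = "ILLUMINA") -> str:
    """Build an @RG line from one Illumina (Casava 1.8+ style) FASTQ header.

    Assumed convention: ID = flowcell.lane, PU = flowcell.lane.barcode.
    Sample (SM) and library (LB) are NOT in the header and must be supplied.
    """
    read_id, _, comment = header.strip().lstrip("@").partition(" ")
    fields = read_id.split(":")
    if len(fields) < 4:
        raise ValueError(f"unexpected read ID layout: {read_id!r}")
    flowcell, lane = fields[2], fields[3]
    # The last field of the comment is the index sequence, if present.
    barcode = comment.split(":")[-1] if comment else ""
    rg_id = f"{flowcell}.{lane}"
    unit = f"{rg_id}.{barcode}" if barcode else rg_id
    tags = ["@RG", f"ID:{rg_id}", f"SM:{sample}", f"LB:{library}",
            f"PL:{platform}", f"PU:{unit}"]
    # Join with a literal backslash-t: bwa mem -R converts "\t" to TAB itself.
    return "\\t".join(tags)


header = "@SN963:294:C847FACXX:1:1106:1077:2087 1:N:0:AGGCAGAA"
# Placeholder sample and library names, for illustration only.
print(read_group_from_header(header, sample="DYP26_blood", library="lib1"))
```

The printed string can be passed, single-quoted, as the argument of `bwa mem -R` together with the reference and the two FASTQ files. The same ID, SM, LB, PL and PU values are what Picard's AddOrReplaceReadGroups asks for if you prefer to fix the existing BAM files instead.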

ADD REPLY • link 8.7 years ago by GenoMax 154k

0

Thanks a lot for your replies.

ADD REPLY • link 8.6 years ago by sktbanerjee1 ▴ 30

score 2 · Answer 1 · 2017-03-20

2

8.7 years ago

GenoMax 154k

There is no read group information in the FASTQ header (if you are thinking of SAM-format read groups).

Edit: To be clear, read group information can be partially constructed from information present in FASTQ headers. It is not natively present in the format expected in SAM/BAM files.

Illumina FASTQ headers are explained in this Wikipedia entry.
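
As a rough illustration of what those header fields are, here is a small Python sketch that splits a Casava 1.8+ style header into named parts. The class and function names (`IlluminaHeader`, `parse_illumina_header`) are made up for this example, and the field names follow the usual description of the format; older Illumina header layouts are not handled.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: str
    flowcell: str
    lane: str
    tile: str
    x: str
    y: str
    read: str            # 1 or 2 for paired-end data
    is_filtered: str     # Y or N
    control_number: str
    index: str           # index (barcode) sequence or sample number


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Split '@inst:run:flowcell:lane:tile:x:y read:filtered:control:index'."""
    read_id, _, comment = line.strip().lstrip("@").partition(" ")
    id_fields = read_id.split(":")
    comment_fields = comment.split(":")
    if len(id_fields) != 7 or len(comment_fields) != 4:
        raise ValueError(f"not a Casava 1.8+ style header: {line!r}")
    return IlluminaHeader(*id_fields, *comment_fields)


h = parse_illumina_header("@SN963:294:C847FACXX:1:1106:1077:2087 1:N:0:AGGCAGAA")
print(h.flowcell, h.lane, h.index)
```

Flowcell, lane and index are the pieces that can feed a read group; the sample and library names are not in the header at all and have to come from elsewhere.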

ADD COMMENT • link 4.5 years ago by GenoMax 154k

1

I disagree. There is flowcell and lane information in every read ID.

@A00152:398:H32M5DSX2:1:1101:1136:1016 1:N:0:CTAATAACCG+CGATGCGGTT
NTGATAAAGGGAATATCTTCCCCTACAAGCTAGAAAGAAGCATTCTGTGAAACTTGTTTGTGATGTGTGTACTCAACTAACAGAGTTGAACCTTTCTTTTTACAGAGCAGTTTTGAAACACTCTTTTTGTAGAATCTGCGAGGGGATATTT

The read came from flowcell H32M5DSX2, lane 1.

Our sequencing service provider concatenated all eight lanes for each sample before sending the data to us. I wish bwa mem had some special values to tell it to extract the flowcell and lane automatically from the 3rd and 4th colon-separated fields of every read ID in the FASTQ file.
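
In the meantime, a minimal Python sketch for finding out which flowcell/lane combinations are present in a concatenated FASTQ file could look like this. It assumes plain four-line FASTQ records (no wrapped sequence lines) and Casava 1.8+ style read IDs; `count_lanes` is a name made up for this example and is not part of bwa or any other tool.

```python
from collections import Counter
from itertools import islice
from typing import Iterable


def count_lanes(fastq_lines: Iterable[str]) -> Counter:
    """Count reads per (flowcell, lane), assuming 4-line FASTQ records."""
    counts: Counter = Counter()
    # Every 4th line, starting with the first, is a header line.
    for header in islice(fastq_lines, 0, None, 4):
        fields = header.strip().lstrip("@").split(" ", 1)[0].split(":")
        if len(fields) < 4:
            raise ValueError(f"unexpected read ID layout: {header!r}")
        counts[(fields[2], fields[3])] += 1
    return counts


# Tiny in-memory example; for a real file, pass an open text handle
# instead, e.g. gzip.open(path, "rt") for a gzipped FASTQ.
example = [
    "@A00152:398:H32M5DSX2:1:1101:1136:1016 1:N:0:CTAATAACCG+CGATGCGGTT",
    "ACGT", "+", "FFFF",
    "@A00152:398:H32M5DSX2:2:1101:1136:1016 1:N:0:CTAATAACCG+CGATGCGGTT",
    "ACGT", "+", "FFFF",
]
for (flowcell, lane), n in sorted(count_lanes(example).items()):
    print(flowcell, lane, n)
```

Once the flowcell/lane pairs are known, one option is to split the reads per lane and align each subset with its own read group.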

ADD REPLY • link 4.5 years ago by dario.garvan ▴ 520

2

All I am saying above is that there is no read group information (in the format expected in SAM files) in FASTQ headers. Can it be constructed from FASTQ headers? Certainly. I have added a line to clarify.

ADD REPLY • link 4.5 years ago by GenoMax 154k