Hello everyone, I am new to computational biology and I am working with a few paired-end FASTQ files, with the aim of prioritizing genomic variants, but I am finding it very hard to understand how to get the read group information from the FASTQ header. Here are the FASTQ headers from the two paired-end files of one sample (whole-exome sequencing, Illumina):
@SN963:294:C847FACXX:1:1106:1077:2087 1:N:0:AGGCAGAA (file name: DYP26_blood_S3_L001_R1_001.fastq)
@SN963:294:C847FACXX:1:1106:1077:2087 2:N:0:AGGCAGAA (file name: DYP26_blood_S3_L001_R2_001.fastq)
It would be really great if anyone could explain to me how to obtain the read group information.
Actually, after QC I aligned the reads using BWA-MEM. Now I want to call variants using GATK HaplotypeCaller, but before that I need to recalibrate the base quality scores using GATK BQSR. When I try to do that, I get the error "ERROR: ReadGroup information in the BAM header is not present". I think I need the read group information to resolve this issue. If you can tell me how to obtain the read group information for this purpose, it would be really helpful.
I see. So that is a different issue from the one you posted in your original question.
Take a look at this thread for solutions that use Picard to add read group information to your BAM files: GATK, SAM file doesn't have any read groups defined in the header
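For reference, here is a minimal sketch of what a Picard AddOrReplaceReadGroups call could look like, written as a small Python helper that only builds and prints the command. The BAM file names, the `picard.jar` path, the `lib1` library name, and the helper name `picard_add_read_groups` are placeholders I made up for illustration; the ID and PU values assume the common flowcell.lane and flowcell.lane.barcode convention, and the sample name is assumed from the FASTQ file name.

```python
import shlex

# Hypothetical read-group values; replace them with the real ones for your sample.
read_group = {
    "RGID": "C847FACXX.1",           # assumed convention: flowcell.lane
    "RGLB": "lib1",                  # placeholder library name
    "RGPL": "ILLUMINA",              # sequencing platform
    "RGPU": "C847FACXX.1.AGGCAGAA",  # assumed convention: flowcell.lane.barcode
    "RGSM": "DYP26_blood",           # sample name, assumed from the file name
}


def picard_add_read_groups(in_bam, out_bam, rg, picard_jar="picard.jar"):
    """Build the argument list for Picard AddOrReplaceReadGroups (does not run it)."""
    cmd = ["java", "-jar", picard_jar, "AddOrReplaceReadGroups",
           f"I={in_bam}", f"O={out_bam}"]
    cmd += [f"{key}={value}" for key, value in rg.items()]
    return cmd


# Placeholder BAM names; pass cmd to subprocess.run(cmd, check=True) to execute it.
cmd = picard_add_read_groups("DYP26_blood.bam", "DYP26_blood.rg.bam", read_group)
print(shlex.join(cmd))
```

The output BAM should then carry an @RG line in its header, which is what BQSR is complaining about.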
Thanks for the help. I was wondering whether including the read group information in the bwa-mem step would fix this? If yes, how do I find out the read group information?
It would. But you can also add that information to the existing BAM files. Ask the people you are analyzing the data for to give you the relevant details to include in the read groups. If no real information is available, you could use dummy fields as indicated in the thread above.
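If it helps, part of the read group can be recovered from the FASTQ header itself. Illumina headers in this style are laid out as instrument:run:flowcell:lane:tile:x:y followed by read:filtered:control:index. Below is a rough sketch, assuming that layout, that pulls out the flowcell, lane and index and assembles an @RG string suitable for the `-R` option of `bwa mem`. The function names, the `lib1` library name, and the choice of ID = flowcell.lane and PU = flowcell.lane.index are my assumptions; the sample name is taken from the file name, and the library in particular has to come from whoever prepared the samples.

```python
def parse_illumina_header(header):
    """Split an Illumina (Casava 1.8+ style) FASTQ header into named fields."""
    name, _, comment = header.lstrip("@").partition(" ")
    instrument, run, flowcell, lane, tile, x, y = name.split(":")
    read, filtered, control, index = comment.split(":")
    return {
        "instrument": instrument, "run": run, "flowcell": flowcell,
        "lane": lane, "read": read, "index": index,
    }


def read_group_line(header, sample, library="lib1", platform="ILLUMINA"):
    """Assemble an @RG string with literal '\\t' separators, as bwa mem -R expects."""
    fields = parse_illumina_header(header)
    rg_id = f"{fields['flowcell']}.{fields['lane']}"   # assumed: flowcell.lane
    unit = f"{rg_id}.{fields['index']}"                # assumed: flowcell.lane.index
    parts = ["@RG", f"ID:{rg_id}", f"PU:{unit}", f"SM:{sample}",
             f"LB:{library}", f"PL:{platform}"]
    return "\\t".join(parts)


header = "@SN963:294:C847FACXX:1:1106:1077:2087 1:N:0:AGGCAGAA"
rg = read_group_line(header, sample="DYP26_blood")
print(rg)
# prints: @RG\tID:C847FACXX.1\tPU:C847FACXX.1.AGGCAGAA\tSM:DYP26_blood\tLB:lib1\tPL:ILLUMINA
```

The printed string can be passed in single quotes to `bwa mem -R` when aligning, so the resulting BAM already has the read group and no separate Picard step is needed.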
Thanks a lot for your replies.