I have been trying to perform variant calling from the FASTQ files available from IGSR (https://www.internationalgenome.org/data-portal/data-collection/grch38) but am having problems at the BaseRecalibration stage. I am working on WSL2 and aligned reads using bowtie2.
My command is of the form:
gatk BaseRecalibrator -I HG00102_hGRCH38_exome_aignment_sorted.bam -R GRCh38_latest_genomic.fna --known-sites GRCh38.variant_call.all.vcf -O recal_data.table
This gives the following error:
"A USER ERROR has occurred: Number of read groups must be >= 1, but is 0"
The problem is clear - I don't have any read groups. I confess I didn't know what read groups were but GATK was helpful (https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups):
"There is no formal definition of what a 'read group' is, however in practice this term refers to a set of reads that are generated from a single run of a sequencing instrument.
In the simple case where a single library preparation derived from a single biological sample was run on a single lane of a flow cell, all the reads from that lane run belong to the same read group. When multiplexing is involved, then each subset of reads originating from a separate library run on that lane will constitute a separate read group."
I checked if read groups were present in my aligned files by inspecting the file and via:
samtools view HG00102_hGRCH38_exome_aignment_sorted.bam | grep '^@RG'
This doesn't return any hits - so presumably I have failed to specify some instruction which passes data from the FASTQ file to the alignment file.
The information in the FASTQ file must be in the read identifier. The first read was inspected via:
head -n 4 SRR077485_1.fastq
This gives the following:
@SRR077485.1 HWI-EAS110_105238883:7:1:3380:1139/1
GTTGTCTAATGTCTGNTGGTTAAATCTTCTATCATCCCTGTCCTGCCTGGCTCATCAGGAATCTGNAGGAGTCTGANGAGGAGGAANTCCCCCAGGAGTC
+
FBBFADGDGG@7@>@!AC?=?==<=HHH<DHHHGHGDHHHHHHHHHHHHHHHDGHEGGGE=AAAB!=@@:@:??@?!@?:=8==;7!37:>?>9??5>5;
A supplemental file (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/1000genomes.sequence.index) shows that this data was obtained via an Illumina Genome Analyzer II. I think 'HWI-EAS110_105238883' corresponds to the name of the sequencer. The '/1' at the end corresponds to pair 1 of 2 (paired end reads). This leaves 4 numbers inbetween which I assume correspond to:
- The flowcell lane = 7
- Tile number within the flowcell lane = 1
- x coordinate of the cluster within the tile = 3380
- y coordinate of the cluster within the tile = 1139
Given the definition by GATK above, I think the read group for this read should be '7' (it could be allocated any number as long as all the other reads from lane 7 have the same number, but 7 makes sense). If we look at the end of the file the last read ID is '@SRR077485.22674341 HWI-EAS110_105238883:7:120:19772:21467/1'. This also seems to be from lane 7 so I hope it can be assumed that all the reads from this individual can be allocated to the same read group for the purposes of base quality recalibration. A short script using awk and cut could be used to check this but I haven't writen anything yet.
Assuming the above is sound, it appears the read groups could either be added when performing alignment (maybe for next time) or post hoc with a tool from Piccard - AddOrReplaceReadGroups (http://broadinstitute.github.io/picard/command-line-overview.html#AddOrReplaceReadGroups). From the documentation, I have copied the structure of the command and added values I hope are appropriate.
java -jar picard.jar AddOrReplaceReadGroups \
I=HG00102_hGRCH38_exome_aignment_sorted.bam \
O=HG00102_hGRCH38_exome_aignment_sorted_rg.bam \
RGID=7 \ (the lane as discussed above)
RGLB=2864119778 \ (obtained from https://www.ncbi.nlm.nih.gov/sra/?term=SRR077485)
RGPL=ILLUMINA \
RGPU=105238883.7.2864119778 \ (https://angus.readthedocs.io/en/2017/Read_group_info.html seems to mention a general form of {FLOWCELL_BARCODE}.{LANE}.{library-specific identifier})
RGSM=HG00102 (obtained from https://www.ncbi.nlm.nih.gov/sra/?term=SRR077485)
I'd appreciate any feedback on whether this is a reasonable approach.
In particular I am not confident in the value I have chosen for the RGPU. I believe 105238883 may represent the flowcell barcode (given the read identifier @SRR077485.1 HWI-EAS110_105238883:7:1:3380:1139/1) but I can't find this written down anywhere.
It seems as though I could have specified the read groups at the time of alignment using bowtie2 with the argument --rg-id <text> (http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml) but here only the ID is required in contrast to all the other information I need to provide to use the Piccard tool. Is this the approach others would take?
Thanks for your time!
Angus