I'm trying to use the Picard function AddOrReplaceReadGroups to format headers of bam files for GATK.
I'm unclear what I need to use as the RGLB and RGPU input arguments (read group library and platform unit). I know the latter is the sequence barcode. My question is how to determine RGLB and RGPU from the fastq file.
The fastq files I obtained from NCBI SRA have the following header lines, from which I can't seem to determine the read group library or barcode, e.g.
@SRR8439151.1.1 1 length=150
NGCTGAGGTAATAATTACACACAACACATCGGCAGTATGCTCAAAAGCTGTTTAGGCAAAATTATACGAATTTGCATATT
CAATTGAACCGAACACATAGGCTCGGCAATGAATAACGCATGGATGAGCTTATTTCTGCAATTAAAAGTT
+SRR8439151.1.1 1 length=150
#AAAFFJJJJJJJJJJJJJJJJJJJJJJJJJJJFFFFJJJFJJF-<F--<F-FFJFJFFJJA7F-JJJJJAAJJJFA-AJ
JJAJFJFJJJFFJJJJFFA7JJ7-7AJ<<AJAFJJJFFFJ-AJ<AFFJJAF-77A<<FFFA-A<-A7<--
@SRR8439151.1.2 2 length=150
Will GATK (specifically IndelRealigner) run successfully if I just provide "placeholder" strings for RGLB and RGPU?
Yes. The strings themselves don't matter as long as they correctly distinguish the data, i.e. don't use the same RGLB placeholder for data that came from different libraries, and likewise for the other read group fields.
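As a rough illustration of that, here is a minimal Python sketch that derives placeholder read group values from the SRA run accession, so that different runs and samples automatically get different strings. The helper names (`read_group_args`, `picard_command`), the `lib_` prefix, the `.unit1` suffix, the sample name, the `picard.jar` path and the `ILLUMINA` platform value are all assumptions made for the example, not anything recovered from the fastq headers. It uses Picard's legacy `KEY=value` argument style.

```python
def read_group_args(run_accession, sample, library=None, platform="ILLUMINA"):
    """Build placeholder read group arguments for Picard AddOrReplaceReadGroups.

    All values are placeholders: they only need to differ between
    runs, libraries and samples that are genuinely different.
    """
    # Assumption: one library per sample unless told otherwise.
    library = library or f"lib_{sample}"
    return [
        f"RGID={run_accession}",         # unique per run
        f"RGLB={library}",               # same only for reads from the same library
        f"RGPL={platform}",              # assumed platform, adjust if needed
        f"RGPU={run_accession}.unit1",   # placeholder platform unit
        f"RGSM={sample}",                # sample name
    ]


def picard_command(in_bam, out_bam, run_accession, sample, library=None):
    """Assemble the full command line as a list of arguments."""
    return [
        "java", "-jar", "picard.jar", "AddOrReplaceReadGroups",
        f"I={in_bam}", f"O={out_bam}",
    ] + read_group_args(run_accession, sample, library)


if __name__ == "__main__":
    # Hypothetical file and sample names, for illustration only.
    cmd = picard_command("SRR8439151.bam", "SRR8439151.rg.bam",
                         "SRR8439151", "sampleA")
    print(" ".join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to `subprocess.run` once the paths point at real files.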
Note however that local realignment around indels is no longer necessary if you're going to use HaplotypeCaller or Mutect2 to do your variant calling.
You can get more info and answers about GATK-specific questions from the GATK team themselves on their support forum: https://gatk.broadinstitute.org/hc/en-us/community/topics