I am looking at the read names in an archival BAM from the sequencing provider. They provide machine/run/lane info, but read group info isn't written in the file.
What are my options if I want to extract the FASTQ from the BAM and realign with BWA, annotating the RG info so that it is used in downstream GATK calling? (Kind of a reversal of the process, to simulate per-lane FASTQ output for alignment, then per-lane dedup and so on.)
I have actually already mapped the sample (with reads from different lanes and possibly different runs) to a reference, but I have 37 other samples, so it would be less painful if I got it 'right' at the start, e.g. maybe a Perl script to separate the FASTQ reads by run/lane and then deal with each lane's BAM separately.
Cheers!
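A minimal sketch of the per-lane split described above, in Python rather than Perl. It assumes the FASTQ has already been extracted from the BAM and that the read names still follow one of the two common Illumina layouts (instrument:lane:tile:x:y or instrument:run:flowcell:lane:tile:x:y). The function names lane_key and split_by_lane are made up for illustration.

```python
from collections import defaultdict


def lane_key(read_name):
    """Return the part of an Illumina-style read name that identifies the lane."""
    # Drop a leading '@' and any comment after whitespace.
    parts = read_name.lstrip("@").split()
    if not parts:
        return "unknown"
    # Drop a trailing '#index/1' style suffix if present.
    fields = parts[0].split("#")[0].split(":")
    if len(fields) >= 7:
        # Assumed layout: instrument:run:flowcell:lane:tile:x:y
        return ":".join(fields[:4])
    if len(fields) == 5:
        # Assumed layout: instrument:lane:tile:x:y
        return ":".join(fields[:2])
    # Anything else does not carry lane info in a layout we recognise.
    return "unknown"


def split_by_lane(fastq_lines):
    """Group four-line FASTQ records by their lane key."""
    lines = [line.rstrip("\n") for line in fastq_lines]
    groups = defaultdict(list)
    for i in range(0, len(lines) - 3, 4):
        record = lines[i:i + 4]
        groups[lane_key(record[0])].append(record)
    return dict(groups)
```

Each group could then be written to its own FASTQ and aligned separately. Reads that end up under "unknown" are the ones for which the lane cannot be recovered from the name.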
Could you provide a bit more information about what "machine/run/lane" information you have? In my experience, if you only have a BAM that does not have an @RG line in the header and a corresponding RG tag on each read, then you will not have enough information to assign the reads in your BAM to their proper groups.
As I understand it, the RG field info can be found in the read name?
E.g. is the template name (the read name in the BAM file) in the format sequencer:lane:tile:coord-x:coord-y?
See e.g. http://en.wikipedia.org/wiki/FASTQ_format
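If the names do follow that layout, the fields needed for a read group can be pulled out of a single read name. The sketch below is an illustration under assumptions: parse_read_name and read_group_line are hypothetical helpers, and the sample and library names have to be supplied by hand because they are not in the read name. Using flowcell.lane (or instrument.lane for the older layout) as both ID and PU is one possible convention, not a requirement.

```python
def parse_read_name(read_name):
    """Split an Illumina-style read name into its run-level fields."""
    parts = read_name.lstrip("@").split()
    core = parts[0].split("#")[0] if parts else ""
    fields = core.split(":")
    if len(fields) >= 7:
        # Assumed layout: instrument:run:flowcell:lane:tile:x:y
        instrument, run, flowcell, lane = fields[:4]
    elif len(fields) == 5:
        # Assumed layout: instrument:lane:tile:x:y (no run or flowcell field)
        instrument, lane = fields[:2]
        run = flowcell = None
    else:
        raise ValueError(f"unrecognised read name layout: {read_name!r}")
    return {"instrument": instrument, "run": run,
            "flowcell": flowcell, "lane": lane}


def read_group_line(read_name, sample, library):
    """Build a tab-separated @RG header line from one read name."""
    info = parse_read_name(read_name)
    # Fall back to the instrument name when no flowcell field is present.
    unit = info["flowcell"] or info["instrument"]
    rg_id = f"{unit}.{info['lane']}"
    return "\t".join([
        "@RG",
        f"ID:{rg_id}",
        f"SM:{sample}",    # supplied by the user, not in the read name
        f"LB:{library}",   # supplied by the user, not in the read name
        "PL:ILLUMINA",
        f"PU:{rg_id}",
    ])
```

A line like this (with the tabs written as \t) is what bwa mem takes through its -R option, so each per-lane FASTQ can be aligned with its own read group.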
The archival BAM that I have only contains reads belonging to one sample (I am not too sure about the library, but I guess it should be only one library as well).
Your information about FASTQ template names is correct, but this information has to carry over into your BAM file. My concern is that, since you don't have the original FASTQ files, you don't have any information about which sequencing lane your read came from.