Hi everyone, I have per-sample, pre-aligned BAM files for already published genomes of human populations. In the header section, I see an RG record like this:
@RG ID:LP6005441-DNA_A09 SM:LP6005441-DNA_A09
It carries only the ID and SM fields (RGID and RGSM). But to reproduce the results with the GATK Best Practices, I have to assign the RG information correctly.
Each BAM represents a single sample from a single library prep, but the library was run on multiple lanes, as the read names indicate, e.g.:
HS2000-630_102:4:2115:1889:70619
HS2000-630_102:3:2311:13151:38215
HS2000-630_102:2:2315:18670:41735
So, to assign RG information that is unique to the group of reads from each lane, I want to split each per-sample BAM file into multiple BAMs by flowcell lane. I can then replace the RG information and apply the MarkDuplicates and BQSR steps correctly.
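To make the idea concrete, here is a minimal Python sketch of deriving one read group per lane from the read names. It assumes the names follow the layout shown above, with the lane as the second colon-separated field. The ID and PU naming scheme and the `lib1` library name are placeholders of mine, not values from the original data. The read names do not contain a flowcell barcode, so the instrument/run prefix stands in for it in PU.

```python
def lane_key(read_name):
    """Return (run, lane) from a name like 'HS2000-630_102:4:2115:1889:70619'."""
    fields = read_name.split(":")
    if len(fields) != 5:
        raise ValueError(f"unexpected read name layout: {read_name!r}")
    return fields[0], fields[1]


def rg_line(sample, run, lane, library, platform="ILLUMINA"):
    """Build a tab-separated @RG header line for one lane.

    ASSUMPTION: the ID/PU scheme below is illustrative; any scheme works
    as long as each lane gets its own unique ID.
    """
    tags = [
        ("ID", f"{sample}.{run}.{lane}"),
        ("SM", sample),
        ("LB", library),
        ("PL", platform),
        ("PU", f"{run}.{lane}"),
    ]
    return "@RG\t" + "\t".join(f"{key}:{value}" for key, value in tags)


names = [
    "HS2000-630_102:4:2115:1889:70619",
    "HS2000-630_102:3:2311:13151:38215",
    "HS2000-630_102:2:2315:18670:41735",
]
sample = "LP6005441-DNA_A09"

# One @RG line per distinct (run, lane) pair seen in the read names.
for run, lane in sorted({lane_key(name) for name in names}):
    print(rg_line(sample, run, lane, library="lib1"))  # "lib1" is a placeholder
```

The same five values are what Picard's AddOrReplaceReadGroups takes as RGID, RGSM, RGLB, RGPL and RGPU.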
I am new to this. Could you please suggest a tool or script that would do this job?
Thanks in advance!
Hi Pierre, does this work on BAM files?
Yes, and it writes BAM too.
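Separately from the tool discussed above, the lane-splitting step from the question can be sketched in plain Python on SAM text. This is only an illustration under stated assumptions, not a tested pipeline: the lane is taken to be the second colon-separated field of the read name, the function name `split_sam_by_lane` and the read-group naming scheme are hypothetical, and the library name must be supplied by the user.

```python
import os
import tempfile


def split_sam_by_lane(sam_lines, out_dir, sample, library):
    """Write one SAM file per (run, lane) and return the sorted output paths.

    The original @RG header line is dropped and replaced by a lane-specific
    one; each record's RG:Z tag is rewritten to match.
    """
    header = []    # header lines other than @RG, copied to every output
    outputs = {}   # read-group ID -> open file handle
    try:
        for line in sam_lines:
            line = line.rstrip("\n")
            if line.startswith("@"):
                if not line.startswith("@RG\t"):
                    header.append(line)
                continue
            fields = line.split("\t")
            run, lane = fields[0].split(":")[:2]
            rg_id = f"{sample}.{run}.{lane}"  # ASSUMPTION: illustrative ID scheme
            if rg_id not in outputs:
                out = open(os.path.join(out_dir, rg_id + ".sam"), "w")
                for header_line in header:
                    out.write(header_line + "\n")
                out.write(f"@RG\tID:{rg_id}\tSM:{sample}\tLB:{library}"
                          f"\tPL:ILLUMINA\tPU:{run}.{lane}\n")
                outputs[rg_id] = out
            # Keep the 11 mandatory fields, drop any old RG tag, add the new one.
            tags = [t for t in fields[11:] if not t.startswith("RG:Z:")]
            record = fields[:11] + tags + [f"RG:Z:{rg_id}"]
            outputs[rg_id].write("\t".join(record) + "\n")
    finally:
        for out in outputs.values():
            out.close()
    return sorted(out.name for out in outputs.values())


# Tiny in-memory demo with two unmapped records from different lanes.
demo = [
    "@HD\tVN:1.6\tSO:unsorted",
    "@RG\tID:LP6005441-DNA_A09\tSM:LP6005441-DNA_A09",
    "HS2000-630_102:4:2115:1889:70619\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:LP6005441-DNA_A09",
    "HS2000-630_102:3:2311:13151:38215\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:LP6005441-DNA_A09",
]
with tempfile.TemporaryDirectory() as tmp:
    for path in split_sam_by_lane(demo, tmp, "LP6005441-DNA_A09", "lib1"):
        print(os.path.basename(path))
```

On real data, the input lines could come from `samtools view -h`, and each per-lane SAM could be converted back with `samtools view -b`. For whole genomes, a tool that reads and writes BAM directly will be much faster than this text-based sketch.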