Question

How To Set RG Header For TableRecalibration?

13.2 years ago

PeterPan ▴ 30

Hi everyone, I have recently started using GATK, and I use Picard's AddOrReplaceReadGroups to add the RG header. I have read http://picard.sourceforge.net/command-line-overview.shtml#AddOrReplaceReadGroups, but I still don't understand what the RG header fields are used for.

For example, I pooled several samples for sequencing. I have a BAM file from sample "NA006"; this sample belongs to library "TOTAL", was sequenced in lane "NO1" with barcode "ATCG", and the sequencing platform is "Illumina".

How should this information be entered into RG header fields such as RGID, RGLB, RGPU and RGSM?

I think this information is useful in the TableRecalibration step, because batch effects exist.

Thanks!

picard gatk • 4.1k views

updated 13.2 years ago by Vikas Bansal ★ 2.4k • written 13.2 years ago by PeterPan ▴ 30

score 3 · Answer 1 · 2012-06-18

A very good example is given at Galaxy.

Example of Read Group usage

Suppose we have a trio of samples: MOM, DAD, and KID. Each has two DNA libraries prepared, one with 400 bp inserts and another with 200 bp inserts. Each of these libraries is run on two lanes of an Illumina HiSeq, requiring 3 x 2 x 2 = 12 lanes of data. When the data come off the sequencer, we would create 12 BAM files, with the following @RG fields in the header:

Dad's data:
@RG     ID:FLOWCELL1.LANE1      PL:illumina     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE2      PL:illumina     LB:LIB-DAD-1 SM:DAD      PI:200
@RG     ID:FLOWCELL1.LANE3      PL:illumina     LB:LIB-DAD-2 SM:DAD      PI:400
@RG     ID:FLOWCELL1.LANE4      PL:illumina     LB:LIB-DAD-2 SM:DAD      PI:400

Mom's data:
@RG     ID:FLOWCELL1.LANE5      PL:illumina     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE6      PL:illumina     LB:LIB-MOM-1 SM:MOM      PI:200
@RG     ID:FLOWCELL1.LANE7      PL:illumina     LB:LIB-MOM-2 SM:MOM      PI:400
@RG     ID:FLOWCELL1.LANE8      PL:illumina     LB:LIB-MOM-2 SM:MOM      PI:400

Kid's data:
@RG     ID:FLOWCELL2.LANE1      PL:illumina     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE2      PL:illumina     LB:LIB-KID-1 SM:KID      PI:200
@RG     ID:FLOWCELL2.LANE3      PL:illumina     LB:LIB-KID-2 SM:KID      PI:400
@RG     ID:FLOWCELL2.LANE4      PL:illumina     LB:LIB-KID-2 SM:KID      PI:400

Note the hierarchical relationship from read groups (unique for each lane) to libraries (each sequenced on two lanes) to samples (each spread across four lanes, two lanes per library).
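
To make that hierarchy concrete, here is a minimal Python sketch that parses the twelve @RG lines above and groups the read group IDs by sample and then by library. The function names (`parse_rg`, `build_hierarchy`) are made up for illustration and are not part of GATK or Picard. It splits on any whitespace, which is enough for these lines, although real SAM headers are tab-delimited.

```python
from collections import defaultdict

# The twelve @RG lines from the example above.
HEADER = """\
@RG ID:FLOWCELL1.LANE1 PL:illumina LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE2 PL:illumina LB:LIB-DAD-1 SM:DAD PI:200
@RG ID:FLOWCELL1.LANE3 PL:illumina LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE4 PL:illumina LB:LIB-DAD-2 SM:DAD PI:400
@RG ID:FLOWCELL1.LANE5 PL:illumina LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE6 PL:illumina LB:LIB-MOM-1 SM:MOM PI:200
@RG ID:FLOWCELL1.LANE7 PL:illumina LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL1.LANE8 PL:illumina LB:LIB-MOM-2 SM:MOM PI:400
@RG ID:FLOWCELL2.LANE1 PL:illumina LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE2 PL:illumina LB:LIB-KID-1 SM:KID PI:200
@RG ID:FLOWCELL2.LANE3 PL:illumina LB:LIB-KID-2 SM:KID PI:400
@RG ID:FLOWCELL2.LANE4 PL:illumina LB:LIB-KID-2 SM:KID PI:400
"""


def parse_rg(line):
    """Turn one @RG header line into a dict mapping tag -> value."""
    fields = line.split()[1:]  # drop the leading "@RG"
    return dict(field.split(":", 1) for field in fields)


def build_hierarchy(lines):
    """Group read group IDs as sample -> library -> [read group IDs]."""
    tree = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if line.startswith("@RG"):
            rg = parse_rg(line)
            tree[rg["SM"]][rg["LB"]].append(rg["ID"])
    return tree


tree = build_hierarchy(HEADER.splitlines())
for sample, libraries in tree.items():
    for library, read_groups in libraries.items():
        print(sample, library, read_groups)
```

Each sample ends up with two libraries, and each library with two read groups, one per lane.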

So for your example, I guess it should be:

@RG     ID:FLOWCELL1.LANE1.NA006      PL:illumina     LB:TOTAL   SM:NA006      PU:ATCG

I chose the RG ID arbitrarily; you can pick your own, but it must be unique for each read group.
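
To connect this back to the Picard options named in the question, here is a small Python sketch that turns the same values into AddOrReplaceReadGroups-style `KEY=VALUE` arguments (RGID, RGLB, RGPL, RGPU, RGSM) and into the matching @RG line. The helper names are invented for illustration, and the flowcell name in the ID is just the arbitrary one used in the example, not something taken from your run.

```python
def read_group_args(rg_id, library, platform, platform_unit, sample):
    """Return Picard-style KEY=VALUE arguments describing one read group."""
    return [
        f"RGID={rg_id}",
        f"RGLB={library}",
        f"RGPL={platform}",
        f"RGPU={platform_unit}",
        f"RGSM={sample}",
    ]


def rg_header_line(rg_id, library, platform, platform_unit, sample):
    """Return the tab-delimited @RG header line for the same values."""
    return "\t".join([
        "@RG",
        f"ID:{rg_id}",
        f"PL:{platform}",
        f"LB:{library}",
        f"SM:{sample}",
        f"PU:{platform_unit}",
    ])


# Values from the question; the ID is an arbitrary but unique label.
example = dict(
    rg_id="FLOWCELL1.LANE1.NA006",
    library="TOTAL",
    platform="illumina",
    platform_unit="ATCG",
    sample="NA006",
)

print(" ".join(read_group_args(**example)))
print(rg_header_line(**example))
```

The printed arguments would be appended to the AddOrReplaceReadGroups command together with its input and output options.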