Question

Changing @RG header information in *.bam files using picard

0

Entering edit mode

9.3 years ago

kirannbishwa01 ★ 1.6k

I am working with multiple bam files which is aligned to my reference (using Bowtie2). I am planning to move forward to variant calling using picard for downstream GATK analysis. However, I realized that my bam files don't have @RG header info (used the following link: https://www.broadinstitute.org/gatk/guide/article?id=1317)

My situation: I have multiple bam files (1 bam file produce from 1 fasta file which was from 1 biological sample). So, if one of my raw fasta had following information in the header: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

which I think is info in casava 1.8 format and has following meanings.

EAS139   the unique instrument name
136      the run id
FC706VJ  the flowcell id
2        flowcell lane
2104     tile number within the flowcell lane
15343    'x'-coordinate of the cluster within the tile
197393   'y'-coordinate of the cluster within the tile
1        the member of a pair, 1 or 2 (paired-end or mate-pair reads only)
Y        Y if the read is filtered, N otherwise
18       0 when none of the control bits are on, otherwise it is an even number
ATCACG   index sequence

How, should I assign names so the raw sequences from the same flowcell/lane can be used by GATK to account for any variablity? Also, I don't want to loose any header information that is useful for sample identification or analysis.

I used the following command and it seems to work, but I am not exactly sure which/what information from the fasta records needs to be added to ID, SM, PL, LB, PU, etc.

java -jar picard.jar AddOrReplaceReadGroups I=BowtieOut_Sp154-4g.sorted.1.bam ID=group1 SM=sample1 PL=ILLUMINA LB=Sp154 PU=unit1 O=bamforGATK.bam

Thanks in advance :)

bam picard RG-header sequence alignment • 5.6k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.3 years ago by kirannbishwa01 ★ 1.6k

0

Entering edit mode

I have been trying to assign appropriate @RG values to my bam files before proceedings to any further analyses on my bam files (I could also do realignment using BWA or Bowtie2 if necessary).

Say, I have two populations and have sequenced gDNA from 6 biological samples (per population). I obtain raw fasta files which have following metadata information:

Population MA (6 samples) with following sample:metadata information

MA605 (both FR and RR): @HWI-ST588:81:C080WACXX:7:1101:1365:2377 1:N:0:GTGAAA
MA611: @HWI-ST588:81:C080WACXX:7:1101:1405:2307 1:N:0:GTCCGC
MA622: @HWI-ST588:82:D0CWRACXX:8:1101:1469:2246 1:N:0:CCGTCC
MA625: @HWI-ST588:82:D0CWRACXX:8:1101:1478:2200 1:N:0:ATGTCA
MA629: @HWI-ST588:81:C080WACXX:7:1101:1450:2287 1:N:0:GTAGAG
Ncm8: @HWI-ST588:81:C080WACXX:7:1101:1630:2292 1:N:0:GTGGCC

Population SP (6 samples) with following samples:metadata information

Sp154: @HWI-ST588:83:D0D0MACXX:4:1101:1431:2134 1:N:0:CGTACG
Sp164: @HWI-ST588:83:D0D0MACXX:4:1101:1168:2169 1:N:0:GGTAGC
SP21: @HWI-ST588:83:D0D0MACXX:5:1101:1168:2190 1:N:0:CTATAC
Sp3-5g: @HWI-ST588:83:D0D0MACXX:4:1101:1176:2232 1:N:0:GTTTCG
Sp76-3g: @HWI-ST588:83:D0D0MACXX:4:1101:1246:2125 1:N:0:GAGTGG
SpNor33: @HWI-ST588:83:D0D0MACXX:5:1101:1195:2154 1:N:0:CTAGCT

My understanding is that the metadata represents the following information in casava 1.8 format: https://en.wikipedia.org/wiki/FASTQ_format

@uniq_inst_name:runid:flowcell_ID:f_lane:tile#f_lane:x_co:y_co mem:filt(Y/N):bits:idx_seq

I then followed information on this link https://www.broadinstitute.org/gatk/guide/article?id=1317 to assign appropriate @RG information to downstream GATK analysis

For sample SpNor33-16:

ID = Illumina flowcell + lane name & number = 83_ D0D0MACXX_5
SM = SpNor33 (represents an unique sample)
PL = Illumina (sequencing platform)
LB = Sp (population)

For sample SP21: shares the same ID = 83_ D0D0MACXX_5 (as it was sequenced in the same flowcell, lane name & number) but is a different biological sample (SpNor33) and also belongs to the same population (Sp).

ID = Illumina flowcell + lane name & number = 83_ D0D0MACXX_5
SM = SP21
PL = Illumina
LB = Sp

For sample Sp154: @HWI-ST588:83:D0D0MACXX:4:1101:1431:2134 1:N:0:CGTACG

ID = Illumina flowcell + lane name & number = 83_ D0D0MACXX_4
SM = Sp154
PL = Illumina
LB = Sp

For sample MA625: @HWI-ST588:82:D0CWRACXX:8:1101:1478:2200 1:N:0:ATGTCA

ID = Illumina flowcell + lane name & number = 82_ D0CWRACXX_8
SM = MA625
PL = Illumina
LB = MA

My situation: Am I assigning the @RG information appropriately? Why/why not? I am confused about the LB part (should it be same sample sequenced in different library? or something else). If assigning it as "population - SP vs. MA" is not correct what should I do? What if I want to provide a "population level-tag" to each biological sample?

It's been a long question, but if you can help in any way. Thanks in advance!

ADD REPLY • link updated 2.3 years ago by Ram 44k • written 9.3 years ago by kirannbishwa01 ★ 1.6k

Ram · Answer 1 · 2015-09-21

2

Entering edit mode

9.3 years ago

Istvan Albert 102k

This is interesting question since we use these terms frequently but not always correctly. The hierarchy is

Sample --> Library -> Read group

What this means is that:

different samples cannot have the same library id
different libraries cannot have the same read group id

More relevant threads

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.3 years ago by Istvan Albert 102k

0

Entering edit mode

does this hold true with multiplexed samples?

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by drtamermansour ▴ 40

0

Entering edit mode

Yeah, I think so; it holds true for multiplexed samples. In multiplexed samples applying RG become even more necessary as there might be batch effects which needs to be addressed appropriately during statistical analyses.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 8.9 years ago by kirannbishwa01 ★ 1.6k