I have BAM files whose RG tag is the same for all samples, and I need to add proper read groups to each of them. Please note that these are sample-specific BAM files. First, I checked the existing read groups:
$samtools view -H 4029_PPNI_WGS.bam | grep "^@RG"
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chrX LN:155270560
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr20 LN:63025520
@SQ SN:chrY LN:59373566
@SQ SN:chr19 LN:59128983
@SQ SN:chr22 LN:51304566
@SQ SN:chr21 LN:48129895
@SQ SN:chrM LN:16571
@RG ID:DDGD PL:illumina LB:HQ SM:4029
All samples have the same RGID, which is DDGD.
I was looking at Picard tools, and this is what they suggest for replacing read groups:
java -jar picard.jar AddOrReplaceReadGroups \
I=input.bam \
O=output.bam \
RGID=4 \
RGLB=lib1 \
RGPL=ILLUMINA \
RGPU=unit1 \
RGSM=20
If I run the above command, it assigns only one RGID to every read in a BAM file. What should my strategy be to replace/assign RGIDs in a BAM correctly?
I do have the read information (shown below), but I am not sure how to derive RGIDs from it for this BAM.
an@virtual-workstation:/WGS/WGS$ samtools view 4029_PPNI_WGS.bam | head
HS2000-1111A_136:4:1303:15669:31420 99 chr1 10000 254 56M1I6M1I6M1I6M1I22M = 10096 196 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCTAAACCCTAAACCCTCAACCCTAACCCTAACCCTAACC CCCFFFFFGHHHHJJJJJJIIIJJJJIJJJJJJIJJJJGGIJEDFHHIC9FGGJE>D;=DCA(77???################################ BC:Z:0 XD:Z:N55^1$6^1$6^1$6^1$22 SM:i:16 AS:i:420
HS2000-1111A_136:4:1207:4085:83323 163 chr1 10001 254 100M = 10166 265 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCTAACCCTAAC @CCDDFFFHFDFFDHHIIIIJIIIEGHGIIJIJIIHGHGEIIEHDHEFEIGHGHGICC===EC@BDDDF9>=@=C@?B@CDBDBB?C,99>@>(222?C? BC:Z:0 XD:Z:87C12 SM:i:2 AS:i:953
HS2000-1111A_136:6:2108:7980:36762 99 chr1 10001 65 100M = 10251 350 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACACTCACCCTAACCCTAACCCTAACCCTAACCCTAAC @@@FFFDAHHHBHEHIJJJFFHHGEEHGGGHIGGIIIECBDE;FBF;B(==;@F############################################## BC:Z:0 XD:Z:64C2A32 SM:i:9 AS:i:65
HS2000-1111A_136:4:2208:20673:80720 99 chr1 10001 254 100M = 10276 375 TAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAC @@?DDDDDBFFBDDAHEGE;?BBF@BFFGBDBDAFGD>?BDFB@;;?FDC;@AFA@DG9@9?;?B=>B;;AC>=CBBB?C???B299?BB8<9A?A<33< BC:Z:0 XD:Z:100 SM:i:9 AS:i:503
HS2000-1111A_136:5:2202:3274:84881 99 chr1 10002 94 36M1I14M1I6M1I9M2I8M1I21M = 10098 196 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAAACCCTAACCCCAACCCCTAACCCCTAACCCCTCACCCCTACCCCCAAACCCCAACCCTAACC CCCFFBDDFHHHGC@HHIIIIIIIIIIC)@?DCFG3DD*??G2?FDHH0;;;4@@1CC########################################## AM:i:0 BC:Z:0 XD:Z:36^1$11T2^1$6^1$9^2$T1A5^1$A3T5T10 SM:i:0 AS:i:94
HS2000-1111A_136:5:2210:7977:80403 163 chr1 10002 254 100M = 10443 541 AACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCACCCCTAACC CCCFFFFFHHHHHJJJJJFIIJIIIJJJIGJEIJJJJIJJJJGGJJJJIGFHHIIJJHIJHHHHFFFDFBCEDCDABD;(5(,555?B?########### BC:Z:0 XD:Z:89T1A8 SM:i:12 AS:i:913
HS2000-1111A_136:4:1308:2945:46018 99 chr1 10004 254 100M = 10156 252 CCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT 8?=DDFFFFDFFHIIIIIHIIIIIIIIIIGIIGC)B?B88?D@DH;BFHGC=CF;(.=@GFE;2?@B9;2;;>=;2;?229555(9((99ABB8<3(2?8 BC:Z:0 XD:Z:100 SM:i:7 AS:i:876
HS2000-1111A_136:5:2206:3416:11292 99 chr1 10005 254 100M = 10105 200 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA CCCFFFFFHHHHHJJIJJJJJJIIJJIJJGHJGIJIEIIFIEGICHECFH@HIFIICGHEHF6?D>BF>6ACAB9;A?AA<AC5?A9(928?833+8?## BC:Z:0 XD:Z:100 SM:i:17 AS:i:776
HS2000-1111A_136:6:1316:17054:3007 99 chr1 10005 254 100M = 10287 382 CCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTA CCCFFFFFHHHHHJJIJJJJIJJJJJJJIJJJIJJJJJIIIIJJGGJIJJGGGIJJGIGIHHGHEFFFBCCEEDBA?BBDBC?BDD?C(2?B1(9<AB## BC:Z:0 XD:Z:100 SM:i:17 AS:i:906
HS2000-1111A_136:4:1115:14938:75430 99 chr1 10006 254 52M2I46M = 10169 263 CTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCT CCCFFFFFHHHHHJJJJJJIIIJJJIIJJJJGGIIJJGHIJJJGGIIII@BGFHJDG<E7ABEBF);AB@;5=AB?A3<AB<?9ABB<?(2<?<C<BBD< BC:Z:0 XD:Z:52^2$46 SM:i:16 AS:i:501
We cannot answer if we don't know how those groups should be assigned (multiple libraries, multiple centers, multiple lanes, etc.).
Hi Pierre, I do have read information as shown in my question (I just updated). Is there a way I could use it?
I can see these could be used as RGIDs. What else do I need to use and how do I do it?
You want to use lane numbers as "Read Groups"?
I think so, because these are unique. What I have right now is one RGID (Project name) for all 1000+ samples
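To see which lanes are actually present before deciding on read groups, the read names can be tallied. This is a minimal sketch, assuming the read names follow the `instrument_run:lane:tile:x:y` layout seen in the question; `lane_counts` is a hypothetical helper name, not part of samtools.

```shell
# Hypothetical helper: reads SAM records on stdin and prints
# "<instrument_run>:<lane><TAB><read count>" for each lane found.
# Assumes the first two colon-separated fields of the read name
# identify the run and the lane.
lane_counts() {
  cut -f1 |                 # keep the read name (column 1)
    cut -d: -f1,2 |         # keep "instrument_run:lane"
    sort | uniq -c |        # count reads per lane
    awk '{ print $2 "\t" $1 }'
}

# Intended use (not executed here):
#   samtools view 4029_PPNI_WGS.bam | lane_counts
```

On the ten reads shown in the question this would report lanes 4, 5 and 6 of run `HS2000-1111A_136`.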
So you would actually be using something like this
That's right. But how do I add three read groups to one BAM? I am trying to use Picard's AddOrReplaceReadGroups.
@swbarnes2 posted about how to do that with this caveat.
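One possible way to put several read groups into a single BAM, sketched independently of the post mentioned above, is to rewrite the SAM text: replace the shared `@RG` header line with one line per lane and tag every read with the read group matching the lane in its name. The helper name `retag_by_lane`, the `sample.lane` ID scheme and the use of the lane number as `PU` are all assumptions made for illustration; reads from a lane that is not listed would end up with a read group missing from the header, so the lane list should be checked first.

```shell
# Hypothetical helper (not from the thread): reads SAM text on stdin and
# writes SAM text on stdout with one read group per lane.
#   $1 = sample name, $2 = space-separated lane numbers, $3 = library name
retag_by_lane() {
  awk -v sm="$1" -v lanes="$2" -v lb="$3" '
    BEGIN  { FS = OFS = "\t" }
    /^@RG/ { next }             # drop the old shared @RG header line
    /^@/   { print; next }      # pass other header lines through
    !hdr   {                    # before the first read: one @RG per lane
      n = split(lanes, l, " ")
      for (i = 1; i <= n; i++)
        print "@RG", "ID:" sm "." l[i], "PL:illumina", "LB:" lb, "SM:" sm, "PU:" l[i]
      hdr = 1
    }
    {
      split($1, q, ":")         # q[2] is the lane field of the read name
      out = $1
      for (i = 2; i <= NF; i++) # copy fields, skipping any old RG:Z: tag
        if (i < 12 || $i !~ /^RG:Z:/) out = out OFS $i
      print out, "RG:Z:" sm "." q[2]
    }'
}

# Intended use (not executed here):
#   samtools view -h 4029_PPNI_WGS.bam |
#     retag_by_lane 4029 "4 5 6" HQ |
#     samtools view -b -o 4029_PPNI_WGS.rg.bam -
```

Whether lanes are the right unit for read groups still depends on how the libraries were sequenced, which only the data provider can confirm.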
You have read names but do you know which sample each read belongs to? Or the example you show above is just one sample?
If you have 1000 sample-specific files, add the read groups to each file and then merge.
Yes, it is just one sample, shown as an example.
I have 1000 samples with the same problem, not 1000 BAMs for each sample.
For each sample you need just one read group at a minimum, which would allow you to merge the 1000 BAMs into one for variant calling?
So for sample 1 you can have
For sample 2
and so on
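As an illustration of the one-read-group-per-sample idea, the sketch below prints (without running) one Picard `AddOrReplaceReadGroups` command per BAM, using the options quoted in the question. The helper name `rg_command`, the rule that the sample ID is the part of the file name before the first underscore, the `.rg.bam` output suffix, and the `RGLB`/`RGPU` values are assumptions for the example only.

```shell
# Hypothetical helper: print the Picard command that would give one BAM
# a single read group unique to its sample.
# Assumes file names look like <sample>_<anything>.bam, e.g. 4029_PPNI_WGS.bam.
rg_command() (
  bam=$1
  name=${bam##*/}           # strip the directory part
  sample=${name%%_*}        # text before the first underscore
  out=${bam%.bam}.rg.bam    # write next to the input file
  printf 'java -jar picard.jar AddOrReplaceReadGroups I=%s O=%s RGID=%s RGLB=HQ RGPL=ILLUMINA RGPU=unit1 RGSM=%s\n' \
    "$bam" "$out" "$sample" "$sample"
)

# Intended use (not executed here): review the commands, then run them.
#   for bam in *.bam; do rg_command "$bam"; done > add_rg.sh
```

Printing the commands first makes it easy to confirm that every sample gets a distinct RGID and RGSM before any file is rewritten.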
Thank you. So I don't need three different RGIDs for one sample/BAM? Are you saying I can still merge 1000 samples for joint calling with only one (unique) RGID per sample? I thought the MarkDuplicates step requires all read groups to be defined properly within each BAM.
Are you completely sure that each lane should be a separate sample? Sure, samples could be arranged like that, and I was suggesting that might be the case, but just because it might be like that doesn't mean it is.