Question

How to add RG tag into the optional field in a BAM file

1

7.0 years ago

gundalav ▴ 380

I have a BAM file that looks like this:

[Image: screenshot of the BAM records, not preserved]

As you can see, the barcodes are included as part of the read names.

I'm trying to use a tool called chromVAR, which requires an RG tag to be present in the BAM file. RG tags are used to distinguish reads from different cells or samples (this is single-cell ATAC-seq). Note that this is not the @RG header line but the RG tag carried by every read as an optional field.

This is the example of RG tag as optional field (taken from another BAM file):

[Image: screenshot of a BAM record carrying an RG tag, not preserved]

The value of the RG tag could simply be the barcode of the corresponding read.

My question is: how can I add the RG tag to every read? I looked at Picard AddOrReplaceReadGroups, but it seems to add the read group only to the header, not to every read.

updated 7.0 years ago by finswimmer 16k • written 7.0 years ago by gundalav ▴ 380

0

Do you have a table mapping each barcode to a read group? For example:

CAACCATCACTC   sample10

7.0 years ago by Gabriel R. ★ 2.9k

0

Yes, I have. By the way, what I mean by RG tag is the one attached to every read, not the @RG header line.

7.0 years ago by gundalav ▴ 380

1

Yes, I am aware. This is essentially a demultiplexing problem. If no one answers by the end of the day, I could be cajoled into making a slight modification to deML to account for this.

7.0 years ago by Gabriel R. ★ 2.9k

1

OK, let's try the following:

1) Sort your BAM file by read name, NOT by coordinate.

2) Run the following. The -h flag keeps the header, which samtools needs to write BAM again; the first 12 characters of the read name are taken as the barcode and the character after them as a separator:

samtools view -h sortedWRTnames.bam | awk -F'\t' '{ if(substr($1,1,1)=="@"){print $0}else{ idx=substr($1,1,12); printf("%s\t", substr($1,14)); for(i=2;i<=NF;i++){ printf("%s\t",$i); } printf("XI:Z:%s\t",idx); print("YI:Z:DDDDDDDDDDDD"); } }' | samtools view -b -o sortedWRTnames_withtags.bam -

3) run deML:

deML -i index.txt  -o sortedWRTnames_withtags.demultiplex.bam sortedWRTnames_withtags.bam

A few of these steps can be replaced with pipes. The index.txt file maps each index sequence to an ID:

#Index1 Name
CAACCATCACTC   sample10
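
To make step 2 concrete, here is a minimal sketch of what the awk stage does to a single alignment line. It assumes a 12-base barcode at the start of the read name followed by one separator character; the toy record and the helper name move_barcode_to_tags are made up for illustration.

```shell
# Hypothetical helper: reads SAM text on stdin, writes SAM text on stdout.
# Assumes read names look like <12-base barcode><separator><original name>.
move_barcode_to_tags() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@/  { print; next }                 # header lines pass through unchanged
             { idx = substr($1, 1, 12)       # first 12 characters = barcode
               $1  = substr($1, 14)          # drop barcode plus separator from the name
               print $0, "XI:Z:" idx, "YI:Z:DDDDDDDDDDDD" }'   # dummy index qualities
}

# Toy unmapped record (made up) to show the effect:
printf 'CAACCATCACTC:read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' | move_barcode_to_tags
```

The barcode leaves the read name and reappears as the XI:Z tag, with a placeholder quality string in YI:Z, which are the tags the deML step is given to work with.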

7.0 years ago by Gabriel R. ★ 2.9k

score 2 · Answer 1 · 2018-06-07

Hello,

you could try this:

samtools view -h in.bam | awk -F'\t' '{ if($0 ~ "^@") {print $0} else {split($1,a,":"); if(!sub(/\tRG:Z:[^\t]*/, "\tRG:Z:" a[1])) {$0 = $0 "\tRG:Z:" a[1]}; print} }' | samtools view -b -o out.bam -

If the line doesn't start with @, the first column (the read name) is split on ":", so the barcode should end up in a[1]. sub() then replaces an existing RG:Z tag with this barcode; if the read has no RG tag yet, the tag is appended to the end of the line.
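
A sketch along the same lines, written out as a small shell function so the individual steps are easier to follow. It assumes the barcode is the part of the read name before the first ":", replaces an RG:Z field if one is present, and appends one otherwise. The helper name set_rg_from_name and the toy records are hypothetical.

```shell
# Hypothetical helper: SAM text in, SAM text out.
# Assumes the barcode is the part of the read name before the first ":".
set_rg_from_name() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@/  { print; next }                       # keep header lines as they are
             { split($1, a, ":"); found = 0
               for (i = 12; i <= NF; i++)          # optional fields start at column 12
                 if ($i ~ /^RG:Z:/) { $i = "RG:Z:" a[1]; found = 1 }
               if (!found) $0 = $0 OFS "RG:Z:" a[1]   # no RG tag yet, so append one
               print }'
}

# Toy records (made up): one without an RG tag, one with an old value.
printf 'CAACCATCACTC:read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n' | set_rg_from_name
printf 'GGGG:read2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\tRG:Z:old\tNM:i:0\n' | set_rg_from_name

# With a real file it would sit between two samtools calls:
#   samtools view -h in.bam | set_rg_from_name | samtools view -b -o out.bam -
```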

fin swimmer

score 1 · Answer 2 · 2018-06-07

There is samtools addreplacerg which adds or replaces RG tags in records too, but it is a fixed string rather than derived per barcode. This may be useful if you already have files split up per barcode, but not otherwise.

If they're all mixed together, then you'll need to read in a table and do a lookup yourself. A hacky and badly tested perl 1-liner for this:

samtools view -h in.bam | perl -lne 'BEGIN {$"="\t";open($fh, "rg.txt"); while (<$fh>) {chomp($_);($a,$b)=/(\S+)\s+(\S+)/;$rg{$a}=$b}} if (/^@/) {print;next} ($k)=/^([^|]*)/;if (exists($rg{$k})) {print "$_\tRG:Z:$rg{$k}"} else {print "$_"}' | samtools view -b -o out.bam -

It reads a file called rg.txt which contains barcode and RG tag name per line. Note this doesn't do anything to add these to the @RG header tags, but there are other tools for that - or hack it in situ in the BEGIN block. :-)
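
For the header side, a minimal sketch, assuming the same two-column rg.txt layout (barcode, then read group name): it prints one @RG line per distinct read group, which could then be added to the header section of the SAM stream before converting back to BAM. The helper name rg_header_lines and the example table are hypothetical, and whether chromVAR needs the @RG lines at all is not stated in this thread.

```shell
# Hypothetical helper: print one @RG header line per distinct read group
# named in the second column of a barcode-to-read-group table.
rg_header_lines() {
  awk 'NF >= 2 && !seen[$2]++ { printf "@RG\tID:%s\n", $2 }' "$1"
}

# Example table (made up), using the layout described above:
printf 'CAACCATCACTC sample10\nGGTTAACCGGTT sample10\nACGTACGTACGT sample11\n' > rg_example.txt
rg_header_lines rg_example.txt
```

Read groups that appear under several barcodes are emitted only once, in the order they are first seen.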