As you can see, the barcodes are included as part of the read names.
I'm trying to use a tool called chromVAR that requires an RG tag on the reads in the BAM file. RG tags are used to distinguish reads from different cells or samples (this is single-cell ATAC-seq). Note that this is not the @RG header line but the tag carried by every read in its optional fields.
This is an example of an RG tag as an optional field (taken from another BAM file):
The value of the RG tag could simply be the barcode ID of that read.
My question is: how can I add the RG tag to every read? I looked at Picard AddOrReplaceReadGroups,
but it seems to add the read group only to the header, not to every read.
Yes, I am aware. This is essentially a demultiplexing problem. If no one answers by the end of the day, I could be cajoled into making a slight modification to deML to account for this.
If the line doesn't start with @, the first column (the read name) is split on ":", so the barcode ends up in a[1]. gsub then replaces the value of the RG:Z tag with this barcode.
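For readers who prefer Python over awk, the same per-line logic can be sketched roughly as follows. This is a minimal sketch, not the original command: it assumes the barcode is the first ":"-separated piece of the read name, and the function names are made up for illustration. Unlike a plain substitution, it also appends an RG:Z tag when the read has none.

```python
def add_rg_from_read_name(sam_line):
    """Return the SAM line with RG:Z set to the barcode from the read name."""
    # Header lines start with '@' and are passed through unchanged.
    if sam_line.startswith("@"):
        return sam_line
    fields = sam_line.rstrip("\n").split("\t")
    # Assumption: the barcode is the first ':'-separated piece of QNAME.
    barcode = fields[0].split(":")[0]
    # Keep the 11 mandatory SAM fields, drop any existing RG tag,
    # then append the new per-read RG tag.
    optional = [f for f in fields[11:] if not f.startswith("RG:Z:")]
    return "\t".join(fields[:11] + optional + ["RG:Z:" + barcode]) + "\n"


def rewrite_lines(lines):
    """Apply add_rg_from_read_name to every line of an iterable (hypothetical helper)."""
    for line in lines:
        yield add_rg_from_read_name(line)
```

One way to use it would be to feed it the text output of samtools view -h, write the result of rewrite_lines(sys.stdin) to standard output, and convert that back to BAM with samtools view -b.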
There is also samtools addreplacerg, which adds or replaces RG tags in the records themselves, but it sets one fixed string rather than a value derived per barcode. This may be useful if you already have files split up per barcode, but not otherwise.
If they're all mixed together, then you'll need to read in a table and do the lookup yourself. A hacky and badly tested Perl one-liner for this:
It reads a file called rg.txt, which contains a barcode and an RG tag name on each line. Note that this doesn't add anything to the @RG header lines, but there are other tools for that, or you can hack it in situ in the BEGIN block. :-)
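The table lookup described above can also be sketched in Python. This is only an illustration under stated assumptions, not a translation of the one-liner: it assumes rg.txt has two whitespace-separated columns (barcode, then RG name), that the barcode is the first ":"-separated piece of the read name, and all function names are hypothetical. Reads whose barcode is not in the table are left untouched.

```python
def parse_barcode_table(lines):
    """Build a barcode -> RG name dict from 'barcode RG_name' lines."""
    table = {}
    for line in lines:
        parts = line.split()
        if len(parts) >= 2:  # skip blank or malformed lines
            table[parts[0]] = parts[1]
    return table


def tag_with_lookup(sam_line, table):
    """Set RG:Z on a SAM record using the barcode found in its read name."""
    if sam_line.startswith("@"):
        return sam_line  # header lines pass through unchanged
    fields = sam_line.rstrip("\n").split("\t")
    # Assumption: the barcode is the first ':'-separated piece of QNAME.
    read_group = table.get(fields[0].split(":")[0])
    if read_group is None:
        return sam_line  # unknown barcode: leave the record as it is
    optional = [f for f in fields[11:] if not f.startswith("RG:Z:")]
    return "\t".join(fields[:11] + optional + ["RG:Z:" + read_group]) + "\n"


def rg_header_lines(table):
    """Minimal @RG header lines (ID only), one per distinct RG name."""
    return ["@RG\tID:" + name + "\n" for name in sorted(set(table.values()))]
```

With a file on disk, the table would be built with something like parse_barcode_table(open("rg.txt")). The rg_header_lines helper covers the header step that the one-liner leaves out; its output would need to be written after the existing header lines and before the first read.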
Do you have a correspondence between barcodes and read groups? For example:
Yes, I have. BTW, what I mean by RG tag is the one attached to every read, not the @RG header.
OK, let's try the following:
1) Sort your BAM files by read name, NOT by coordinate.
2) Run the following:
3) Run deML:
A few of these steps can be replaced with pipes. The index.txt file holds the correspondence between sequence and ID: