In bam file from single-cell RNA-seq experiments processed with 10X's cellranger 3.0 pipeline, the @RG:SM tag is set by sample while the @CB is computed from cell barcode.
$ samtools view possorted_genome_bam.bam | head -n 1
> NS500651:49:"RUN":4:12405:6951:17414 256 1 11256 0 97M1S * 0 0 "SEQ" "qual"
> NH:i:5 HI:i:2 AS:i:83 nM:i:6 RE:A:I li:i:0 BC:Z:ATCTTTAG QT:Z:A/AAAAEE CR:Z:GAACGGACAGACGCAA CY:Z:/AAAAEAEEEEEEEEE CB:Z:GAACGGACAGACGCAA-1 UR:Z:TAGCTGTAAT UY:Z:EEEEEEEEEE UB:Z:TAGCTGTAAT RG:Z:MYSAMPLE:0:1:"RUN":4
For cell-specific genotyping, it would make more sense to treat the @CB barcode as individual's identity.
Is there an efficient way to move the @CB tag values to @RG tag ? This way, each unique @RG defines a cellular barcode.
Currently I do:
(1) split the bam by @CB tag (thanks to https://www.biostars.org/p/263346/#263348)
samtools view -@ 32 <input.bam> | cut -f 12- | tr "\t" "\n" | grep "^CB:Z:" | cut -d ':' -f 3 | sort | uniq | while read S; do samtools view -h <input.bam> | awk -v tag="CB:Z:$S" '($0 ~ /^@/ || index($0,tag)>0)' > out/${S}.sam ; done
(2) edit @RG SM based on file name with Picard/AddOrReplaceReadGroups
for s in $(ls out/*.sam); do SM=$(basename $s .sam) ; java -jar /sandbox/apps/bioinfo/binaries/picard/2.19.0/picard.jar AddOrReplaceReadGroups I=$s O=out-RGSM/${SM}.bam RGID=1 RGLB=0.1 RGSM=${SM} RGPU=MYSAMPLE RGPL=ILLUMINA; done
samtools merge -r - out-RGSM/*.bam | samtools view -o merged-RGSM.bam -
samtools index merged-RGSM.bam
(3) call variants per sample
bcftools mpileup -r chr22 -Ou -f GRCh38.fasta merged-RGSM.bam | bcftools call -mv -Ob -o calls-RGSM.bcf
This is very inefficient and memory/disk-consuming. How could I improve this process by either (1) properly replace @RG:SM
with CB
or to run (2) bcftools mpileup with CB
instead of RG:SM
?
Thanks, JB
Hi JB, I am dealing with a similar problem and using your solution, thanks! .. but did you ( or somebody else) find a better solution for this? Cheers, Raúl