Dear Biostar community,
I have come across a strange phenomenon in 10x-sequenced datasets (reproduced with my own data, STARsolo-processed samples, and official 10x samples). I have included two code snippets below for reproducibility.
In every 10x dataset I look at, I observe multiple reads sharing the same UMI (UB) and the same cell barcode (CB), whether or not I take the sample barcode into account, and regardless of the pipeline (Cell Ranger, STARsolo). When I deduplicate the Cell Ranger data with UMI-tools by unique reads (CB and UB, but not unique sequence), I remove about 35% of my data, which seems plausible. However, when I remove every read that has the exact same CB and UB and maps to the same gene (even if the sequence differs), I lose about 85% of my data.
I did see some posts here which, from my humble understanding, suggest that these might indeed be biological replicates of the same sequence. However:
- I am still confused about how it is biologically possible to sequence different regions of the same molecule.
- How do I correctly account for the UMIs in my analysis? In my case, I am not doing expression analysis, so I need to work with the raw BAM.
Thanks in advance!
Links:
Code:
# Keep mapped reads with MAPQ 255 that map to a single locus (NH==1) and carry both the CB (cell barcode) and UB (UMI) tags, then count the reads per CB/UB pair.
in_bam_path="10X210_2.bam"; num_rows=1000000
samtools view -F 4 -q 255 -e '[CB] && [UB] && [NH]==1' "$in_bam_path" | head -n "$num_rows" | awk '{for(i=1;i<=NF;i++){if($i ~ /^CB:Z:/) cb=$i; if($i ~ /^UB:Z:/) ub=$i;} print cb, ub;}' | sort | uniq -c | sort -rn | head
# Similar command, but it first removes exact sequence duplicates, so the counts are the number of different sequences per CB/UB pair.
in_bam_path="10X210_2.bam"; num_rows=1000000
samtools view -F 4 -q 255 -e '[CB] && [UB] && [NH]==1' "$in_bam_path" | head -n "$num_rows" | awk '{for(i=1;i<=NF;i++){if($i ~ /^CB:Z:/) cb=$i; if($i ~ /^UB:Z:/) ub=$i;} print $10, cb, ub;}' | sort | uniq | cut -d ' ' -f2,3 | sort | uniq -c | sort -rn | head
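A complementary check that may help here is to see whether reads sharing a CB/UB pair start at different positions within the same gene. The following is only a minimal sketch: it assumes the BAM carries a `GX:Z:` gene tag (as Cell Ranger output does; groups without it are reported as `NA`), and `umi_group_summary` is a hypothetical helper name, not part of samtools or UMI-tools.

```shell
# Summarise reads per (CB, UB, GX) group from SAM text on stdin.
# Output columns (tab-separated): CB  UB  GX  n_reads  n_distinct_start_positions
umi_group_summary() {
  awk -F'\t' '
    {
      cb = ""; ub = ""; gx = "NA"            # reset the tags for every record
      for (i = 12; i <= NF; i++) {           # optional SAM tags start at field 12
        if ($i ~ /^CB:Z:/) cb = substr($i, 6)
        else if ($i ~ /^UB:Z:/) ub = substr($i, 6)
        else if ($i ~ /^GX:Z:/) gx = substr($i, 6)
      }
      if (cb == "" || ub == "") next         # skip reads lacking either tag
      key = cb "\t" ub "\t" gx
      reads[key]++
      pos = key SUBSEP $3 ":" $4             # reference name and start position
      if (!(pos in seen)) { seen[pos] = 1; starts[key]++ }
    }
    END { for (k in reads) print k "\t" reads[k] "\t" starts[k] }
  ' | sort -t "$(printf '\t')" -k4,4nr       # largest groups first
}
```

It could be fed from the same filter as above, for example `samtools view -F 4 -q 255 -e '[CB] && [UB] && [NH]==1' "$in_bam_path" | head -n "$num_rows" | umi_group_summary | head`. Groups with many reads but several distinct start positions are the ones where the same UMI is seen on different fragments.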
Thank you so much! This indeed clarifies what I observe :)
May I please ask a few additional questions?
I used the following UMI-tools command:
--method=unique
unless you have a very good reason for it. This treats all UMIs as different even if they differ by only a single base. Such UMI pairs are likely to be caused by PCR or sequencing errors, particularly in an experiment with a high duplication rate, like this one.
Thanks! I was under the impression that the UB tag in the BAM file is already the corrected one (compared to the original UR tag, at least in 10x Cell Ranger). But I understand that I cannot make that assumption :(
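One way to check whether near-identical UMIs are still present in the UB tag is to count, within each cell and gene, the pairs of distinct UB values that differ at exactly one base. This is a rough sketch only: it assumes a `GX:Z:` gene tag is present (reads without it are pooled under `NA`), `ub_near_duplicates` is a hypothetical helper name, and the pairwise comparison is just a diagnostic, not the grouping that UMI-tools performs.

```shell
# Count distinct (CB, GX, UB) combinations and one-mismatch UB pairs per (CB, GX).
# Input: SAM text on stdin. Output: "<distinct combinations> <one-mismatch pairs>"
ub_near_duplicates() {
  awk -F'\t' '
    # Number of mismatching positions; -1 if the lengths differ.
    function hamming(a, b,    i, d) {
      if (length(a) != length(b)) return -1
      d = 0
      for (i = 1; i <= length(a); i++)
        if (substr(a, i, 1) != substr(b, i, 1)) d++
      return d
    }
    {
      cb = ""; ub = ""; gx = "NA"            # reset the tags for every record
      for (i = 12; i <= NF; i++) {
        if ($i ~ /^CB:Z:/) cb = substr($i, 6)
        else if ($i ~ /^UB:Z:/) ub = substr($i, 6)
        else if ($i ~ /^GX:Z:/) gx = substr($i, 6)
      }
      if (cb == "" || ub == "") next
      grp = cb "\t" gx
      if (!((grp, ub) in seen)) {            # store each UB once per group
        seen[grp, ub] = 1
        n[grp]++
        umi[grp, n[grp]] = ub
        total++
      }
    }
    END {
      pairs = 0
      for (g in n)                           # all-against-all within a group
        for (a = 1; a < n[g]; a++)
          for (b = a + 1; b <= n[g]; b++)
            if (hamming(umi[g, a], umi[g, b]) == 1) pairs++
      print total + 0, pairs
    }
  '
}
```

If the second number is far from zero on a large sample of reads, that would suggest that single-base UMI neighbours survive in UB, and that a network-based UMI-tools method is worth considering over `--method=unique`. The comparison is quadratic per group, so it is best run on a subset of reads.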
Thanks again for the very informative illustration!