Question

Multiple reads with the same UMI mapped to the same cell but with different sequences

4 months ago

zbidav ▴ 30

Dear Biostar community,

I encountered a strange phenomenon in 10x sequenced datasets (reproduced with my own data, STARsolo samples, and official 10x samples). I have attached two code snippets below for reproducibility.

In every 10x dataset I look at, I observe multiple reads sharing the same UMI (UB) and the same cell barcode (CB), regardless of the sample barcode or the pipeline (Cell Ranger, STARsolo). When I deduplicate the Cell Ranger data with UMI-tools by unique reads (same CB and UB, but not a unique sequence), I remove about 35% of my data, which seems plausible. However, when I remove every read that has exactly the same CB and UB and maps to the same gene (but has a different sequence), I lose about 85% of my data.

I did see some references here which, as far as I understand, suggest that these might indeed be biological replicates of the same sequence. However:

I am still confused about how it is biologically possible to sequence different regions of the same molecule.
How do I correctly account for the UMIs in my analysis? In my case, I am not doing expression analysis, so I need to use the raw BAM.

Thanks in advance!

Links:

Biostars possible reference #9544530
- scSNV: accurate dscRNA-seq SNV co-expression analysis using duplicate tag collapsing

Code:

in_bam_path="10X210_2.bam"; num_rows=1000000; samtools view -F 4 -q 255 -e '[CB] && [UB] && [NH]==1' "$in_bam_path" | head -n "$num_rows" | awk '{for(i=1;i<=NF;i++){if($i ~ /^CB:Z:/) cb=$i; if($i ~ /^UB:Z:/) ub=$i;} print cb, ub;}' | sort | uniq -c | sort -rn | head  # Keeps mapped reads with MAPQ 255 that map to a single location (NH==1) and carry the CB (cell barcode) and UB (UMI) tags, then counts the reads per CB/UB pair.
in_bam_path="10X210_2.bam"; num_rows=1000000; samtools view -F 4 -q 255 -e '[CB] && [UB] && [NH]==1' "$in_bam_path" | head -n "$num_rows" | awk '{for(i=1;i<=NF;i++){if($i ~ /^CB:Z:/) cb=$i; if($i ~ /^UB:Z:/) ub=$i;} print $10, cb, ub;}' | sort | uniq | cut -d ' ' -f2,3 | sort | uniq -c | sort -rn | head  # Similar command, but first removes exact sequence duplicates, so it counts the distinct sequences per CB/UB pair.
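The same kind of count can also be broken down per gene. The following is only a sketch, not part of the original commands: summarise_umi_groups is a hypothetical helper name, and it assumes the gene is stored in a GN:Z: tag (the tag used with UMI-tools later in this thread). It reads SAM text on stdin and reports, for each CB/UB/gene group, the number of reads and the number of distinct alignment start positions.

```shell
# Summarise reads per (cell barcode, UMI, gene) group from SAM text on stdin.
# Output columns: CB UB GENE N_READS N_DISTINCT_START_POSITIONS
# Usage: samtools view -F 4 in.bam | summarise_umi_groups [GENE_TAG]
# Assumption: the gene is stored in a two-letter Z-type tag (default GN).
summarise_umi_groups() {
  gene_tag="${1:-GN}"
  awk -F '\t' -v gt="$gene_tag" '
    {
      cb = ""; ub = ""; gene = ""
      # Optional SAM tags start at column 12
      for (i = 12; i <= NF; i++) {
        if ($i ~ /^CB:Z:/) cb = substr($i, 6)
        else if ($i ~ /^UB:Z:/) ub = substr($i, 6)
        else if (index($i, gt ":Z:") == 1) gene = substr($i, 6)
      }
      # Skip reads that lack any of the three tags
      if (cb == "" || ub == "" || gene == "") next
      key = cb " " ub " " gene
      reads[key]++
      # Column 4 is the leftmost alignment position (POS)
      if (!((key, $4) in seen)) { seen[key, $4] = 1; starts[key]++ }
    }
    END { for (k in reads) print k, reads[k], starts[k] }
  ' | LC_ALL=C sort
}
```

A group with many reads and many distinct start positions is what the two commands above reveal indirectly: one CB/UB pair spread over several different sequences.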

10x UMI BAM scRNAseq Barcode • 471 views

ADD COMMENT • link 4 months ago by zbidav ▴ 30

score 4 · Accepted Answer · 2024-07-10

4 months ago

i.sudbery 20k

In the 10X protocol the amplified cDNA molecules are fragmented after PCR. This means that two fragments from the same original RNA molecule can have different 5' ends. As 10X sequences the 5' end of the fragments, the reads can have different sequences even though they arose from the same starting molecule.

If you are using UMI-tools to do your deduplication, then you need to use the --per-gene switch to use the gene_id, rather than the genomic position, to identify duplicate reads.

To see how this might happen, imagine that you have three mRNAs from the same gene (shown here in different colours):

enter image description here

We use RT to attach UMIs (shown after the polyA site) and CBs etc, and then amplify with PCR:

enter image description here

Now we use random tagmentation to fragment the molecules and add sequencing primers (shown in black):

enter image description here

Finally we sequence these fragments from their 5' sequencing primers. Note how you will get many different 5' ends that came from the same original mRNA molecule (same colour, same UMI).
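The steps above can be mimicked with a toy example. This is purely illustrative and not a model of the real chemistry: fragment_reads is a hypothetical helper, and the cDNA sequence, UMI and cut positions passed to it are made up. Each PCR copy of one molecule is cut at a different position, the UMI stays the same, and the read begins at the cut site.

```shell
# Toy model: one cDNA molecule is PCR-amplified, then each copy is cut at a
# different position. Every fragment keeps the same UMI, but the read starts
# at the cut site, so the read sequences differ.
# Usage: fragment_reads CDNA UMI READ_LEN CUT1 [CUT2 ...]
fragment_reads() {
  cdna="$1"; umi="$2"; read_len="$3"; shift 3
  for cut in "$@"; do
    # awk substr is 1-based: take read_len bases starting at the cut site
    awk -v s="$cdna" -v u="$umi" -v c="$cut" -v n="$read_len" \
      'BEGIN { print u, substr(s, c, n) }'
  done
}
```

Calling fragment_reads ACGTTGCAAGGCTTAACCGG AACCGGTT 6 1 5 9 prints three lines that share the UMI AACCGGTT but carry three different 6-base read sequences, which is the pattern seen in the BAM.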

ADD COMMENT • link 4 months ago by i.sudbery 20k

Thank you so much! This indeed clarifies what I observe :)

May I please ask a few additional questions?

Could you please say something about the polyT/TSO location? I assume they are just after the barcodes you drew?
Maybe I am still a bit confused, but how is the molecule sequenced from the left (the sequence) and then from the right (the barcodes)? Or is it done with paired-end reads, where the sequence gets ~100 bp (pair 1) and the barcodes ~50 bp (pair 2), using Illumina bridge PCR? I apologize for my confusion; I just couldn't find their manual on this, so I am guessing a bit.
Filtering by gene indeed creates accurate, clean data, but I retain, at best, about 15% of the original data. Is that correct?

I used the following UMI-tools command:

umi_tools dedup --stdin=$input_file --log=$out_log --extract-umi-method=tag --umi-tag=UB --cell-tag=CB --gene-tag GN --method=unique --per-cell --per-gene  > $out_bam

ADD REPLY • link 4 months ago by zbidav ▴ 30

See image below.
Yes, the fragment is sequenced by paired-end sequencing using bridge PCR. Again, see image below. The manual you are looking for is here: https://cdn.10xgenomics.com/image/upload/v1710230668/support-documents/CG000732_ChromiumGEM-X_SingleCell3_ReagentKitsv4_CellSurfaceProtein_UserGuide_RevA.pdf (technically, this includes cell surface protein library prep as well, but you can just ignore that bit).
Duplication rates vary wildly by experiment, but 15% unique reads is not massively out of the ordinary. However, I recommend that you don't use --method=unique unless you have a very good reason for it. This treats all UMIs as different even if they differ by only a single base. Such UMI pairs are likely to be caused by PCR or sequencing errors, particularly in an experiment with a high duplication rate, like this one.

enter image description here
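To get a feel for how many UMIs differ by a single base, a rough sketch along these lines may help. It is not UMI-tools' algorithm (its default, --method=directional, also takes read counts into account when merging UMIs); it only lists candidate pairs. umi_hamming1_pairs is a hypothetical helper, and the three-column "CB GENE UMI" input format is an assumption made for this example.

```shell
# List UMI pairs within the same cell and gene that differ by exactly one base.
# Input (stdin): whitespace-separated lines "CB GENE UMI".
# Output: "CB GENE UMI_A UMI_B" for each pair at Hamming distance 1.
umi_hamming1_pairs() {
  # Sort so that each (cell, gene) group is contiguous; -u drops repeated UMIs
  LC_ALL=C sort -u -k1,1 -k2,2 -k3,3 | awk '
    function hamming(a, b,    i, d) {
      if (length(a) != length(b)) return -1
      d = 0
      for (i = 1; i <= length(a); i++)
        if (substr(a, i, 1) != substr(b, i, 1)) d++
      return d
    }
    {
      key = $1 " " $2
      if (key != prev) { n = 0; prev = key }   # start of a new (cell, gene) group
      # Compare the new UMI against every UMI already seen in this group
      for (j = 1; j <= n; j++)
        if (hamming(umis[j], $3) == 1) print key, umis[j], $3
      umis[++n] = $3
    }
  '
}
```

With --method=unique each UMI in such a pair counts as a separate molecule. The comparison is quadratic within each group, so this is only suitable for spot checks on a subset of cells or genes.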

ADD REPLY • link 4 months ago by i.sudbery 20k

Thanks! I was under the impression that the UB tag in the BAM file should be the corrected one (compared to the original UR tag, at least in 10x cellRanger). But... I understand that I cannot make that assumption :(

Thanks again for the very informative illustration!

ADD REPLY • link 4 months ago by zbidav ▴ 30