Question

UMI and CB for the same sample but different lanes in 10X single-cell RNA sequencing

0

2.4 years ago

tien ▴ 40

Hi all,

I have 4 SRR files for the same sample (subject) produced by 10X single-cell sequencing. I ran them through Cell Ranger together and then used pysam to read the output BAM file. I observe that many reads share the same corrected UMI and the same corrected CB, yet reads from different lanes have different read sequences.

So I'm wondering whether, in 10X, the same UMI can be assigned to reads in different lanes of one run.
And how are duplicate reads defined across lanes?

SRR16092728.4385593 1024    0   29954   255 31S85M524N16M   -1  -1  101 AAGCAGTGGTATCAACGCAGAGTACATGGGGAGAATAGTCAAAATTCACAGAGACAGAAGCAGTGGTCGCCAGGAATGGGGAAGCAAGGCGGAGTTGGGCAGCTTGTGTTCAACGGTTTTGTTCGCCTTCCC    array('B', [32, 32, 32, 32, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 14, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 27, 36, 36, 36, 27, 36, 36, 36, 32, 32, 36, 36, 36, 27, 36, 32, 32, 36, 36, 36, 27, 36, 36, 32, 32, 36, 36, 36, 32, 32, 27])    [('NH', 1), ('HI', 1), ('AS', 95), ('nM', 3), ('ts', 30), ('TX', 'ENST00000473358,+401,31S101M'), ('GX', 'ENSG00000243485'), ('GN', 'MIR1302-2HG'), ('fx', 'ENSG00000243485'), ('RE', 'E'), ('xf', 17), ('CR', 'CGCTTCACAGTTCATG'), ('CY', 'AAAAAEAAEEAEEEEE'), ('CB', 'CGCTTCACAGTTCATG-1'), ('UR', 'GTACCACAAT'), ('UY', 'EEEEEEEEEE'), ('UB', 'GTACCACAAT'), ('RG', 'GSE1848781:0:1:unknow_flowcell:0')]
SRR16092726.7359035 0   0   29954   255 31S85M524N16M   -1  -1  101 AAGCAGTGGTATCAACGCAGAGTACATGGGGAGAATAGTCAAAATTCACAGAGACAGAAGCAGTGGTCGCCAGGAATGGGGAAGCAAGGCGGAGTTGGGCAGCTTGTGTTCAACGGTTTTGTTCGCCTTCCC    array('B', [32, 32, 32, 32, 32, 36, 36, 36, 21, 36, 36, 36, 36, 14, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 27, 36, 36, 36, 27, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 32, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 27, 36, 36, 36, 36, 36, 32, 36, 36, 36, 14, 32, 36, 27, 36, 36, 32, 32])    [('NH', 1), ('HI', 1), ('AS', 95), ('nM', 3), ('ts', 30), ('TX', 'ENST00000473358,+401,31S101M'), ('GX', 'ENSG00000243485'), ('GN', 'MIR1302-2HG'), ('fx', 'ENSG00000243485'), ('RE', 'E'), ('xf', 25), ('CR', 'CGCTTCACAGTTCATG'), ('CY', 'AAAAAEEEEEEEEEEE'), ('CB', 'CGCTTCACAGTTCATG-1'), ('UR', 'GTACCACAAT'), ('UY', 'EEEEEAEEEE'), ('UB', 'GTACCACAAT'), ('RG', 'GSE1848781:0:1:unknow_flowcell:0')]
SRR16092727.3015987 0   0   30562   1   1S105M308N26M   -1  -1  131 GGTTTTGTTCGCCTTCCCTGCCTCCTCTTCTGGGGGAGTTAGATCGAGTTGTAACAAGAACATGCCACTGTCTCGCTGGCTGCAGCGTGTGGTCCCCTTACTAGAGTGAGGATGCGAAGAGAAGGTGACTGT    array('B', [32, 32, 32, 32, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 14, 32, 36, 32, 36, 36, 36, 32, 14, 36, 36, 27, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 14, 36, 32, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 21, 32, 36, 36, 14])    [('NH', 4), ('HI', 4), ('AS', 125), ('nM', 3), ('TX', 'ENST00000469289,+296,1S131M'), ('GX', 'ENSG00000243485'), ('GN', 'MIR1302-2HG'), ('fx', 'ENSG00000243485'), ('RE', 'E'), ('xf', 0), ('CR', 'CGCTTCACAGTTCATG'), ('CY', 'AAAAAEEEEEEEEEEE'), ('CB', 'CGCTTCACAGTTCATG-1'), ('UR', 'GTACCACAAT'), ('UY', 'EEEEEEEEEE'), ('UB', 'GTACCACAAT'), ('RG', 'GSE1848781:0:1:unknow_flowcell:0')]
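
A minimal sketch of how the grouping behind these records could be reproduced, assuming a Cell Ranger BAM that carries `CB` and `UB` tags. The helper names (`group_by_barcode_umi`, `summarize_bam`) and the BAM path argument are hypothetical. pysam is imported only inside the file-reading helper, so the grouping logic works on any objects exposing `has_tag`, `get_tag`, `query_name`, `is_duplicate`, and `query_sequence`.

```python
from collections import defaultdict


def group_by_barcode_umi(reads):
    """Group reads by (corrected cell barcode, corrected UMI).

    `reads` is any iterable of objects that behave like
    pysam.AlignedSegment for the attributes used below.
    """
    groups = defaultdict(list)
    for read in reads:
        # Reads without a corrected barcode or UMI cannot be assigned.
        if not (read.has_tag("CB") and read.has_tag("UB")):
            continue
        key = (read.get_tag("CB"), read.get_tag("UB"))
        # The run accession is the part of the read name before the dot,
        # e.g. "SRR16092728" in "SRR16092728.4385593".
        run = read.query_name.split(".")[0]
        groups[key].append((run, read.is_duplicate, read.query_sequence))
    return groups


def summarize_bam(bam_path):
    """Hypothetical wrapper: open a BAM with pysam and group its reads."""
    import pysam  # imported lazily so the grouping logic has no hard dependency

    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return group_by_barcode_umi(bam.fetch(until_eof=True))
```

Groups whose entries span several run accessions, with only some of them flagged as duplicates, are the cases shown in the three records above.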

Thanks for your help.

10X RNA single-cell • 1.8k views

score 2 · Answer 1 · 2022-06-21

2

2.4 years ago

GenoMax 147k

The concept of a lane is sequencer-dependent. On some flowcells a lane is a physical entity (not connected to other lanes on the same flowcell, e.g. NovaSeq/HiSeq), whereas on other sequencers lanes are optical entities (scanned independently, although physically connected to each other, e.g. NextSeq).

If the same library was run on those lanes, the same UMIs can appear in different lanes.

0

Thanks for your response. In the example above, each row is a read from a different lane. The first read is a duplicate of the second (they have identical sequences, and the first has flag=1024), while the third read is entirely different from the other two.

I'm not sure how Cell Ranger knows that the 1st and 2nd reads come from the same molecule while the 3rd does not (its flag=0). Is the UMI assigned before or after the sample is split into lanes? If after, then it seems the same UMI was assigned to the same molecule (1st and 2nd) by coincidence? If before, then how can different molecules have the same UMI?
Also, I'm not sure whether I should interpret these four separate SRR files as 4 lanes. They are definitely from the same sample, and they differ in size.

ADD REPLY • link 2.4 years ago by tien ▴ 40

1

These samples (e.g. https://www.ncbi.nlm.nih.gov/sra/?term=SRR16092725 ) were sequenced on a NextSeq 500, so a single pool ran across all 4 lanes. These lanes are optically separate but not physically separate.

You may find this tutorial useful: https://www.10xgenomics.com/resources/analysis-guides/tutorial-navigating-10x-barcoded-bam-files
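
As a rough illustration of reading such records, here is a small sketch that decodes the SAM duplicate bit and the `xf` tag. The bit meanings in `XF_BITS` are my reading of the 10x BAM documentation and should be treated as an assumption to verify against the documentation for your Cell Ranger version. The names `XF_BITS`, `decode_xf`, and `describe` are illustrative.

```python
# Assumed xf bit meanings (check the 10x BAM documentation for your version).
XF_BITS = {
    1: "confidently mapped to a feature",
    2: "maps to a feature most reads with this UMI did not",
    8: "representative read for the molecule (counted as a UMI)",
    16: "maps to exactly one feature",
}

SAM_DUPLICATE = 0x400  # 1024, the standard SAM "PCR or optical duplicate" bit


def decode_xf(xf):
    """Return the labels of the assumed xf bits that are set."""
    return [label for bit, label in XF_BITS.items() if xf & bit]


def describe(flag, xf):
    """Summarize one record from its SAM flag and xf tag."""
    return {"duplicate": bool(flag & SAM_DUPLICATE), "xf": decode_xf(xf)}


# (flag, xf) pairs taken from the three records in the question.
for flag, xf in [(1024, 17), (0, 25), (0, 0)]:
    print(flag, xf, describe(flag, xf))
```

Under that reading, the second record (xf=25) would be the representative read for its molecule, the first (flag=1024, xf=17) a flagged duplicate of it, and the third (xf=0, NH=4, MAPQ 1) a multi-mapped read that is not confidently assigned, which may be why it is neither counted nor flagged.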

ADD REPLY • link 2.4 years ago by GenoMax 147k

0

I'm using this dataset https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA766693&o=acc_s%3Aa

ADD REPLY • link 2.4 years ago by tien ▴ 40

0

Ah, I think the UMI is assigned before the sample is split into different lanes: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq Then the question is why the same UMI appears on different reads that might come from different molecules (not all of the reads have flag=1024).

ADD REPLY • link 2.4 years ago by tien ▴ 40

0

UMIs go with the corresponding read 2. The structure of 10x libraries is shown here: https://kb.10xgenomics.com/hc/en-us/articles/360000939852-What-is-the-difference-between-Single-Cell-3-and-5-Gene-Expression-libraries-

R1 (26-28 bp depending on kit version) - Contains UMI+Cell barcode
R2 (90 bp) - Actual cDNA sequence
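
To make the R1 layout concrete, a minimal sketch of splitting R1 into barcode and UMI. It assumes a 16 bp cell barcode followed directly by the UMI (10 bp for a 26 bp R1, matching the 10-base `UR` values in the records above, or 12 bp for a 28 bp R1). The function name `split_r1` is illustrative, and the example R1 is simply the `CR` and `UR` values from the question joined together.

```python
CB_LEN = 16  # assumed cell barcode length at the start of R1


def split_r1(r1_seq, umi_len):
    """Split an R1 sequence into (cell barcode, UMI)."""
    if len(r1_seq) < CB_LEN + umi_len:
        raise ValueError("R1 is shorter than barcode + UMI")
    barcode = r1_seq[:CB_LEN]
    umi = r1_seq[CB_LEN:CB_LEN + umi_len]
    return barcode, umi


# CR + UR from the first record in the question, joined into a 26 bp R1.
print(split_r1("CGCTTCACAGTTCATGGTACCACAAT", umi_len=10))
# ('CGCTTCACAGTTCATG', 'GTACCACAAT')
```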

ADD REPLY • link 2.4 years ago by GenoMax 147k

0

Thanks for the information. What I mean is that the total number of unique UMIs is much larger than my dataset, which has 500 million reads, yet we still see the same UMI on reads that come from different molecules when the library is prepared?

ADD REPLY • link 2.4 years ago by tien ▴ 40
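
For a sense of scale on the numbers in the last reply, a back-of-the-envelope sketch. The `UB` values in the records above are 10 bases long, so at most 4^10 distinct UMI sequences exist, which is far fewer than 500 million reads. The function name `umi_space` is illustrative.

```python
def umi_space(umi_len, alphabet_size=4):
    """Number of distinct UMI sequences of a given length."""
    return alphabet_size ** umi_len


total_reads = 500_000_000  # read count quoted in the thread

print(umi_space(10))  # 1048576
print(umi_space(12))  # 16777216
print(total_reads > umi_space(10))  # True
```

A UMI is therefore only expected to be unique within a cell barcode (and gene), not across the whole dataset. As far as I understand the 10x 3' workflow, cDNA is also amplified before fragmentation, so a single tagged molecule can give rise to reads with different sequences that share one CB and UMI.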