Hi all,
I have 4 SRR files for the same sample (subject), produced by 10X single-cell sequencing. I ran them through Cell Ranger together and then used pysam to read the output BAM file. I see many reads that share the same corrected UMI and the same corrected cell barcode (CB), yet the reads from different lanes have different sequences.
- So I'm wondering whether, in 10X, the same UMI can be assigned to reads in different lanes of one run?
- And how are duplicate reads defined across lanes?
SRR16092728.4385593 1024 0 29954 255 31S85M524N16M -1 -1 101 AAGCAGTGGTATCAACGCAGAGTACATGGGGAGAATAGTCAAAATTCACAGAGACAGAAGCAGTGGTCGCCAGGAATGGGGAAGCAAGGCGGAGTTGGGCAGCTTGTGTTCAACGGTTTTGTTCGCCTTCCC array('B', [32, 32, 32, 32, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 14, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 27, 36, 36, 36, 27, 36, 36, 36, 32, 32, 36, 36, 36, 27, 36, 32, 32, 36, 36, 36, 27, 36, 36, 32, 32, 36, 36, 36, 32, 32, 27]) [('NH', 1), ('HI', 1), ('AS', 95), ('nM', 3), ('ts', 30), ('TX', 'ENST00000473358,+401,31S101M'), ('GX', 'ENSG00000243485'), ('GN', 'MIR1302-2HG'), ('fx', 'ENSG00000243485'), ('RE', 'E'), ('xf', 17), ('CR', 'CGCTTCACAGTTCATG'), ('CY', 'AAAAAEAAEEAEEEEE'), ('CB', 'CGCTTCACAGTTCATG-1'), ('UR', 'GTACCACAAT'), ('UY', 'EEEEEEEEEE'), ('UB', 'GTACCACAAT'), ('RG', 'GSE1848781:0:1:unknow_flowcell:0')]
SRR16092726.7359035 0 0 29954 255 31S85M524N16M -1 -1 101 AAGCAGTGGTATCAACGCAGAGTACATGGGGAGAATAGTCAAAATTCACAGAGACAGAAGCAGTGGTCGCCAGGAATGGGGAAGCAAGGCGGAGTTGGGCAGCTTGTGTTCAACGGTTTTGTTCGCCTTCCC array('B', [32, 32, 32, 32, 32, 36, 36, 36, 21, 36, 36, 36, 36, 14, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 27, 36, 36, 36, 27, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 32, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 27, 36, 36, 36, 36, 36, 32, 36, 36, 36, 14, 32, 36, 27, 36, 36, 32, 32]) [('NH', 1), ('HI', 1), ('AS', 95), ('nM', 3), ('ts', 30), ('TX', 'ENST00000473358,+401,31S101M'), ('GX', 'ENSG00000243485'), ('GN', 'MIR1302-2HG'), ('fx', 'ENSG00000243485'), ('RE', 'E'), ('xf', 25), ('CR', 'CGCTTCACAGTTCATG'), ('CY', 'AAAAAEEEEEEEEEEE'), ('CB', 'CGCTTCACAGTTCATG-1'), ('UR', 'GTACCACAAT'), ('UY', 'EEEEEAEEEE'), ('UB', 'GTACCACAAT'), ('RG', 'GSE1848781:0:1:unknow_flowcell:0')]
SRR16092727.3015987 0 0 30562 1 1S105M308N26M -1 -1 131 GGTTTTGTTCGCCTTCCCTGCCTCCTCTTCTGGGGGAGTTAGATCGAGTTGTAACAAGAACATGCCACTGTCTCGCTGGCTGCAGCGTGTGGTCCCCTTACTAGAGTGAGGATGCGAAGAGAAGGTGACTGT array('B', [32, 32, 32, 32, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 14, 32, 36, 32, 36, 36, 36, 32, 14, 36, 36, 27, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 14, 36, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 36, 14, 36, 32, 36, 32, 36, 36, 36, 36, 36, 36, 36, 36, 32, 36, 36, 21, 32, 36, 36, 14]) [('NH', 4), ('HI', 4), ('AS', 125), ('nM', 3), ('TX', 'ENST00000469289,+296,1S131M'), ('GX', 'ENSG00000243485'), ('GN', 'MIR1302-2HG'), ('fx', 'ENSG00000243485'), ('RE', 'E'), ('xf', 0), ('CR', 'CGCTTCACAGTTCATG'), ('CY', 'AAAAAEEEEEEEEEEE'), ('CB', 'CGCTTCACAGTTCATG-1'), ('UR', 'GTACCACAAT'), ('UY', 'EEEEEEEEEE'), ('UB', 'GTACCACAAT'), ('RG', 'GSE1848781:0:1:unknow_flowcell:0')]
Thanks for your help.
Thanks for your response. In the example above, each row is one read, and each read comes from a different lane. The first read is a duplicate of the second (they have identical sequences, and the first has flag 1024), while the third read is completely different from the other two.
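To make the comparison above concrete, here is a minimal sketch of grouping reads by (CB, UB) and counting how many carry the duplicate bit (0x400, i.e. flag 1024). The function names and the `FakeRead` stand-in class are hypothetical and exist only so the example runs without a BAM file. The attributes it uses (`flag`, `reference_start`, `has_tag`, `get_tag`) mirror those of a pysam `AlignedSegment`, so the same functions should work on reads taken from a real file.

```python
from collections import defaultdict

DUPLICATE_FLAG = 0x400  # SAM flag bit 1024: PCR or optical duplicate


class FakeRead:
    """Hypothetical stand-in exposing the few pysam AlignedSegment members used below."""

    def __init__(self, flag, reference_start, tags):
        self.flag = flag
        self.reference_start = reference_start
        self._tags = dict(tags)

    def has_tag(self, name):
        return name in self._tags

    def get_tag(self, name):
        return self._tags[name]


def group_by_molecule(reads):
    """Group reads by (CB, UB); reads missing either tag are skipped."""
    groups = defaultdict(list)
    for read in reads:
        if not (read.has_tag("CB") and read.has_tag("UB")):
            continue
        groups[(read.get_tag("CB"), read.get_tag("UB"))].append(read)
    return dict(groups)


def summarize(groups):
    """Per group: read count, duplicate-flagged count, distinct start positions."""
    summary = {}
    for key, members in groups.items():
        summary[key] = {
            "reads": len(members),
            "duplicates": sum(1 for r in members if r.flag & DUPLICATE_FLAG),
            "starts": sorted({r.reference_start for r in members}),
        }
    return summary


# Flags, start positions and tags copied from the three records in this thread.
tags = {"CB": "CGCTTCACAGTTCATG-1", "UB": "GTACCACAAT"}
reads = [
    FakeRead(1024, 29954, tags),
    FakeRead(0, 29954, tags),
    FakeRead(0, 30562, tags),
]
result = summarize(group_by_molecule(reads))
print(result)
```

With a real file, the `reads` list could be replaced by iterating over `pysam.AlignmentFile(path, "rb")`, where `path` is wherever your Cell Ranger BAM lives.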
These samples (e.g. https://www.ncbi.nlm.nih.gov/sra/?term=SRR16092725 ) were sequenced on a NextSeq 500, so there is only one pool that ran on all 4 lanes. These lanes are optically separate but not physically so. You may find this tutorial useful: https://www.10xgenomics.com/resources/analysis-guides/tutorial-navigating-10x-barcoded-bam-files
I'm using this dataset https://www.ncbi.nlm.nih.gov/Traces/study/?acc=PRJNA766693&o=acc_s%3Aa
Ah, I think the UMI is assigned before the sample is split across the lanes: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq Then the question is why the same UMI appears on different reads that might come from different molecules (not all of these reads have flag 1024).
UMIs go with the corresponding read 2. The structure of 10x libraries is shown here: https://kb.10xgenomics.com/hc/en-us/articles/360000939852-What-is-the-difference-between-Single-Cell-3-and-5-Gene-Expression-libraries-
R1 (26-28 bp depending on kit version) - Contains UMI+Cell barcode
R2 (90 bp) - Actual cDNA sequence
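As a rough illustration of the R1 layout described above, the sketch below splits an R1 sequence into cell barcode and UMI. The function name `split_r1` is hypothetical. The layout (barcode first, then UMI) and the default lengths of 16 and 10 bases are assumptions taken from the 16-base CR and 10-base UR tags in the records posted in this thread. A 28 bp R1 would presumably need a longer `umi_len`, so check your kit version.

```python
def split_r1(r1_seq, barcode_len=16, umi_len=10):
    """Split an R1 sequence into (cell barcode, UMI).

    Assumes the barcode comes first and the UMI follows directly;
    the default lengths are assumptions, not read from any kit definition.
    """
    needed = barcode_len + umi_len
    if len(r1_seq) < needed:
        raise ValueError(f"R1 is {len(r1_seq)} bp, expected at least {needed} bp")
    barcode = r1_seq[:barcode_len]
    umi = r1_seq[barcode_len:needed]
    return barcode, umi


# R1 rebuilt from the CR and UR tags of the records in this thread.
r1 = "CGCTTCACAGTTCATG" + "GTACCACAAT"
barcode, umi = split_r1(r1)
print(barcode, umi)
```

The barcode and UMI recovered this way are what end up in the CR/UR tags of the matching R2 alignment, which is why R2 reads with different cDNA sequences can still carry the same UMI.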
Thanks for the information. What I mean is: the total number of unique UMIs is much larger than my data set, which has 500 million reads, so why do we still get the same UMI, during library preparation, on reads that come from different molecules?