I am trying to use UMI-tools to create a genes x counts matrix for a single-cell RNA-Seq dataset. This can be done using the umi_tools count command.
Below are two (SAM-formatted) lines from my input file. Read assignment status is denoted using the XS tag and gene ID is denoted using the XT tag.
A00303:172:HKJVWDRXX:2:2265:15555:10614_AGCATTCGAGATCGCAAATCCGTCATCCAAGATCGCAGTGGCC_CGATCGGGAA 16 1 4912891 3 109M42S * 0 0 TGACTGTCCTGGAACTCACTCTGTAGACCAGGCTGGCCTCGAATTCAGAAATCCACCTGCCTCTGCCTCCCAAGTGCTGGGATTAAAGGCATGTGCCACCACTGTCCGGTGAAACTGGGAGTTTTAACCAACTCCACTTGCTCTACTGGGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF NH:i:2 HI:i:2 AS:i:99 nM:i:4 XS:Z:Assigned XN:i:1 XT:Z:Rgs20
A00303:172:HKJVWDRXX:1:2110:16034:12633_AGCATTCGTATCAGCAGGAGAACAATCCAACAGCAGAGTGGCC_CACAATTGGC 272 1 5267364 0 28S121M * 0 0 GCCAGAGCATTCGTATCAGCATTTTTTTTTTTTTGTGTTTAGGAAATTGTATCTTAGATCTTGGGTATCTTAGGTTTTGGGCTAATATCCACTTATCAGTGAGTACATATTGTGTGAGTTCCTTTGTGAATGTGTTACCTCACTCAGGA FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFF:FFFFFFF:FFFFFFFF,FFFF,FFFFFFFFFFFFFFFFFFFFFFF:F,FFFFF:FFF:FFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFF: NH:i:7 HI:i:4 AS:i:113 nM:i:3 XS:Z:Unassigned_NoFeatures
I know the dataset has the following numbers of assigned and unassigned reads.
Total reads: 8449032 Assigned reads: 7446773 Unassigned reads: 619281
To count this dataset I am using the following command:
umi_tools count --wide-format-cell-counts --per-gene --gene-tag=XT --assigned-status-tag=XS --per-cell -I assigned_sorted.bam -S counts.tsv.gz
However, I get the following output, with only ~30,000 reads tallied.
INFO Input Reads: 8449032
INFO Read skipped, no tag: 1002259
INFO Number of reads counted: 30971
Does anyone know how I can improve the percentage of assigned reads that are tallied using umi_tools count?
When I add the --ignore-umi option I get a similarly low number of reads (22,727). I will try to extract the number of counts and unique UMIs to confirm that this is not the issue.
When you use --ignore-umi you will only keep one read per position. Because here your "positions" are genes, rather than co-ordinates, it will only keep one read per gene.
Looking at the number of unique UMIs will help, but will not completely solve your problem, because UMI-tools will collapse UMIs that are different if certain criteria are met:
1) The two UMIs differ at no more than (by default) 1 position, and
2) The number of reads with UMI1 is at least roughly twice that of UMI2.
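To make the collapsing rule concrete, here is a minimal Python sketch of it for a single cell/gene combination. It is not UMI-tools' own code: the function names are made up for illustration, and the exact count condition (count1 >= 2 * count2 - 1) is my understanding of the default "directional" method, so treat it as an assumption and check it against the UMI-tools documentation.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def collapse_directional(umi_counts: dict, threshold: int = 1) -> int:
    """Estimate the molecule count for ONE cell/gene from {umi: read_count}.

    Assumed rule: a UMI is absorbed by another UMI when they differ at
    <= threshold positions and count_parent >= 2 * count_child - 1.
    """
    # Visit UMIs from most to least abundant (ties broken alphabetically).
    order = sorted(umi_counts, key=lambda u: (-umi_counts[u], u))
    absorbed = set()
    groups = 0
    for umi in order:
        if umi in absorbed:
            continue
        groups += 1  # this UMI starts a new group (one inferred molecule)
        absorbed.add(umi)
        stack = [umi]
        # Follow chains: an absorbed UMI can in turn absorb rarer neighbours.
        while stack:
            parent = stack.pop()
            for child in order:
                if child in absorbed:
                    continue
                close = hamming(parent, child) <= threshold
                dominant = umi_counts[parent] >= 2 * umi_counts[child] - 1
                if close and dominant:
                    absorbed.add(child)
                    stack.append(child)
    return groups


# Made-up read counts per UMI for one gene in one cell.
example = {"AAAA": 10, "AAAT": 3, "AATT": 1, "CCCC": 5, "CCCG": 4}
print(collapse_directional(example))  # 3 molecules from 5 distinct UMIs
```

In this toy example AAAT and AATT are folded into AAAA, while CCCC and CCCG stay separate because their read counts are too similar. This is why the reported count can be lower than the number of distinct UMIs.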
The assigned reads bit looks right to me: 8,449,032 total reads, of which 7,446,773 are assigned, leaves 8,449,032 - 7,446,773 = 1,002,259 unassigned, which is exactly what UMI-tools reports as "Read skipped, no tag".
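If you want to check these numbers yourself, a rough Python sketch along the following lines tallies reads without a gene tag and counts unique (cell, gene, UMI) combinations from SAM text. The function name tally_sam is hypothetical, and it assumes tab-separated SAM records whose read names end in _<cell barcode>_<UMI>, as in your example lines.

```python
def tally_sam(sam_lines, gene_tag="XT"):
    """Return (total reads, reads without gene tag, unique cell/gene/UMI triples).

    Assumes tab-separated SAM records and read names of the form
    <original name>_<cell barcode>_<UMI>.
    """
    prefix = gene_tag + ":Z:"
    total = 0
    no_tag = 0
    triples = set()
    for line in sam_lines:
        # Skip blank lines and SAM header lines.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        total += 1
        # Optional tags start at the 12th column of a SAM record.
        gene = next(
            (f[len(prefix):] for f in fields[11:] if f.startswith(prefix)),
            None,
        )
        if gene is None:
            no_tag += 1
            continue
        # Split the cell barcode and UMI off the end of the read name.
        _, cell, umi = fields[0].rsplit("_", 2)
        triples.add((cell, gene, umi))
    return total, no_tag, len(triples)
```

You could feed it the output of samtools view assigned_sorted.bam, for example by passing sys.stdin as sam_lines. The third number is an upper bound on what umi_tools count can report, since similar UMIs are merged afterwards.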