Hello,
I have a single-cell protocol that uses UMIs (6 nt long). I have a separate FASTQ file for each cell, sequenced single-end 150 bp after Nextera library prep. I am trying to extract the UMI information but have several questions.
1) When fragmentation occurs after PCR amplification, how is it possible to deduplicate the reads, since some fragments will no longer carry the UMI even though they came from a molecule that had one? Are reads without a UMI discarded from the analysis entirely, even though they may originally have been part of a UMI-tagged molecule?
2) Using a regex, I was able to extract the UMIs. I then trimmed and mapped the reads, counted reads per gene with featureCounts, and ran umi_tools dedup. From there it is not clear to me what the output actually means. Out of 4509 UMIs, 2182 needed to be deduplicated? What does this tell me about my library?
Output from umi_tools extract:
INFO Input Reads: 31466543
INFO regex does not match read1: 29855051
INFO regex matches read1: 1611492
INFO Reads output: 1611492
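For scale, the extract log above can be turned into a match rate. This is only a quick arithmetic sketch using the three numbers from the log; the variable names are made up for illustration.

```python
# Numbers copied from the umi_tools extract log above.
input_reads = 31_466_543
regex_matched = 1_611_492
regex_unmatched = 29_855_051

# Sanity check: matched and unmatched reads should add up to the input.
assert regex_matched + regex_unmatched == input_reads

# Fraction of reads in which the UMI regex was found.
match_fraction = regex_matched / input_reads
print(f"reads with a UMI match: {match_fraction:.2%}")
```

So only roughly 5% of the reads carry a sequence that the regex recognises, and everything downstream (mapping, counting, dedup) runs on that subset.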
Output from featureCounts:
Assigned 3868
Unassigned_Unmapped 221558
Output from umi_tools dedup:
INFO total_umis 4509
INFO #umis 1068
INFO Reads: Input Reads: 4509
INFO Number of reads out: 3212
INFO Total number of positions deduplicated: 2182
INFO Mean number of unique UMIs per position: 1.71
INFO Max. number of unique UMIs per position: 51
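The dedup log above can be summarised with a little arithmetic. As far as I understand the umi_tools output, "positions deduplicated" counts the mapping positions that were processed, not the UMIs that were collapsed; that reading is an assumption worth checking against the umi_tools documentation. The sketch below uses only the numbers from the log.

```python
# Numbers copied from the umi_tools dedup log above.
reads_in = 4509    # "Input Reads"
reads_out = 3212   # "Number of reads out"
positions = 2182   # "Total number of positions deduplicated"

# Reads dropped because they shared a position and a UMI with a kept read.
removed = reads_in - reads_out

# Share of input reads that survive deduplication.
retained_fraction = reads_out / reads_in

# Average number of input reads per processed position.
reads_per_position_in = reads_in / positions

print(f"removed as duplicates: {removed}")
print(f"retained after dedup: {retained_fraction:.1%}")
print(f"input reads per position: {reads_per_position_in:.2f}")
```

Under that reading, the informative number is the difference between reads in and reads out, rather than 2182 itself.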
Thank you for your help
EDIT:
I am using the MATQ-Seq protocol. Unfortunately, the paper is not very clear about how the UMIs are handled in this protocol, and the scripts are not available.
I am also working with single bacterial cells, where a low mapping rate is fairly normal. I usually get 10% of my total reads mapping to the genome.
Are you sure your UMI is still part of the read? I am asking because you mentioned that you have 150 bp single-end reads and that your data is already split into cell-specific FASTQ files.
Several single-cell protocols I know (e.g. the CEL-seq2 protocol) use paired-end sequencing, where one read of the pair contains only the cell barcode and UMI while the second read contains the actual cDNA sequence. So it seems likely to me that your data has already been demultiplexed and deduplicated.
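One way to check whether the UMI is still in the read is to test a handful of sequences against the expected layout. The sketch below is only an illustration with the standard `re` module: the layout (a 6-nt UMI at the 5' end followed by a fixed anchor), the `ANCHOR` sequence, and the function names are all hypothetical placeholders, not the real MATQ-seq primer design.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical read layout: 6-nt UMI, then a fixed anchor, then cDNA.
# ANCHOR is a placeholder, NOT the real primer sequence of any protocol.
ANCHOR = "ACGTACGT"
PATTERN = re.compile(rf"^(?P<umi>[ACGT]{{6}}){ANCHOR}(?P<cdna>[ACGTN]+)$")


def split_umi(seq: str) -> Optional[Tuple[str, str]]:
    """Return (umi, cdna) if the read matches the assumed layout, else None."""
    match = PATTERN.match(seq)
    if match is None:
        return None
    return match.group("umi"), match.group("cdna")


def fraction_with_umi(seqs: Iterable[str]) -> float:
    """Fraction of reads that match the assumed UMI layout."""
    seqs = list(seqs)
    if not seqs:
        return 0.0
    hits = sum(split_umi(s) is not None for s in seqs)
    return hits / len(seqs)


# Toy reads: two carry the assumed UMI + anchor, one is plain cDNA.
reads = [
    "AACCGG" + ANCHOR + "TTGACCATGA",
    "TTGACCATGATTGACCATGA",
    "GGTTAA" + ANCHOR + "CCATG",
]
print(split_umi(reads[0]))
print(fraction_with_umi(reads))
```

If almost no reads match the layout you expect, that would suggest the UMI has already been stripped or was never in this read.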
Are you aware of this dedicated umi-tools guide for single-cell data?