I am trying to analyze some single-embryo RNA-seq data. We use the Smart-seq3 protocol at our NGS facility, and I have received the index-demultiplexed files from them. I would like to deduplicate reads using the UMI sequences. From what I understand, this is what a read from the 5' end of a transcript looks like:
If I understand correctly (from this), the read 1 primer sits on the bottom strand at the s5-ME sequence and extends. So read 1 contains the 5' fragment tag - UMI - cDNA - ME - s7 sequences (in that order), starting from its 5' end. I can specify --bc-pattern=CCCCCCCCCCCNNNNNNNN to extract the fragment tag (11 bp) as the cell barcode and the 8 bp UMI separately from the 5' end of read 1 for the next step, deduplication. Now, my confusion is with read 2. The read 2 primer sits on the top strand at the s7-ME sequence and extends to ME-s5. Therefore, read 2 looks like cDNA - UMI - 5' fragment tag - ME - s5.
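To check my own understanding of what that pattern does, here is a minimal Python sketch of the split, assuming the string-pattern convention where each C marks a cell barcode base and each N marks a UMI base at the 5' end of the read. The function name `split_by_pattern` and the example read are placeholders chosen for illustration; this is not how UMI-tools itself is implemented.

```python
def split_by_pattern(seq, pattern="CCCCCCCCCCCNNNNNNNN"):
    """Split the 5' end of a read according to a C/N string pattern.

    C positions are collected as the barcode (here, the fragment tag),
    N positions as the UMI; everything after the pattern is kept as the
    remaining read (cDNA). Hypothetical helper for illustration only.
    """
    if len(seq) < len(pattern):
        raise ValueError("read is shorter than the barcode pattern")
    barcode = "".join(base for base, p in zip(seq, pattern) if p == "C")
    umi = "".join(base for base, p in zip(seq, pattern) if p == "N")
    remainder = seq[len(pattern):]
    return barcode, umi, remainder


# Illustrative read: 11 bp tag + 8 bp UMI + some cDNA (placeholder sequences).
example_read = "ATTGCGCAATG" + "ACGTACGT" + "GGGTTTAAACCC"
barcode, umi, remainder = split_by_pattern(example_read)
print(barcode, umi, remainder)
```

With this pattern, the first 11 bases end up as the barcode, the next 8 as the UMI, and the rest of read 1 is left as the sequence to align.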
My question is: will UMI-tools (in paired-end mode) be able to detect and extract UMI sequences from the 3' end of read 2? Or do I need to specify --bc-pattern2 using regex and indicate that read 2 has 3' end UMIs? If so, what would the regex pattern be? (I am still learning regex and not great at it.)
For completeness, I have already trimmed my demultiplexed reads (using Trim Galore with default settings), so there are no adapter sequences on the 3' ends of either read. So read 1 looks like 5' fragment tag - UMI - cDNA and read 2 looks like cDNA - UMI - 5' fragment tag.
Okay, so from what I gather, both read 1 and read 2 of a pair get the UMI extracted from read 1? I did a quick grep search and you're right: the majority of read 1 sequences contain the 5' fragment tag, but very few read 2 sequences do. I don't think I will trim read 2 because I'm unsure whether it would make a big difference to my alignment output. Thanks so much for the clarification!
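For anyone who wants to repeat the check without grep, a rough Python equivalent is sketched below. It assumes standard four-line FASTQ records and simply counts how many sequence lines contain the tag anywhere; the function name `tag_fraction`, the tag value, and the toy records are placeholders, not my real data.

```python
from itertools import islice


def tag_fraction(fastq_lines, tag):
    """Return the fraction of reads whose sequence contains `tag`.

    `fastq_lines` is any iterable of FASTQ lines (a list or an open
    file handle); the sequence is assumed to be the 2nd line of every
    4-line record.
    """
    total = 0
    hits = 0
    for seq in islice(fastq_lines, 1, None, 4):
        total += 1
        if tag in seq.strip():
            hits += 1
    return hits / total if total else 0.0


# Toy records (placeholder sequences) standing in for read 1 and read 2.
tag = "ATTGCGCAATG"
read1 = ["@r1", tag + "ACGTACGTGGGTTT", "+", "I" * 25,
         "@r2", tag + "TTTTCCCCGGGAAA", "+", "I" * 25]
read2 = ["@r1", "AAACCCGGGTTTAAACCC", "+", "I" * 18,
         "@r2", "CCCGGGTTTAAACCCGGG", "+", "I" * 18]

print(tag_fraction(read1, tag), tag_fraction(read2, tag))
```

On real data the same function should work on an open file handle, for example one from `gzip.open(path, "rt")` for gzipped FASTQ files.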