Hi everyone,
I am using UMIs for the first time, and the library design is such that the UMI barcode sits between the P7 cluster-forming adapter sequence and the P7 sequencing primer binding adapter sequence. This means that I need to extract UMIs and deduplicate using information from the 3' end of the R1 reads.
I am trying to use umi_tools for this purpose, but I found that pattern matching was yielding far fewer matches than expected. When I assessed base quality at the 3' ends of the reads, I saw the pattern shown in the image: base quality drops dramatically once the adaptor region is reached.
I am wondering if this is a common issue and what the cause could be. First, it's puzzling to me why anyone would choose to sequence UMIs at the ends of reads, where base quality normally drops, but I think that in this case the magnitude of the drop is far beyond the normal reduction from phasing issues. Could anyone suggest what to test for and how to proceed? I do not think I can use these reads with such low-quality UMIs.
Thank you,
Alex
My understanding from the OP's post is that the UMIs are AFTER the adaptor, not before it (and therefore would be after the drop in quality, not before it).
If that understanding is correct, then the problem would be that it is difficult to find the UMI without an intact adaptor sequence to register to.
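To illustrate what "registering to the adaptor" could look like when the adaptor bases are noisy, here is a minimal Python sketch. It is not how umi_tools does it. It just scans a read for the adaptor while tolerating a few mismatches, then takes the bases immediately after it as the UMI. The function names, the adaptor string, and the UMI length are all hypothetical placeholders, and it assumes the UMI follows the adaptor directly with no indels.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_umi_after_adapter(read, adapter, umi_len, max_mismatches=2):
    """Locate `adapter` in `read` allowing mismatches (no indels).

    Returns (adapter_start, umi) for the best-scoring hit that still leaves
    room for a full-length UMI, or None if no window is within the
    mismatch budget.
    """
    best = None  # (mismatches, start)
    last_start = len(read) - len(adapter) - umi_len
    for start in range(last_start + 1):
        window = read[start:start + len(adapter)]
        d = hamming(window, adapter)
        if d <= max_mismatches and (best is None or d < best[0]):
            best = (d, start)
    if best is None:
        return None
    start = best[1]
    umi_start = start + len(adapter)
    return start, read[umi_start:umi_start + umi_len]


# Toy example: a made-up 9 bp "adaptor" with one miscalled base,
# followed by a 6 bp UMI. Neither sequence comes from a real kit.
read = "ACGTACGT" + "GATTACTGG" + "TTGCAA" + "CC"
hit = find_umi_after_adapter(read, "GATTACAGG", umi_len=6, max_mismatches=2)
```

With an exact-match pattern the toy read above would be discarded, which is one plausible reason for getting far fewer matches than expected. Of course, if the UMI bases themselves are low quality, finding them does not make them trustworthy.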
My understanding was that when UMIs were in this position, the idea was that they would be present in the barcode read, not either of the insert reads.
If that is the case, then it should be possible to get the index reads in a separate file, and the data can then be trimmed normally. Would umi_tools be able to use the UMIs present in the index read file, assuming the basecalls are not compromised?
Yeah, shouldn't be a problem.
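For anyone landing here later, a rough sketch of the idea in plain Python: take the UMI from the index-read FASTQ and append it to the matching R1 read name with an underscore. As far as I know, that is the read-name layout umi_tools dedup looks for by default, but check the docs for your version. The function names and `umi_len` are my own placeholders, and this assumes both files list the same reads in the same order.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a 4-line-per-record FASTQ."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def attach_index_umis(r1_handle, index_handle, out_handle, umi_len):
    """Write R1 records with the index-read UMI appended to the read ID.

    Assumes both files are in the same order; zip() stops at the shorter
    file, so compare record counts separately if that matters.
    """
    n = 0
    for (h1, s1, q1), (hi, si, _) in zip(read_fastq(r1_handle),
                                         read_fastq(index_handle)):
        read_id, _, comment = h1.partition(" ")
        if read_id != hi.split(" ", 1)[0]:
            raise ValueError(f"read order mismatch: {h1!r} vs {hi!r}")
        umi = si[:umi_len]
        # Put the UMI on the ID (before any whitespace) so it is still
        # part of the read name after alignment.
        new_header = f"{read_id}_{umi}" + (f" {comment}" if comment else "")
        out_handle.write(f"{new_header}\n{s1}\n+\n{q1}\n")
        n += 1
    return n


# Toy in-memory example; real data would use open() or gzip.open(..., "rt").
r1 = io.StringIO("@read1 1:N:0\nACGTACGT\n+\nIIIIIIII\n")
idx = io.StringIO("@read1 1:N:0\nTTGCAAGG\n+\nIIIIIIII\n")
out = io.StringIO()
written = attach_index_umis(r1, idx, out, umi_len=6)
```

After that, R1 can be adaptor-trimmed and aligned as usual, and deduplication can work from the read names.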