Hello,
I have an example fq file (9_R2.fq.gz) that contains the following read:
GAACCAACCTCTTCGGGTCTATATAGATAGGAAGAGCG...
amongst others. I used the following code to extract cell barcodes and UMI from the reads:
umi_tools extract --extract-method=regex \
–bc-pattern='(?P<discard_1>.*)(?P<cell_1>.{8})(?P<umi_1>.{15}).*' \
--stdin=9_R2.fq.gz --read2-in=9_R1.fq.gz
--stdout=9_R2_ext.fq.gz --read2-out=9_R1_ext.fq.gz
--whitelist=whitelist.txt --log=9_extract.log
whitelist.txt file is a single-column file containing a list of cell barcodes. Amongst others, ACCTCTTC is one of the lines in the file.
According to my understanding, since I am using a regex where the beginning discard_1 is any number of nucleotides, and cell_1 is an 8 nt that matches with one of the whitelist barcodes, the read GAACCAACCTCTTCGGGTCTATATAGATAGGAAGAGCG should be a match (because position 7-12 matches the whitelist barcode). But this is not the case.
- Is my understanding of regex pattern wrong?
- Was the read discarded because of other issues like quality?
Thanks for the detailed reply. I am currently doing barcode-wise extract in a script.
But your last line made me realise my stupidity. In my data, following the cell barcode and UMI, there was a universal adapter (Illumina) - almost in 100% of the reads in all the samples. This is because, these were intentionally used in the constructs to facilitate subsequent PCR amplification with primers designed to attach to the adapter region.
In auto-pilot mode, I had simply "trimmed" the adapters from my reads and was coming up with a solution to find matches!
I am now considering using the untrimmed data and use the following regex:
This makes sense, right?
Would it not make more sense to use the trimmed read, and assume the UMI and cellbarcode were at the end of the read?
As you have it at the moment, you would need an exact match to the complete adaptor sequence, where as you might only have part of the adaptor sequence at the end of the reads. Adding
.*
to the end even suggests that you expect there to be sequence AFTER the adaptor. What sequence could their be after the adaptor sequence?This was initially my thought as well. But then, I ran a few samples and found that there were actually some matches that were in the middle. I wanted to make sure I didn't miss these.
The read length is much longer the length(cell barcode)+length(UMI)+length(adapter) - so I see a lot of noisy reads before / after the construct - specially in cases where the initial nucleic acid material was of poor quality.
So, I was wondering if using the raw data and adapting the regex pattern to match a partial adapter sequence (say 10 nt) and combining it with quality score >20 for UMI region might be a good workaround?
Correct me if I am wrong.
EDITED TO ADD: Just writing my thoughts down makes me question if I even want to consider these low-quality noisy matches. I stand corrected. I will use the beginning / end regex to identify my matches like the following:
Adaptor trimming software is far better tuned to identifying adaptor sequences than a regex ever will be. If the adaptor trimmer can't identify and trim the 3' adatpor (and therefore place the barcode/umi region at the end of the read), then it won't be picked up by a regex either.
Thank you both for your suggestions, my code runs orders of magnitude faster now.