Question

Regex in umi-tools doesn't identify pattern in non-beginning positions

0

Entering edit mode

5 weeks ago

kalaivani ▴ 10

Hello,

I have an example fq file (9_R2.fq.gz) that contains the following read:

GAACCAACCTCTTCGGGTCTATATAGATAGGAAGAGCG...

amongst others. I used the following code to extract cell barcodes and UMI from the reads:

umi_tools extract --extract-method=regex \
–bc-pattern='(?P<discard_1>.*)(?P<cell_1>.{8})(?P<umi_1>.{15}).*' \
--stdin=9_R2.fq.gz --read2-in=9_R1.fq.gz 
--stdout=9_R2_ext.fq.gz --read2-out=9_R1_ext.fq.gz 
--whitelist=whitelist.txt --log=9_extract.log

whitelist.txt file is a single-column file containing a list of cell barcodes. Amongst others, ACCTCTTC is one of the lines in the file.

According to my understanding, since I am using a regex where the beginning discard_1 is any number of nucleotides, and cell_1 is an 8 nt that matches with one of the whitelist barcodes, the read GAACCAACCTCTTCGGGTCTATATAGATAGGAAGAGCG should be a match (because position 7-12 matches the whitelist barcode). But this is not the case.

Is my understanding of regex pattern wrong?
Was the read discarded because of other issues like quality?

whitelist UMI-Tools UMI UMI-TOOLS umi-tools • 597 views

ADD COMMENT • link updated 4 weeks ago by i.sudbery 21k • written 5 weeks ago by kalaivani ▴ 10

score 2 · Answer 1 · 2025-02-13

2

Entering edit mode

4 weeks ago

i.sudbery 21k

Your understanding is correct. UMI tools runs the the regex to extract the cell barcode and umi and then checks the cell barcode against the whitelist. This is at least partly because it is way more efficient to do this, but also because the original UMI-tools didn't handle cell barcodes or whitelists at all, and indeed, many people still run UMI-tools without a barcode whitelist, even when they have cell barcodes.

Moving to what it is you actaully want to do, logically there are is only one way to do it: take each barcode in turn and ask if it is in the sequence. If you wrote seperate code to do this (as in re-wrote extract) you would basically just have to do what you are currently doing with a script.

Might get rid of some of the overhead (and get all the output in a single file) by combining all the options for cell barcodes into a single, horrific regex. You would combine two like so:

'(?P<discard_1>.*)(?P<cell_1>ACCTCTTC|GTTAGAAT)(?P<umi_1>.{15}).*'

Combining hundreds would be similar, but just produce a not very nice looking string. The only question would whether the regex would be too long for the regex compiler to handle. I don't think there is a stated maximum length for regexs in the regex module, which is what UMI-tools uses, but I suspect there is a non-documented maximum length at which the compiler just won't be able to handle it.

Do you really not know anything about the sequence that comes before the cell barcode or after the UMI?

ADD COMMENT • link 4 weeks ago by i.sudbery 21k

0

Entering edit mode

Thanks for the detailed reply. I am currently doing barcode-wise extract in a script.

But your last line made me realise my stupidity. In my data, following the cell barcode and UMI, there was a universal adapter (Illumina) - almost in 100% of the reads in all the samples. This is because, these were intentionally used in the constructs to facilitate subsequent PCR amplification with primers designed to attach to the adapter region.

In auto-pilot mode, I had simply "trimmed" the adapters from my reads and was coming up with a solution to find matches!

I am now considering using the untrimmed data and use the following regex:

'(?P<discard_1>.*)(?P<cell_1>.{8})(?P<umi_1>.{15})(?P<discard_2>${ADAPTER_SEQUENCE}).*'

This makes sense, right?

ADD REPLY • link 4 weeks ago by kalaivani ▴ 10

1

Entering edit mode

Would it not make more sense to use the trimmed read, and assume the UMI and cellbarcode were at the end of the read?

(?P<discard_1>.*)(?P<cell_1>.{8})(?P<umi_1>.{15})$

As you have it at the moment, you would need an exact match to the complete adaptor sequence, where as you might only have part of the adaptor sequence at the end of the reads. Adding .* to the end even suggests that you expect there to be sequence AFTER the adaptor. What sequence could their be after the adaptor sequence?

ADD REPLY • link 4 weeks ago by i.sudbery 21k

0

Entering edit mode

This was initially my thought as well. But then, I ran a few samples and found that there were actually some matches that were in the middle. I wanted to make sure I didn't miss these.

The read length is much longer the length(cell barcode)+length(UMI)+length(adapter) - so I see a lot of noisy reads before / after the construct - specially in cases where the initial nucleic acid material was of poor quality.

So, I was wondering if using the raw data and adapting the regex pattern to match a partial adapter sequence (say 10 nt) and combining it with quality score >20 for UMI region might be a good workaround?

Correct me if I am wrong.

EDITED TO ADD: Just writing my thoughts down makes me question if I even want to consider these low-quality noisy matches. I stand corrected. I will use the beginning / end regex to identify my matches like the following:

^(?P<discard_1>.T)(?P<cell_1>.{8})(?P<umi_1>.{15}).*

(?P<discard_1>.*)(?P<cell_1>.{8})(?P<umi_1>.{15})$

ADD REPLY • link 4 weeks ago by kalaivani ▴ 10

1

Entering edit mode

Adaptor trimming software is far better tuned to identifying adaptor sequences than a regex ever will be. If the adaptor trimmer can't identify and trim the 3' adatpor (and therefore place the barcode/umi region at the end of the read), then it won't be picked up by a regex either.

ADD REPLY • link 4 weeks ago by i.sudbery 21k

1

Entering edit mode

Thank you both for your suggestions, my code runs orders of magnitude faster now.

ADD REPLY • link 4 weeks ago by kalaivani ▴ 10

score 1 · Answer 2 · 2025-02-12

1

Entering edit mode

5 weeks ago

swbarnes2 14k

You could try without the discard; the examples in the tutorial (https://umi-tools.readthedocs.io/en/latest/regex.html) only use that when they want to specify the sequence to be discarded. From the tutorial a sample regex

^(?P<umi_1>.{4}).+(?P<umi_2>.{4})(?P<cell_1>.{3})$

So you could try just '.*(?P<cell_1>.{8})(?P<umi_1>.{15}).*'

But I'm not entirely sure that umi_tools is supposed to work like this. You are going to try every single white list entry in the first, 20 bases? What if one starts at base 3, and another starts at base 10?

In the inDrop example in the tutorial, there is a single fixed sequence, and the cell barcodes flank it. So it looks for the sequence, then figures out the barcodes based on the fixed sequence. But you don't know the barcode, or where it is. Seems to me you have to be sure of one, or the other for this to work.

ADD COMMENT • link 5 weeks ago by swbarnes2 14k

0

Entering edit mode

You are right.

With a bit of back and forth, I figured that umi_tools first matches the regex and then discards / keeps depending on whether the cell barcode is available in the whitelist.

What I ideally want is for umi_tools to look at the whitelist candidates and use that in the regex <cell_1> - so that it is able to find matches. But from my recent understanding, this is not what umi_tools does.

So, I have now written a script that loops through the lines of whitelist and uses the barcode sequence in the regex:

'(?P<discard_1>.*)(?P<cell_1>${BARCODE})(?P<umi_1>.{15}).*'

and this way, I am able to find matches.

The downside is that I search multiple times for the same sample - one for each barcode.

Thankfully my barcodes are in the hundreds and not higher.

If there is a more efficient way to do this, it will be great!

ADD REPLY • link 4 weeks ago by kalaivani ▴ 10