Question

UMI deduplication after common sequence (QIAseq miRNA Library Kit)

1

4.0 years ago

lluc.cabus ▴ 20

Hi everyone,

I have FASTQ files for some miRNA libraries prepared with the QIAseq miRNA Library Kit. I need to extract the UMIs, but the UMI sits after a common sequence that is present in all the reads, like this:

NNNNNNNNNNNNNNNNNNNAACTGTAGGCACCATCAAT*XXXXXXXXXXXX*NNNNNNNNN

Here the Ns are the miRNA sequences, AACTGTAGGCACCATCAAT is the common sequence shared by all the reads, and the run of Xs is the UMI sequence.

How can I remove the common sequence and append the UMI to the read header in the FASTQ file? The difficulty is that around 3-5% of the reads don't contain the exact common sequence. I suppose sequencing errors have changed part of it in those reads, but I don't know how to allow a one-letter mismatch in the common part.

Thank you very much!

RNA-seq miRNA UMI • 4.1k views

updated 4.0 years ago by GenoMax 151k • written 4.0 years ago by lluc.cabus ▴ 20

2

For future visitors: while this question has been solved, QIAGEN also provides a set of web-based tools called GeneGlobe (LINK), which appear to be free as of this writing.

If you are not able to run umi-tools on the command line, you can try GeneGlobe for analysis of QIAseq miRNA data. The handbook for the QIAseq library kit explains how to use it.

4.0 years ago by GenoMax 151k

0

You've got two sets of Ns here - one at the start and one at the end. Are they both miRNA sequences? If not, is it the 3' or the 5' Ns that are the miRNA sequence?

4.0 years ago by i.sudbery 21k

score 7 · Accepted Answer · 2021-05-07

You should be able to do this with UMI-tools using the regex UMI extraction mode.

You can do something like:

umi_tools extract --extract-method=regex \
    --bc-pattern=".+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12}).+" \
    -I input.fastq.gz \
    -S processed.fastq.gz

The {s<=2} means "allow up to two substitutions (mismatches) in the common sequence". Note that this will leave both the Ns at the start of the sequence and the Ns at the end of the sequence intact. If you wish to remove the Ns at the end, then you can use the regex: .+(?P<discard_1>AACTGTAGGCACCATCAAT){s<=2}(?P<umi_1>.{12})(?P<discard_2>.{9})

See more details here: https://umi-tools.readthedocs.io/en/latest/regex.html#regex-regular-expression-mode
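
To make the idea behind {s<=2} concrete, here is a rough, standard-library-only Python sketch of the same kind of extraction on a single read. It is not how umi_tools is implemented, and the function names (hamming, extract_umi) and their parameters are made up for illustration. It also picks the window with the fewest mismatches, which may differ from what the regex engine would choose when several positions match.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def extract_umi(header, seq, qual, common="AACTGTAGGCACCATCAAT",
                umi_len=12, max_mismatches=2):
    """Move the UMI that follows `common` into the read name.

    Hypothetical helper: returns (header, seq, qual) with the common
    sequence and UMI removed, or None if no window of the read matches
    `common` with at most `max_mismatches` substitutions.
    """
    best = None  # (mismatches, start) of the best window seen so far
    # Start at 1 so at least one base precedes the common sequence,
    # and stop early enough that a full-length UMI fits after it.
    for start in range(1, len(seq) - len(common) - umi_len + 1):
        d = hamming(seq[start:start + len(common)], common)
        if d <= max_mismatches and (best is None or d < best[0]):
            best = (d, start)
    if best is None:
        return None

    start = best[1]
    umi_start = start + len(common)
    umi_end = umi_start + umi_len
    umi = seq[umi_start:umi_end]

    # Append the UMI to the read name (the first whitespace-delimited token).
    name, _, rest = header.partition(" ")
    new_header = f"{name}_{umi}" + (f" {rest}" if rest else "")

    # Keep what comes before the common sequence and what follows the UMI.
    new_seq = seq[:start] + seq[umi_end:]
    new_qual = qual[:start] + qual[umi_end:]
    return new_header, new_seq, new_qual


if __name__ == "__main__":
    # Made-up read: 22 nt insert, common sequence with one error, UMI, tail.
    read = "TGAGGTAGTAGGTTGTATAGTT" + "AACTGTAGGCTCCATCAAT" + "ACGTACGTACGT" + "AGATCGGAA"
    print(extract_umi("@read1", read, "I" * len(read)))
```

Reads for which the function returns None are the ones that would fail to match the pattern at all, so they are worth counting separately if the 3-5% figure matters to you.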