Question

Marking duplicates using UMIs

14 months ago

Lipika • 0

Hi,

I am currently preprocessing my FASTQ files to make analysis-ready BAMs. I received CRAM files from the sequencing centre and have done the following steps so far.

1. Convert CRAM to FASTQ
2. Split the FASTQ files by lane (the samples were sequenced across multiple lanes)
3. Trim adaptors and move UMIs to the read headers (using Trimmer from the AGeNT NGS tools, since we used these library kits)
4. Align and sort each of the split BAMs
5. Merge the BAMs for each sample
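Step 2 (splitting by lane) can be sketched in Python as follows. This is only an illustration, assuming the reads carry standard Illumina-style names (`instrument:run:flowcell:lane:tile:x:y`); the function names `lane_key` and `split_by_lane` are hypothetical and are not part of any of the tools mentioned here.

```python
from collections import defaultdict


def lane_key(read_name):
    """Return 'flowcell.lane' from an Illumina-style read name (assumed format)."""
    fields = read_name.lstrip("@").split()[0].split(":")
    if len(fields) < 7:
        raise ValueError(f"not an Illumina-style read name: {read_name!r}")
    # fields[2] is the flowcell ID, fields[3] is the lane number
    return f"{fields[2]}.{fields[3]}"


def split_by_lane(fastq_lines):
    """Group 4-line FASTQ records by flowcell.lane."""
    lines = [line.rstrip("\n") for line in fastq_lines]
    by_lane = defaultdict(list)
    # Step through the file one complete 4-line record at a time
    for i in range(0, len(lines) - 3, 4):
        record = lines[i:i + 4]
        by_lane[lane_key(record[0])].append(record)
    return dict(by_lane)
```

Each resulting key could then serve as the basis for one read group ID per lane.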

Now I want to mark duplicates in my BAMs and have been looking for tools to do this. Here are the tools I have found so far, and what I learnt from trying each:

UmiAwareMarkDuplicatesWithMateCigar (Picard) - Initially I had skipped the FASTQ splitting step, so the BAM had only one RG in its header and the reads were not tagged with an RG. In that state it worked (it took ~10-11 hrs). Now that I have corrected the splitting step, it no longer works: the command runs for days without finishing and gives neither an error nor a result.
fgbio - It fails with an error about missing MQ tags, even after running SetMateInformation (my BAMs do have MQ tags after this step). (Screenshot attached.)
umi_tools dedup - It is taking ages to complete because I have WGS data, so I still don't know what the results will look like or when I will have them.
CReaK (from the AGeNT NGS tools) - It fails with an out-of-memory error every time, even on a subset of my reads, running on a cluster.

[Screenshot: RG header of my BAM]

[Screenshot: BAM file after merging (step 5)]

[Screenshot: BAM file after SetMateInformation for fgbio]

[Screenshot: fgbio error]

Are there any other tools that can use UMIs to mark duplicates only (without generating consensus reads)? Or, if anybody has used these tools successfully, could you suggest what I might be doing wrong?

Thanks,

Lipika

UMI Deduplication • 1.5k views

ADD COMMENT • link 3 months ago by Lipika • 0

UMI-tools is taking a long time because, by default, it error-corrects the UMI sequences (the default is --method=directional), which the other tools don't do by default. You can disable this by choosing one of the other UMI resolution methods with the --method switch.
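To give a feel for why error correction costs time, here is a simplified Python sketch of a count-aware, directional-style collapse of the UMIs seen at a single mapping position. It is not UMI-tools' actual code: `collapse_umis` and `hamming` are hypothetical names, UMIs are assumed to be of equal length, and the rule used (merge a UMI into a neighbour one mismatch away when the neighbour's count is at least twice its own minus one) is my reading of the directional idea. Note the all-against-all comparison, which the naive methods avoid entirely.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(umis):
    """Map each observed UMI to the UMI it is collapsed into (simplified sketch)."""
    counts = Counter(umis)
    # Most abundant UMIs first; ties broken alphabetically for determinism
    ordered = sorted(counts, key=lambda u: (-counts[u], u))
    parent = {}
    for umi in ordered:
        if umi in parent:
            continue  # already absorbed by a more abundant UMI
        parent[umi] = umi
        stack = [umi]
        while stack:
            node = stack.pop()
            for other in ordered:
                if (other not in parent
                        and hamming(node, other) == 1
                        and counts[node] >= 2 * counts[other] - 1):
                    parent[other] = umi
                    stack.append(other)
    return parent


observed = ["ACGT"] * 10 + ["ACGA"] * 2 + ["TTTT"] * 3
mapping = collapse_umis(observed)
n_molecules = len(set(mapping.values()))
```

In this toy input the rare `ACGA` is treated as a sequencing error of the abundant `ACGT`, while `TTTT` remains a separate molecule.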

--method=unique is the quickest and most naive resolution method: it simply assumes that each combination of mapping coordinate and UMI is a distinct molecule. It makes no attempt to correct for errors, but it is very fast.
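The idea behind this kind of exact-match resolution can be sketched in a few lines of Python. This is an illustration only, not UMI-tools' implementation: the `Read` record and `dedup_unique` are hypothetical, and keeping the highest-MAPQ read per key is an assumption made for the example.

```python
from collections import namedtuple

# Hypothetical minimal read record for the example
Read = namedtuple("Read", "name contig pos strand umi mapq")


def dedup_unique(reads):
    """Keep one read per exact (contig, pos, strand, UMI) key; highest MAPQ wins."""
    best = {}
    for read in reads:
        key = (read.contig, read.pos, read.strand, read.umi)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return list(best.values())


reads = [
    Read("r1", "chr1", 100, "+", "ACGT", 60),
    Read("r2", "chr1", 100, "+", "ACGT", 30),  # same key as r1: duplicate
    Read("r3", "chr1", 100, "+", "ACGA", 60),  # one mismatch in UMI: kept as distinct
    Read("r4", "chr1", 101, "+", "ACGT", 60),  # different position: kept
]
kept = dedup_unique(reads)
```

Because it is a single pass with a dictionary lookup per read, there are no pairwise UMI comparisons, which is why this approach is fast.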

Also, dedup removes duplicates; it doesn't mark them.

umi_tools group does something similar to marking: it assigns each read to a group of reads that share the same mapping coordinates and implied UMI. However, it only annotates this group on read 1.
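The difference between grouping/marking and removing can be illustrated with a small Python sketch. It is hypothetical and not how umi_tools group or Picard work internally: reads are plain dicts, grouping uses exact matching on coordinates and UMI, and the best read per group is chosen by MAPQ. The only fixed point is the SAM duplicate flag bit, 0x400.

```python
DUPLICATE_FLAG = 0x400  # SAM flag bit for "PCR or optical duplicate"


def mark_duplicates(reads):
    """Assign a group ID to every read and flag all but the best read per group.

    `reads` is a list of dicts with keys: name, contig, pos, strand, umi,
    mapq, flag (a hypothetical minimal representation for this example).
    """
    groups = {}
    for read in reads:
        key = (read["contig"], read["pos"], read["strand"], read["umi"])
        groups.setdefault(key, []).append(read)

    for group_id, members in enumerate(groups.values()):
        best = max(members, key=lambda r: r["mapq"])
        for read in members:
            read["group"] = group_id
            if read is best:
                read["flag"] &= ~DUPLICATE_FLAG  # representative: clear the bit
            else:
                read["flag"] |= DUPLICATE_FLAG   # duplicate: set the bit
    return reads


reads = [
    {"name": "r1", "contig": "chr1", "pos": 100, "strand": "+", "umi": "ACGT", "mapq": 60, "flag": 0},
    {"name": "r2", "contig": "chr1", "pos": 100, "strand": "+", "umi": "ACGT", "mapq": 30, "flag": 0},
    {"name": "r3", "contig": "chr1", "pos": 250, "strand": "+", "umi": "ACGT", "mapq": 60, "flag": 0},
]
marked = mark_duplicates(reads)
```

No read is dropped here, so downstream tools can decide for themselves whether to ignore flagged reads.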

ADD REPLY • link 14 months ago by i.sudbery 21k

If I used --method=unique, would that be any different from using Picard MarkDuplicates with BARCODE_TAG for the UMIs?

ADD REPLY • link 14 months ago by Lipika • 0

Yeah, I expect it would be more or less the same.

ADD REPLY • link 14 months ago by i.sudbery 21k

Why are you splitting by lane? If you get the same read with the same UMI in two lanes, you don't want to keep them both, do you?

ADD REPLY • link 14 months ago by swbarnes2 15k

I want to use BQSR later, which is read-group aware; that's why I am splitting them.

ADD REPLY • link 14 months ago by Lipika • 0