Hi,
I am currently preprocessing my FASTQ files to make analysis-ready BAMs. I received CRAM files from the sequencing centre, and have done the following steps so far (a rough command sketch follows the list):
1. CRAM to FASTQ conversion.
2. Split the FASTQs by lane (the samples were sequenced across multiple lanes).
3. Trim adapters and move the UMIs to the read headers (using Trimmer from the AGeNT NGS toolkit, since we used these library kits).
4. Align and sort each split FASTQ into its own BAM.
5. Merge the per-lane BAMs of each sample.
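For context, a rough sketch of steps 1, 4 and 5 with placeholder file names (bwa mem is shown only as an example aligner; the lane-splitting and AGeNT Trimmer commands are omitted because their exact options are kit- and version-specific):

# 1. CRAM -> paired FASTQ
samtools fastq -@ 8 -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz \
    -0 /dev/null -s /dev/null -n sample.cram

# 2./3. Split by lane and run AGeNT Trimmer on each lane (kit-specific, not shown)

# 4. Align each lane with its own read group, then coordinate-sort
bwa mem -t 8 \
    -R '@RG\tID:sample.L001\tSM:sample\tLB:lib1\tPL:ILLUMINA\tPU:FLOWCELL.1' \
    ref.fa sample_L001_R1.trimmed.fastq.gz sample_L001_R2.trimmed.fastq.gz \
  | samtools sort -@ 8 -o sample_L001.sorted.bam -

# 5. Merge the per-lane BAMs of a sample (all @RG lines are kept in the header)
samtools merge -@ 8 sample.merged.bam sample_L001.sorted.bam sample_L002.sorted.bam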
Now I want to mark duplicates in these BAMs and have been looking for suitable tools. Here is what I have found, tried, and learnt so far:
UmiAwareMarkDuplicatesWithMateCigar (Picard) - Initially I had missed the FASTQ-splitting step, so the BAM had only one RG in the header and the reads were not RG-tagged; in that setup it worked (took ~10-11 hrs). Now that I have corrected the splitting step, it no longer works: I submit the command and it runs for days with no errors and no results, it just keeps running and never stops.
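For reference, the kind of command I am submitting looks roughly like this (placeholder file names; parameter names quoted from memory, so best checked against picard UmiAwareMarkDuplicatesWithMateCigar --help):

# assumes the UMI ended up in the RX tag (the tool's default UMI_TAG_NAME)
# and that mate-cigar (MC) tags are present on the reads
java -Xmx32g -jar picard.jar UmiAwareMarkDuplicatesWithMateCigar \
    I=sample.merged.bam \
    O=sample.markdup.bam \
    M=sample.dup_metrics.txt \
    UMI_METRICS=sample.umi_metrics.txt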
fgbio - This gives me an error about missing MQ tags, even after running SetMateInformation (my BAMs do have MQ tags after that step). (screenshot attached)
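For what it's worth, this is roughly how I ran SetMateInformation (placeholder file names; as far as I understand, fgbio wants queryname-sorted or queryname-grouped input for this tool):

# queryname-sort, then let fgbio fill in the MC/MQ mate tags
samtools sort -n -@ 8 -o sample.qsorted.bam sample.merged.bam
fgbio SetMateInformation -i sample.qsorted.bam -o sample.matefixed.bam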
umi_tools dedup - Taking ages to complete, as I have WGS data, so I still do not know what results I will get or when I will get them.
CREAK (from the AGeNT NGS toolkit) - Gives an out-of-memory error every time, even on a subset of my reads, running on the cluster.
Are there any other tools that can use UMIs to mark duplicates only (not generate consensus reads)? Or, if anybody has used these tools successfully, could you suggest whether I am doing something wrong?
Thanks,
Lipika
UMI-tools is taking a long time because it error-corrects the UMI sequences by default, which the other methods don't do (by default). You can disable this by choosing one of the other UMI resolution methods with the --method switch. --method=unique is the quickest and most naive resolution method: it just assumes that each read with a unique mapping-coordinate and UMI combination is unique and doesn't attempt to correct for errors, but it is very fast.

Also note that dedup removes duplicates, it doesn't mark them. umi_tools group does something similar to marking (it assigns each read to a group of reads that share the same mapping coordinates and implied UMI). However, it only annotates this group on read 1.

If I used --method=unique, would it be different from using Picard MarkDuplicates with BARCODE_TAG for the UMIs?
Yeah, I expect it would be more or less the same.
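Roughly, the two would look something like this (placeholder file names; this assumes paired-end data, the UMI in the read name with umi_tools' default separator for the umi_tools commands, and the UMI in the RX tag for Picard - check each tool's help for the exact options):

# umi_tools with no UMI error correction; dedup removes duplicates outright
umi_tools dedup --method=unique --paired -I sample.merged.bam -S sample.dedup.bam

# umi_tools group annotates UMI groups instead of removing reads
umi_tools group --method=unique --paired --output-bam \
    -I sample.merged.bam -S sample.grouped.bam --group-out=groups.tsv

# Picard marks (rather than removes) duplicates, using the UMI in the
# RX tag as part of the duplicate key via BARCODE_TAG
java -jar picard.jar MarkDuplicates \
    I=sample.merged.bam O=sample.markdup.bam M=dup_metrics.txt BARCODE_TAG=RX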
Why are you splitting by lanes? If you get the same read with the same UMI in two lanes, you don't want to keep both of them, do you?
I want to use BQSR later, which is read-group aware; that's why I am splitting them.
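For what it's worth, my understanding is that BQSR keys on the @RG lines of the (merged) BAM, so downstream it would look roughly like this (GATK4 syntax, placeholder reference and known-sites files):

# BaseRecalibrator models covariates per read group, so the per-lane
# @RG entries in the merged BAM are what it works from
gatk BaseRecalibrator -I sample.markdup.bam -R ref.fa \
    --known-sites dbsnp.vcf.gz --known-sites known_indels.vcf.gz \
    -O sample.recal.table
gatk ApplyBQSR -I sample.markdup.bam -R ref.fa \
    --bqsr-recal-file sample.recal.table -O sample.recal.bam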