I have miRNA RNA-seq data that I have aligned to miRBase with Bowtie2. When deduplicating with umi_tools dedup, I find that runtime and memory vary enormously between files: some finish in 3-4 minutes using 4-5 GB of RAM, while others take more than 2 hours and more than 100 GB of RAM. The BAM files are very similar in size both before and after deduplication.
Do you know what could be the reason for this? Thank you very much in advance, Lluc
Here is the log of a sample that took more than 2 hours:
# assigned_tag : None
# cell_tag : None
# cell_tag_delim : None
# cell_tag_split : -
# chimeric_pairs : use
# chrom : None
# compresslevel : 6
# detection_method : None
# gene_tag : None
# gene_transcript_map : None
# get_umi_method : read_id
# ignore_umi : False
# in_sam : False
# log2stderr : False
# loglevel : 1
# mapping_quality : 0
# method : directional
# no_sort_output : False
# out_sam : False
# output_unmapped : False
# paired : False
# per_cell : False
# per_contig : False
# per_gene : False
# random_seed : None
# read_length : False
# short_help : None
# skip_regex : ^(__|Unassigned)
# soft_clip_threshold : 4
# spliced : False
# stats : False
# stderr : <_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>
# stdin : <_io.TextIOWrapper name='CA015.bam.sorted.bam' mode='r' encoding='UTF-8'>
# stdlog : <_io.TextIOWrapper name='CA015.bam_dedup.log' mode='a' encoding='UTF-8'>
# stdout : <_io.TextIOWrapper name='CA015.bam.dedup.bam' mode='w' encoding='UTF-8'>
# subset : None
# threshold : 1
# timeit_file : None
# timeit_header : None
# timeit_name : all
# tmpdir : None
# umi_sep : _
# umi_tag : RX
# umi_tag_delim : None
# umi_tag_split : None
# unmapped_reads : discard
# unpaired_reads : use
# whole_contig : False
2022-01-26 13:58:28,207 INFO command: dedup --stdin=CA015.bam.sorted.bam --log=CA015.bam_dedup.log --stdout=CA015.bam.dedup.bam
2022-01-26 13:59:01,275 INFO Written out 100000 reads
2022-01-26 13:59:50,099 INFO Written out 200000 reads
2022-01-26 14:00:17,556 INFO Written out 300000 reads
2022-01-26 14:00:24,747 INFO Parsed 1000000 input reads
2022-01-26 15:43:09,464 INFO Written out 400000 reads
2022-01-26 15:43:09,478 INFO Written out 500000 reads
2022-01-26 15:43:34,766 INFO Written out 600000 reads
2022-01-26 15:44:08,201 INFO Written out 700000 reads
2022-01-26 15:45:06,353 INFO Written out 800000 reads
2022-01-26 15:47:24,894 INFO Written out 900000 reads
2022-01-26 15:47:31,984 INFO Parsed 2000000 input reads
2022-01-26 15:47:34,439 INFO Written out 1000000 reads
2022-01-26 15:48:22,124 INFO Written out 1100000 reads
2022-01-26 15:49:38,812 INFO Written out 1200000 reads
2022-01-26 15:56:26,068 INFO Written out 1300000 reads
2022-01-26 15:56:28,755 INFO Parsed 3000000 input reads
2022-01-26 16:03:26,343 INFO Written out 1400000 reads
2022-01-26 16:18:47,601 INFO Written out 1500000 reads
2022-01-26 16:18:47,605 INFO Written out 1600000 reads
2022-01-26 16:19:53,921 INFO Written out 1700000 reads
2022-01-26 16:21:05,581 INFO Written out 1800000 reads
2022-01-26 16:22:14,632 INFO Written out 1900000 reads
2022-01-26 16:22:15,241 INFO Parsed 4000000 input reads
2022-01-26 16:22:28,080 INFO Reads: Input Reads: 4005923
2022-01-26 16:22:28,080 INFO Number of reads out: 1940951
2022-01-26 16:22:28,080 INFO Total number of positions deduplicated: 1352
2022-01-26 16:22:28,080 INFO Mean number of unique UMIs per position: 1836.91
2022-01-26 16:22:28,080 INFO Max. number of unique UMIs per position: 479850
# job finished in 8639 seconds at Wed Jan 26 16:22:28 2022 -- 8575.97 62.63 0.00 0.00 -- 36bc9d11-7b8f-4d0e-a7bb-dd6495d8027f
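The last two log lines point at the likely cause: a single position with 479,850 unique UMIs. The default directional method builds a network of UMIs at each position, connecting UMI A to B when they differ by one base and count(A) >= 2*count(B) - 1, which requires pairwise comparisons, so the cost at a position grows roughly quadratically with the number of unique UMIs there. A toy sketch of the idea (this is my simplified illustration, not umi_tools' actual, heavily optimized implementation):

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatches between two equal-length UMIs."""
    return sum(x != y for x, y in zip(a, b))

def directional_groups(umi_counts):
    """Toy directional network: link UMI A -> B when they differ by one
    base and count(A) >= 2*count(B) - 1. Every pair of UMIs at a position
    must be compared, so with ~480k unique UMIs at one position the
    pairwise step dominates runtime and memory."""
    umis = sorted(umi_counts, key=umi_counts.get, reverse=True)
    parent = {}
    for i, a in enumerate(umis):
        for b in umis[i + 1:]:
            if (b not in parent and hamming(a, b) == 1
                    and umi_counts[a] >= 2 * umi_counts[b] - 1):
                parent[b] = a
    # roots of the network are the UMIs kept after deduplication
    return [u for u in umis if u not in parent]

counts = Counter({"ATTG": 10, "ATTA": 5, "TTTG": 1, "GGCC": 3})
print(directional_groups(counts))  # → ['ATTG', 'GGCC']
```

Here ATTA and TTTG are absorbed into the more abundant ATTG, while GGCC is too distant and survives as its own group.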
I am having the same problem with 6 of my 76 samples. All the others ran fine, but these 6 do not finish even after 16 hours. I also tried changing to --method=unique.
The output of the 6 samples that didn't work looks like this:
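For reference, --method=unique skips the UMI network entirely and keeps one read per exact (position, UMI) combination, so its cost is linear in the number of reads; if even that does not finish, the sheer input size is the more likely bottleneck. A minimal sketch of what unique does, under my understanding of the method:

```python
def dedup_unique(reads):
    """Sketch of --method=unique: keep the first read seen for each
    exact (position, UMI) pair; no UMI network is built, so runtime
    does not blow up with UMI diversity at a position."""
    seen = set()
    kept = []
    for pos, umi, name in reads:
        if (pos, umi) not in seen:
            seen.add((pos, umi))
            kept.append(name)
    return kept

reads = [
    (100, "AACG", "r1"),
    (100, "AACG", "r2"),   # duplicate of r1: same position and UMI
    (100, "AACT", "r3"),   # different UMI at same position: kept
    (205, "AACG", "r4"),   # same UMI at a different position: kept
]
print(dedup_unique(reads))  # → ['r1', 'r3', 'r4']
```

Note that unique treats every one-base UMI difference as a real molecule, so it keeps more reads than directional.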
Please don't post screenshots of text content; they are difficult to read. Copy and paste the relevant parts of the log output and format it as code using the 101010 button in the edit window.
Your sample that worked finished after reading just over 4 million reads. The one that didn't finish was still reading reads in after 42 million. It's not really surprising that it is taking longer.