4.5 years ago • md.solayman • 0
Hi, after getting a SAM file from Bowtie2, I converted it to a sorted BAM and then indexed it according to the UMI-tools documentation. Everything seems fine, but when I run umi_tools dedup with the following command, nothing is written to the output file. I removed the '--output-stats' option to make it faster.
command: umi_tools dedup -I example.bam -S deduplicated.bam
However, before running my final dataset, I checked with a sample dataset, and in that case it worked. Do you have any idea, please? Thanks.
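For reference, this is roughly the prep-and-dedup pipeline being described, with a log file added so a failed or killed run is easy to diagnose (file names are placeholders; the samtools steps and the dedup flags are the standard ones from their respective docs):

```shell
# Coordinate-sort and index the Bowtie2 output.
samtools sort -o example.sorted.bam example.sam
samtools index example.sorted.bam

# Deduplicate; -L saves a log, which should end with a "job finished" line.
umi_tools dedup -I example.sorted.bam -S deduplicated.bam -L dedup.log
tail -n 3 dedup.log
```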
Have you saved the log output anywhere? Either with
-L logfile.log
or by redirection? You need to know if dedup finished successfully. The log should end with a "Job finished" message.
Hi, I haven't saved the logfile. However, you are right: my job was killed because I requested too few resources (select=1:ncpus=2:mem=250gb,walltime=50:00:00). I have now allowed 160 hours, but it still seems the resources may not be sufficient, though I could add more walltime. My UMI size is 18 nt and the total input is 103,861,189 reads (single-end). I have access to HPC. What resources should I request, any idea, please?
Thank you very much for your reply.
So the 16S is only about 1.5kb long, and you've sequenced over 100M reads. If coverage is even, that means you will have roughly 74,000 reads at every position. This means either that you have a very large number of UMIs, or a very high duplication rate. It would be important to know which: if you really have 74,000 different UMIs at every position, it's probably hopeless.
We've only benchmarked as far as 16,000 UMIs at a single position, and at that level it takes around 30 minutes to compute. You have 1,350 positions!
It's unlikely that you actually have 74,000 UMIs at every position, however. If you look in the output of extract, it should list every UMI identified. You can get a rough estimate of the size of the problem by dividing the number of UMIs by the length of the RNA you are sequencing. Our benchmarks suggest:
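As a quick way to check directly, this sketch counts distinct UMIs per mapping position. It assumes reads were processed with umi_tools extract, which appends the UMI to the read name after an underscore; the printf toy input stands in for what `samtools view example.sorted.bam | cut -f1,4` (read name and position) would produce on real data:

```shell
# Toy stand-in for: samtools view example.sorted.bam | cut -f1,4
# (read name with the UMI after the last "_", then mapping position).
printf '%s\n' 'r1_AACGT 10' 'r2_AACGT 10' 'r3_TTGCA 10' 'r4_AACGT 200' \
  | awk '{n = split($1, a, "_"); print $2 "\t" a[n]}' \
  | sort -u \
  | cut -f1 | sort -n | uniq -c
# position 10 has 2 distinct UMIs, position 200 has 1
```

If the counts at the deepest positions are in the tens of thousands, you are in the slow regime described above.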
I'm not sure what to suggest if you are in a regime where this sort of analysis isn't really feasible. You could divide up your BAM file by position (say, 200bp blocks) and run each block in a separate process.
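A dry-run sketch of that split, assuming a ~1.5 kb reference contig here called rrn16S (the name is a placeholder) and a coordinate-sorted, indexed BAM; it prints the samtools slicing commands rather than running them:

```shell
REF=rrn16S   # placeholder: use the contig name from your reference
for start in $(seq 1 200 1500); do
  end=$((start + 199))
  # Print the command for each 200bp block; drop "echo" to actually run it.
  echo samtools view -b example.sorted.bam "${REF}:${start}-${end}" -o "block_${start}.bam"
done
```

Each block_*.bam could then be deduplicated as its own HPC job and the outputs merged afterwards with samtools merge.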
One possibility, though it would seem a shame since you have a good UMI length, would be to disable error correction during deduplication with
--method=unique
. You will find this much faster, but the results won't be as good.
Alternatively, Daniel Liu's UMICollapse (https://github.com/Daniel-Liu-c0deb0t/UMICollapse) is much faster than umi_tools. It is less flexible, but it should work for your particular analysis.
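For what it's worth, UMICollapse is a Java tool; going from its README, the invocation is roughly of this shape (the jar path and file names are placeholders — check the repository for the current options):

```shell
# Deduplicate a sorted, indexed BAM whose read names carry the UMI.
java -jar umicollapse.jar bam -i example.sorted.bam -o deduplicated.bam
```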
Thank you very much. The UMICollapse worked for me. I am really grateful to you.
The level of resource usage depends more on the level of duplication than it does on the total number of reads. Memory usage in particular depends on the maximum number of UMIs at any one position. What sort of data is this? Is it scRNA-seq? Amplicon sequencing? Exon sequencing?
My sample was E. coli ribosome. I used the Illumina HiSeq platform for sequencing. Thanks.
Oops! Again my job has been killed, this time after running 107 hours, even though my walltime was 160 hours. In the error file I found the following line: 5457450.pbsserver.SC: line 15: 12808 Killed