Question

DIAMOND alignment processing time

3.2 years ago

Dee • 0

Good day,

I have paired-end Illumina short reads (2 × 150 bp). Because merging the pairs did not give good results, I concatenated the two files (2.3 GB each) and aligned them with DIAMOND blastx on an HPC node with 88 threads and 270 GB RAM, adding -b 8 -c 1 to the command. It has been running for almost 24 hours now. How long should the alignment take to finish? Are my files perhaps too big? Is there an alternative or a workaround?

Thank you!

DIAMOND HPC • 2.8k views

ADD COMMENT • link updated 2.7 years ago by valentina ▴ 60 • written 3.2 years ago by Dee • 0

Depends on the database you are aligning to. It sounds like you may have tens of millions of reads, if not hundreds of millions. Now, let's say that you can align 100 reads per second, which is debatable if your database is large. Even 10 million reads would then take 100,000 seconds, which is ~28 hours. If you scale that to the actual number of reads, you should get a pretty good idea of how long this would take. It sounds to me like you are looking at several days to a week.
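To make the scaling concrete, here is a minimal Python sketch of that arithmetic. The 100 reads per second figure is the rough guess from the paragraph above, not a measured DIAMOND throughput, and `estimate_hours` is just an illustrative helper name.

```python
def estimate_hours(n_reads: int, reads_per_second: float) -> float:
    """Wall-clock hours needed to align n_reads at a constant throughput."""
    if reads_per_second <= 0:
        raise ValueError("reads_per_second must be positive")
    return n_reads / reads_per_second / 3600


if __name__ == "__main__":
    # 100 reads/s is an assumed throughput; replace it with a measured value.
    for n_reads in (10_000_000, 50_000_000, 100_000_000):
        hours = estimate_hours(n_reads, reads_per_second=100)
        print(f"{n_reads:>11,} reads: {hours:7.1f} h ({hours / 24:5.1f} days)")
```

At this assumed rate, 10 million reads come out at about 28 hours and 100 million at roughly 11 to 12 days, which is where the "several days to a week" range comes from.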

What kind of sequencing have you done? What is the purpose of this alignment? Why don't you assemble the reads first and then align the contigs instead?

ADD REPLY • link 3.2 years ago by Mensur Dlakic ★ 28k

Hello Mensur Dlakic!

I've created a diamond database from NCBI nr. The reads are shotgun metagenomes. I'm planning to use the output file (.daa) as input to MEGAN for microbial community analysis.

How long do you think it would take if I used contigs for the alignment?

Thank you!

ADD REPLY • link 3.2 years ago by Dee • 0

Now that I know you are searching against a non-redundant database, I would adjust my original estimate to something like a month. It should take significantly less for contigs, but probably still a long time. I don't think it is a productive task to search raw reads against the nt database, or in general to use MEGAN for this purpose any more. There are other tools you may want to consider:

After the assembly:

https://github.com/Ecogenomics/GTDBTk

If you still want to continue doing it the way you started, you can try to estimate the time requirement on your own. Pick a reasonably large number of reads, say 0.5-1 million, and run your analysis on it. Then multiply it by the actual number of reads relative to this fraction, and you will get your estimate. I'd be very surprised if you weren't looking at weeks. Also, not sure how (and whether) MEGAN will handle the results from tens (hundreds?) of millions of sequences.
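The subsample-and-extrapolate idea could be sketched in Python roughly as follows. This is only a sketch under stated assumptions: the input is taken to be plain four-line FASTQ records, the helper names (`fastq_records`, `reservoir_sample`, `extrapolate_seconds`) are made up for illustration, and the extrapolation assumes runtime scales linearly with read count. Fixed costs such as loading the database mean a small sample may overestimate the per-read time.

```python
import random
from typing import Iterable, Iterator, List


def fastq_records(lines: Iterable[str]) -> Iterator[List[str]]:
    """Group lines into 4-line FASTQ records (assumes unwrapped FASTQ)."""
    record: List[str] = []
    for line in lines:
        record.append(line)
        if len(record) == 4:
            yield record
            record = []


def reservoir_sample(records: Iterable[List[str]], k: int, seed: int = 0) -> List[List[str]]:
    """Pick k records uniformly at random in one pass, without knowing the total."""
    rng = random.Random(seed)
    sample: List[List[str]] = []
    for i, rec in enumerate(records):
        if i < k:
            sample.append(rec)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = rec
    return sample


def extrapolate_seconds(sample_reads: int, sample_seconds: float, total_reads: int) -> float:
    """Scale a measured sample runtime to the full data set (linear assumption)."""
    if sample_reads <= 0:
        raise ValueError("sample_reads must be positive")
    return sample_seconds * total_reads / sample_reads


# Hypothetical usage: reads.fastq and the timing numbers are placeholders.
# with open("reads.fastq") as fh:
#     subset = reservoir_sample(fastq_records(fh), k=1_000_000)
# with open("subset.fastq", "w") as out:
#     out.writelines(line for rec in subset for line in rec)
# ...run the alignment on subset.fastq, time it, then:
# extrapolate_seconds(1_000_000, measured_seconds, total_reads)
```

Reservoir sampling is used here so the reads are drawn from the whole file rather than just the first million, which may not be representative.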

ADD REPLY • link 3.2 years ago by Mensur Dlakic ★ 28k

Hello Mensur, many thanks for your answer, it was very clear. I also used DIAMOND against nr followed by MEGAN, and ran into the same problem. However, I chose DIAMOND/MEGAN not only to identify taxa but also to do the protein function classification. Can you suggest a faster alternative method for the functional analysis? Thanks in advance!

ADD REPLY • link 2.7 years ago by valentina ▴ 60

Just to be sure: you are searching against the nt database, not nr?

ADD REPLY • link 3.2 years ago by Mensur Dlakic ★ 28k

Hello Mensur Dlakic!

I've mentioned in my comment that the database is nr. I'll also check and consider the other tools that you've mentioned.

Thank you very much for the information!

ADD REPLY • link 3.2 years ago by Dee • 0

Don't remember if MEGAN for some reason requires nr, but searching DNA sequences against the protein database may not give you optimal results when it comes to taxonomic classification. Many DNA sequences are clearly different when you compare them as such, but less so when translated into proteins.

ADD REPLY • link 3.2 years ago by Mensur Dlakic ★ 28k