Question

PacBio sequences, proovread correction with Illumina HiSeq reads

0

Entering edit mode

8.9 years ago

jaimejr18 • 0

Hi

I want to correct PacBio sequences with Illumina Hiseq data using proovread tool, I can use 2x60 core machine with 3TB of Memory or another one with 16x24 with 512Gb.

Developers advise to chunk the smrt cell data into small files and use them to correct the PacBio data. I did and have 475 files with 50Mb each. I used proovread with one of the files and 50x coverage and took 12 hours in the large memory machine. But to correct all the files I have to run one by one and it's going to take very long time.

Does anyone use this tool? could I run more than one PacBio file in each run to improve the total time?

Thanks

proovread hybrid-correction PacBio-correction • 3.7k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 8.9 years ago by jaimejr18 • 0

1

Entering edit mode

I don't know about the tool you're using, but if you have those resources available, why not do the samples in parallel? What was the resource usage like when you tested it on one sample?

ADD REPLY • link 8.9 years ago by andrew.j.skelton73 6.6k

0

Entering edit mode

Not sure, is why I ask if someone have used this tool, because I'm new in this kind of tools and I'm not sure how to run it.

I have restricted access to the machines, time and core restrictions, if I use 60 cores I can only use it for 12 hours and I can't run another job until this one finish, so 12hours x 475= 237 days...

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.9 years ago by jaimejr18 • 0

0

Entering edit mode

Crossposted to SeqAnswers http://seqanswers.com/forums/showthread.php?t=65379

ADD REPLY • link 8.9 years ago by User 59 13k

0

Entering edit mode

yes it the same, I did it, should I retry one of them?

ADD REPLY • link 8.9 years ago by jaimejr18 • 0

Ram · Answer 1 · 2016-01-09

Hi Jaime,

I had a look at the log you sent me (parsed according to grep command from https://github.com/BioInf-Wuerzburg/proovread#log-and-statistics)

[Mon Nov 30 17:38:30 2015] Running mode: sr
[Mon Nov 30 17:56:08 2015] Running task bwa-sr-1
[Wed Dec  2 17:20:54 2015] Masked : 81.1%
[Wed Dec  2 17:20:54 2015] Running task bwa-sr-2
[Fri Dec  4 00:23:09 2015] Masked : 89.1%
[Fri Dec  4 00:23:09 2015] Running task bwa-sr-3
[Sat Dec  5 12:03:45 2015] Masked : 91.2%
[Sat Dec  5 12:03:46 2015] Running task bwa-sr-finish
[Sun Dec  6 05:54:53 2015] Masked : 89.2%

Proovread ran 3 correction iterations (, which successively improve reads quality, with high quality corrected parts being "Masked") followed by the finish correction, which is mostly for polishing.

Your stats look pretty good. Getting up to 81.1% in the first iteration is great. It means that you get more than 81% percent of your data corrected right away. The second iteration gets you to 89.1, The third only to 91.2%. The default cutoff for proovread to stop iterating and start the polish step is either 92% masked or less than 3% gained compared to the previous iteration. The 92% are quite ambitious, in particular for large genome projects. In your case, the 3rd iteration also does not really get you anything but still takes 12h to run.

Therefore, my suggestion for your setup would be:

Don't aim for 92%, but rather something like 85%. This will save a lot of time (only two iterations) without loosing a noteworthy portion of your data. You can set this via a custom config. Put the following line in a file (my-proovread.cfg)

'mask-shortcut-frac' => 0.85

and call

proovread -c /path/to/my-proovread.cfg -l .. -s ..

If you compare your results to runs with lower illumina coverage (30x or 40x), you should aim for the same thing - >85% after the second iteration. If you can get that from 30x or 40x, then using lower coverages would increase runtime even further, if not, stick with 50x.

As for chunk size, I would try to optimize for maximum possible for you queue with respect to runtime limit and memory. Your queue has 24 core nodes with sufficient memory and a 144h per job limit. Given runtime of 24h for your test runs, you should be able to increase chunk size at least by a factor of 6 (144h/24h). On top, larger chunks will run a bit faster anyway and you will save time with the lowered mask-shortcut-frac cutoff. So my guess is that you should be able to run at least 500MB chunks on your normal_q (or also entire SMRT cells..)

Let me know, how it goes.

Ram · Answer 2 · 2016-02-08

0

Entering edit mode

8.8 years ago

jaimejr18 • 0

Hi Thomas,

Sorry for the delay, finally I correct all the sequences, here I print some info about, following your instructions I changed the config.cfg file to 0.85 (files 1 to 10), but still some of them went to bsa-sr step 4th even 5th consuming lot of time. Files 1 and 4 were tested to different coverage and 50x were the best in time and final corrected %. To save some time I modify config.cfg to 0.8 or less 5% gained to stop on files 11 to 40. I think those parameters are good to correct such amount of data, but I hope to hear from you soon, maybe with some improvements because those are only 1/4 of the total reads. The process takes a total of 27-30 days. Thanks!

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by jaimejr18 • 0

0

Entering edit mode

Yeah, I think reducing to 80%/5%-new masked makes sense for your data. I don't really have any further suggestions regarding improvements. Did you run the jobs on your "normal_q"? With 30 nodes, 3 days per chunk and 120 remaining chunk, it should take 10 days to complete correction, unless your queue is jammed...

ADD REPLY • link 8.8 years ago by thackl ★ 3.0k

0

Entering edit mode

yes, finally I used the "normal_q" with 24 threads per job, because I can run 6 jobs at the same time with enough wall time to complete. At the "largemem_q" I didn't allow to have wall time enough to finish the job. The total of chunks was 48 / 6 jobs per time =8 x 3days=24 days in total approximately, some of them fail.

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by jaimejr18 • 0