Parallel processing for 10000 whole-exome samples
23 months ago
alwayshope ▴ 40

Dear all,

May I know the recommended solutions for the massive big sample parallel process? Open MPI...?

Thanks a million!


What are you analyzing? What is your hardware? What is your pipeline? There are hundreds of possible answers to this.


Thanks a lot! I'm trying to analyze a large WES dataset (10000 samples or more) on a supercomputer, using a very standard WES pipeline. Thank you very much!


for the massive big sample parallel process

What does that mean?


Thanks! I'm trying to analyze a large WES dataset (10000 samples or more).


I'm sorry but "analyze" doesn't mean much more than "process".


Sure, thanks a lot!

23 months ago
ATpoint 85k

Use a dedicated workflow manager, such as Nextflow or Snakemake. They parallelize over jobs along the way and integrate well with containerization solutions such as Docker and Singularity. There are existing workflows for WES data, such as nf-core sarek. May I ask whether you produced these samples yourself, or whether you plan to download and reanalyze existing data? 10k samples are going to require extensive storage and computing resources.
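To illustrate the idea, here is a minimal Snakemake sketch of a per-sample workflow (not a production pipeline; the sample sheet, file layout, reference paths and bwa/GATK commands are assumptions, and a real WES workflow like nf-core sarek adds duplicate marking, BQSR, QC and joint genotyping). The point is that each sample becomes independent jobs the scheduler can run in parallel:

```python
# Minimal Snakefile sketch: one alignment job and one calling job per sample.
# With a cluster executor, thousands of these jobs run in parallel on the HPC.
SAMPLES = [line.strip() for line in open("samples.txt")]  # assumed: one sample ID per line

rule all:
    input:
        expand("vcf/{sample}.g.vcf.gz", sample=SAMPLES)

rule align:
    input:
        r1="fastq/{sample}_R1.fastq.gz",
        r2="fastq/{sample}_R2.fastq.gz",
        ref="ref/GRCh38.fa"                     # assumed pre-indexed (bwa index, .fai, .dict)
    output:
        bam="bam/{sample}.sorted.bam",
        bai="bam/{sample}.sorted.bam.bai"
    threads: 8
    shell:
        "bwa mem -t {threads} "
        "-R '@RG\\tID:{wildcards.sample}\\tSM:{wildcards.sample}' "
        "{input.ref} {input.r1} {input.r2} "
        "| samtools sort -@ {threads} -o {output.bam} - "
        "&& samtools index {output.bam}"

rule call:
    input:
        bam="bam/{sample}.sorted.bam",
        bai="bam/{sample}.sorted.bam.bai",
        ref="ref/GRCh38.fa"
    output:
        "vcf/{sample}.g.vcf.gz"
    threads: 4
    shell:
        "gatk HaplotypeCaller -R {input.ref} -I {input.bam} -O {output} -ERC GVCF"
```

Run with a cluster profile and the workflow manager submits each of these jobs to the scheduler rather than running them locally; Nextflow works the same way with its executors.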


Thanks a lot for your guidance!

I'm trying to get the data from the UK Biobank (e.g. https://biobank.ndph.ox.ac.uk/ukb/field.cgi?id=23153) and do some ML model training before moving to our in-house data (not yet generated). Yeah, the CRAM files alone for 10000 WES samples would take nearly 20 TB of storage plus huge computing power, and computing efficiency (the core-usage strategy) is also quite important.
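Just as a back-of-envelope check (the per-CRAM size below is an assumption; actual UKB exome CRAM sizes vary per sample):

```python
# Rough storage estimate for the CRAMs alone (per-file size is an assumption).
n_samples = 10_000
gb_per_cram = 2.0                            # assumed average size of one exome CRAM
total_tb = n_samples * gb_per_cram / 1_000
print(f"~{total_tb:.0f} TB of CRAM files")   # ~20 TB, in line with the estimate above
```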


I wonder whether obtaining VCF files might not do the trick a million times more efficiently (given they provide them; I guess they do?). In any case, the mentioned workflow managers will offer you full flexibility. You can define processes (alignment, filtering, whatever) and assign each one whatever resources are optimal, and the workflow manager will then take care of the parallelization across jobs (along the DAG), maxing out the infrastructure you give it. The workflow managers also have caching options to resume the pipeline in case of failures along the way; all of these are critical features for very large jobs.
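For example (a hedged Snakemake fragment; the tool, file names and numbers are placeholders), per-process resources are declared directly on the rule, a cluster executor turns each job into one scheduler submission, and re-running the workflow after a crash only executes jobs whose outputs are still missing (Nextflow offers the equivalent via its -resume flag):

```python
# Illustrative Snakefile fragment: per-rule resource requests (values are assumptions).
# A cluster executor submits one scheduler job per sample for this rule, and a
# re-run after a failure only executes the jobs whose outputs do not exist yet.
rule mark_duplicates:
    input:
        "bam/{sample}.sorted.bam"
    output:
        bam="bam/{sample}.md.bam",
        metrics="qc/{sample}.dup_metrics.txt"
    threads: 2
    resources:
        mem_mb=8000,     # assumed; tune to your nodes
        runtime=180      # minutes, assumed
    shell:
        "gatk MarkDuplicates -I {input} -O {output.bam} -M {output.metrics}"
```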


I'm almost 100% certain that preprocessed results in some form (like a VCF) already exist for the UKBB.

Unless your project is specifically about improving the processing of raw sequence data to variant calls, I would seriously consider using these. You will save yourself months of HPC time.

I'd guess each sample would take of the order of multiple CPU days, so you are probably looking at tens of CPU years in total. Even with 500 cores working at 100% efficiency, that's a month or more of wall time to produce something that already exists.
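Rough numbers behind that guess (the per-sample cost is an assumption):

```python
# Back-of-envelope CPU-time estimate (per-sample cost is an assumed guess).
n_samples = 10_000
cpu_days_per_sample = 2        # "multiple CPU days" per exome, assumed
cores = 500

total_cpu_days = n_samples * cpu_days_per_sample
print(f"~{total_cpu_days / 365:.0f} CPU-years in total")                          # ~55 CPU-years
print(f"~{total_cpu_days / cores:.0f} days of wall-clock time on {cores} cores")  # ~40 days
```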


VCFs for the ~500k UKBB participants are available: https://biobank.ndph.ox.ac.uk/showcase/field.cgi?id=23141 ;)

