Has Anyone Got Plink Working With Gnu Parallel?
13.1 years ago
jvijai ★ 1.2k

I usually use a script (run.sh) to run my PLINK analysis (sort of like a makefile).
Has anyone managed to run PLINK with GNU Parallel? If so, how do I use GNU Parallel with a script?
I tried the following, but I cannot see 20 cores being used (in top).

>parallel -j 20 --progress  | ./run.sh

Any help appreciated. Thanks

Edit: Yes, I have 40 cores. Normally I would run it like this:

Using the file MyCovarfile.raw (structure shown at the bottom) for Analysis of Pheno1 and covars Age, PC1, PC2, PC3.

plink --bfile myfile \
 --pheno MyCovarfile.raw \
 --pheno-name Pheno1 \
 --covar MyCovarfile.raw \
 --covar-name Age-PC3 \
 --logistic \
 --adjust \
 --qq-plot \
 --out Pheno1_Age_PC3

Now, I can do the same with many different covariate models (my run.sh is a list of such commands with different combinations of covariates and phenotypes), but right now they get executed serially, one after the other, on a single core.

Using the file MyCovarfile.raw (structure shown at the bottom) for Analysis of Pheno2 and covars Age, PC1, PC2, PC3.

plink --bfile myfile \
 --pheno MyCovarfile.raw \
 --pheno-name Pheno2 \
 --covar MyCovarfile.raw \
 --covar-name Age-PC3 \
 --logistic \
 --adjust \
 --qq-plot \
 --out Pheno2_Age_PC3

Using the file MyCovarfile.raw (structure shown at the bottom) for Analysis of Pheno1 and covars Age, PC1, PC2, PC3, PC4.

plink --bfile myfile \
 --pheno MyCovarfile.raw \
 --pheno-name Pheno1 \
 --covar MyCovarfile.raw \
 --covar-name Age-PC4 \
 --logistic \
 --adjust \
 --qq-plot \
 --out Pheno1_Age_PC4

Using the file MyCovarfile.raw (structure shown at the bottom) for Analysis of Pheno2 and covars Age, PC1, PC2, PC3, PC4.

plink --bfile myfile \
 --pheno MyCovarfile.raw \
 --pheno-name Pheno2 \
 --covar MyCovarfile.raw \
 --covar-name Age-PC4 \
 --logistic \
 --adjust \
 --qq-plot \
 --out Pheno2_Age_PC4

Hope this helps. http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml#covar

Structure of MyCovarfile.raw

FID    IID    AFF    Pheno1    Pheno2    Pheno3    Pheno4    Pheno5    Pheno6    Pheno7    Bin    AGE    PC1    PC2    PC3    PC4  
0001    9542    1    1    1    1    1    1    1    1    1    8    -0.0053    -0.0046    0.0036    -0.0052  
0002    9606    1    1    1    1    1    1    1    1    1    3    -0.0052    -0.0045    0.0035    -0.0021  
0003    9702    2    1    1    1    1    1    1    1    1    3    -0.0045    -0.0041    0.0032    0.0016  
0004    9544    2    1    1    1    1    1    1    1    1    5    -0.0037    -0.0028    0.0032    0.0003

where FID, IID and AFF mean family ID, individual ID and affection status of the individual.

plink parallel scripting • 7.9k views

let's start with the obvious - do you have 20 cores in your machine?


I don't know how to normally run PLINK on 2 sets of data. Please show how you would run PLINK on two sets of data in serial (if you cannot do that you most likely cannot use GNU Parallel). Also please state whether you have watched the intro videos:


Hi Chris, Yes, I have 40 cores.

Ole Tange, yes, I watched the intro videos and tried reading the manual too.

Normally I would run it like this:

plink --bfile myfile \
      --pheno MyCovarfile.raw \
      --pheno-name Pheno1 \
      --covar MyCovarfile.raw \
      --covar-name Age-PC3 \
      --logistic \
      --adjust \
      --qq-plot \
      --out Pheno1_Age_PC3

Now, I can do the same with many different covariate models, but right now they get executed serially, one after the other, on a single core.

Hope this helps.


Yes, I have 40 cores. Yes, I watched the video intro and read a bit of the manual without really understanding it.


There is a really nice intro to installation of GNU Parallel here.

13.1 years ago

What is the content of your run.sh script?

To use GNU Parallel effectively, you need to write a script that launches a single job (e.g. a single plink run) and then call it through parallel.
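As a rough sketch of such a single-job script, assuming the phenotype column and the last covariate column are passed as two positional arguments (the argument layout, the `run_job` function and the `DRY_RUN` switch are illustrative assumptions, not something taken from the question), run_single_plink_job.sh could look like this:

```shell
#!/usr/bin/env bash
# Sketch of run_single_plink_job.sh: exactly one plink run per invocation.
# Assumed usage: run_single_plink_job.sh <pheno-column> <last-covariate-column>

run_job() {
    pheno="$1"
    last_pc="$2"
    # If DRY_RUN is set to a non-empty value (hypothetical switch), the
    # command is printed by echo instead of being executed.
    ${DRY_RUN:+echo} plink --bfile myfile \
        --pheno MyCovarfile.raw --pheno-name "$pheno" \
        --covar MyCovarfile.raw --covar-name "Age-$last_pc" \
        --logistic --adjust --qq-plot \
        --out "${pheno}_Age_${last_pc}"
}

# Run only when both arguments were supplied on the command line.
if [ "$#" -eq 2 ]; then
    run_job "$1" "$2"
fi
```

It could then be driven with something like `parallel -j 20 ./run_single_plink_job.sh {1} {2} ::: Pheno1 Pheno2 ::: PC3 PC4`; exporting `DRY_RUN=1` beforehand prints the plink commands instead of running them.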

Example:

$: seq 1 20 | parallel -j 20 --progress run_single_plink_job.sh

# you can use {} to differentiate each call
$: seq 1 20 | parallel -j 20 --progress run_single_plink_job.sh --job_id {}
$: seq 1 22 | parallel -j 20 --progress run_single_plink_job.sh --chromosome {}

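seq 1 20 simply prints the numbers 1 to 20, one per line, and parallel starts one job per input line, so it is a common shortcut for running run_single_plink_job.sh twenty times. For each of these calls, the replacement string {} takes a different value from 1 to 20; if the command contains no {}, parallel appends the value to the end of the command. Example: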

$: seq 1 20 | parallel "echo This is job {}"
This is job 1
This is job 2
This is job 3
This is job 4
...
This is job 20

Alternatively, you can provide arguments to parallel by putting them at the end of the command, separated by a ":::"

$: parallel "echo This is job {}" ::: {1..20}
This is job 1
This is job 2
This is job 3
This is job 4
....
This is job 20

or

$: parallel "echo Hello {}" ::: Marc Caius Julius Caesar
Hello Marc
Hello Caius
Hello Julius
Hello Caesar

If your run.sh script contains a series of commands that can be executed in parallel, each line being a single job, you can just cat it and pipe it to parallel:

$: cat run.sh | parallel -j 20 --progress
$: paste run.sh | parallel -j 20 --progress

Example:

$: cat >parallel_job.sh
sleep 2; echo "job1"
sleep 1; echo "job2"
echo "job3"

$: cat parallel_job.sh | parallel -u -j 3
job3
job2
job1

# alternatively, use paste
$: paste parallel_job.sh | parallel -u -j 3
job3
job2
job1
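Note that this approach treats every input line as one job, so a run.sh whose commands are split over several lines with trailing backslashes would probably need to be flattened first. A minimal sketch that prints one complete plink command per line (the `make_jobs` helper and the 2 x 2 choice of phenotypes and covariate ranges are only for illustration):

```shell
#!/usr/bin/env bash
# Print one complete plink command per line, one line per combination.
make_jobs() {
    for pheno in Pheno1 Pheno2; do
        for last_pc in PC3 PC4; do
            echo plink --bfile myfile \
                --pheno MyCovarfile.raw --pheno-name "$pheno" \
                --covar MyCovarfile.raw --covar-name "Age-$last_pc" \
                --logistic --adjust --qq-plot \
                --out "${pheno}_Age_${last_pc}"
        done
    done
}

# Show the generated job list; to execute it, pipe it to parallel instead:
#   make_jobs | parallel -j 20 --progress
make_jobs
```

Because each printed line is self-contained, the output can be saved to a file and inspected before it is handed to parallel.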

I suggest using htop to monitor the execution of the jobs. Also, remember that if you don't use the -u option, you won't see a job's output until that job has finished.

Note: be sure that you are using GNU Parallel and not the parallel from moreutils, or the syntax will be different.
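A quick way to check which one is installed is to look at the version banner. The sketch below relies on GNU Parallel printing "GNU parallel" in its `--version` output; that the moreutils tool does not print this string is an assumption, and `is_gnu_parallel` is just a name chosen here:

```shell
#!/usr/bin/env bash
# Return 0 only if the first 'parallel' on PATH identifies itself as GNU Parallel.
is_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 || return 1
    parallel --version 2>/dev/null </dev/null | grep -q 'GNU parallel'
}

if is_gnu_parallel; then
    echo "GNU Parallel found"
else
    echo "GNU Parallel not found (missing, or a different 'parallel')"
fi
```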


excellent intro on using parallel, thanks!


seq 1 20 simply prints the numbers from 1 to 20. It is a common shortcut to tell parallel that it must execute the command 20 times. I have updated the answer.


Thanks Giovanni, I will try this. Can you explain what the "seq 1 20" does in this instance?

13.1 years ago
tange ▴ 190

Let us assume you want to run this:

plink --bfile myfile \
--pheno MyCovarfile.raw \
--pheno-name Pheno1 \
--covar MyCovarfile.raw \
--covar-name Age-PC3 \
--logistic \
--adjust \
--qq-plot \
--out Pheno1_Age_PC3

But instead of Pheno1 you want Pheno1, Pheno2, Pheno3, Pheno4, Pheno5, Pheno6, Pheno7 and instead of PC3 you want PC1, PC2, PC3, PC4. And you want all combinations of those:

parallel plink --bfile myfile \
--pheno MyCovarfile.raw \
--pheno-name {1} \
--covar MyCovarfile.raw \
--covar-name Age-{2} \
--logistic \
--adjust \
--qq-plot \
--out {1}_Age_{2} \
::: Pheno1 Pheno2 Pheno3 Pheno4 Pheno5 Pheno6 Pheno7 \
::: PC1 PC2 PC3 PC4

Try prepending echo to plink to see whether this will execute what you want.


Hi Tange, for the most part that is correct. However, as shown in the link to the PLINK site, the pheno-name and covar-name are columns within MyCovarfile.raw. The analysis picks the column specified as the pheno values, so you can have columns such as Pheno1, Pheno2, Pheno3 and so on in the same file (not in separate files). Yes, I think I can split those columns into individual files and run the analysis you suggested. But the general feeling I get is that this is not a parallelizable job in the way PLINK treats it.


Please give us 3 examples instead of describing those. You are making it much harder to give you a useful answer.


Added more examples. The Covarfile structure is important for understanding how PLINK picks the column for analysis. If GNU Parallel can help with that part, it would be quite helpful. If this works, I have another question about IMPUTE2 that would help many, many researchers I know!


I successfully used gtools to create a gens file from a ped file using GNU Parallel! :)


That worked perfectly! Thank you. Nice to know how parallel can take columns in a file as parameters. I have a related question: if the arguments such as Pheno1 Pheno2 ... Pheno7 are on separate lines in a file, what would the parallel command look like?
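One option, as far as I know, is GNU Parallel's `::::` separator (or `-a file`), which reads the values from a file, one per line, and can be mixed with `:::`. A sketch assuming a hypothetical file phenos.txt that holds Pheno1 to Pheno7 on separate lines (written to a temporary location here), with echo prepended so that nothing is actually executed:

```shell
#!/usr/bin/env bash
# Hypothetical input file: one phenotype column name per line.
phenos_file="${TMPDIR:-/tmp}/phenos.txt"
printf '%s\n' Pheno1 Pheno2 Pheno3 Pheno4 Pheno5 Pheno6 Pheno7 > "$phenos_file"

# Preview the jobs only if GNU Parallel is installed.
# Remove "echo" to run plink for real.
if parallel --version 2>/dev/null </dev/null | grep -q 'GNU parallel'; then
    parallel echo plink --bfile myfile \
        --pheno MyCovarfile.raw --pheno-name {1} \
        --covar MyCovarfile.raw --covar-name Age-{2} \
        --logistic --adjust --qq-plot \
        --out {1}_Age_{2} \
        :::: "$phenos_file" \
        ::: PC1 PC2 PC3 PC4
fi
```

Here {1} takes its values from the lines of the file and {2} from the list after `:::`, so every phenotype is paired with every covariate range.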
