How to simulate phenotypes from real genetic data for GWAS purposes?
4.6 years ago
b.ambrozio ▴ 30

I'm trying to simulate binary phenotypes from the 1000 Genomes Phase 3 dataset using gcta64 --simu-cc, but without success. Everything seems to go well, but at the end I get:

Error: can not open the file [] to read.
An error occurs, please check the options or data

And the log shows:

Accepted options:

Here are the commands I'm using:

# Convert the VCF to plink format:
$ ./plink2 --vcf ../../ALL.phase3.biallelic-only.vcf.gz.10kSNPs.vcf.gz --make-bed --out ALL.phase3.biallelic-only.vcf.gz.10kSNPs

# Try to simulate the phenotype:
$ ./gcta64 --bfile ALL.phase3.biallelic-only.vcf.gz.10kSNPs --simu-cc 500 500 --simu-hsq 0.5 --simu-k 0.1 --simu-rep 3 --out ALL.phase3.biallelic-only.vcf.gz.10kSNPs

Here are all the steps with their outputs:

$ ls
gcta64  plink2

$ ./plink2 --vcf ../../ALL.phase3.biallelic-only.vcf.gz.10kSNPs.vcf.gz --make-bed --out ALL.phase3.biallelic-only.vcf.gz.10kSNPs
PLINK v2.00a2.3 64-bit (24 Jan 2020)           www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.phase3.biallelic-only.vcf.gz.10kSNPs.log.
Options in effect:
  --make-bed
  --out ALL.phase3.biallelic-only.vcf.gz.10kSNPs
  --vcf ../../ALL.phase3.biallelic-only.vcf.gz.10kSNPs.vcf.gz

Start time: Sun Mar 22 11:56:23 2020
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 220000 variants scanned.
--vcf: ALL.phase3.biallelic-only.vcf.gz.10kSNPs-temporary.pgen +
ALL.phase3.biallelic-only.vcf.gz.10kSNPs-temporary.pvar +
ALL.phase3.biallelic-only.vcf.gz.10kSNPs-temporary.psam written.
2504 samples (0 females, 0 males, 2504 ambiguous; 2504 founders) loaded from
ALL.phase3.biallelic-only.vcf.gz.10kSNPs-temporary.psam.
220000 variants loaded from
ALL.phase3.biallelic-only.vcf.gz.10kSNPs-temporary.pvar.
Note: No phenotype data present.
Writing ALL.phase3.biallelic-only.vcf.gz.10kSNPs.fam ... done.
Writing ALL.phase3.biallelic-only.vcf.gz.10kSNPs.bim ... done.
Writing ALL.phase3.biallelic-only.vcf.gz.10kSNPs.bed ... done.
End time: Sun Mar 22 11:56:28 2020

$ ls
ALL.phase3.biallelic-only.vcf.gz.10kSNPs.bed    ALL.phase3.biallelic-only.vcf.gz.10kSNPs.fam    gcta64
ALL.phase3.biallelic-only.vcf.gz.10kSNPs.bim    ALL.phase3.biallelic-only.vcf.gz.10kSNPs.log    plink2

$ ./gcta64 --bfile ALL.phase3.biallelic-only.vcf.gz.10kSNPs --simu-cc 500 500 --simu-hsq 0.5 --simu-k 0.1 --simu-rep 3 --out ALL.phase3.biallelic-only.vcf.gz.10kSNPs
*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version 1.93.0 beta Mac
* (C) 2010-2019, The University of Queensland
* Please report bugs to Jian Yang <jian.yang@uq.edu.au>
*******************************************************************
Analysis started at 11:59:16 GMT on Sun Mar 22 2020.
Hostname: Brunos-MBP

Accepted options:
--bfile ALL.phase3.biallelic-only.vcf.gz.10kSNPs
--simu-cc 500 500
--simu-hsq 0.5
--simu-k 0.1
--simu-rep 3
--out ALL.phase3.biallelic-only.vcf.gz.10kSNPs


Reading PLINK FAM file from [ALL.phase3.biallelic-only.vcf.gz.10kSNPs.fam].
2504 individuals to be included from [ALL.phase3.biallelic-only.vcf.gz.10kSNPs.fam].
Reading PLINK BIM file from [ALL.phase3.biallelic-only.vcf.gz.10kSNPs.bim].
220000 SNPs to be included from [ALL.phase3.biallelic-only.vcf.gz.10kSNPs.bim].
Warning: Duplicated SNP ID "rs145607083" has been changed to "rs145607083_5264"
.Warning: Duplicated SNP ID "rs145607083" has been changed to "rs145607083_5265"
.Warning: Duplicated SNP ID "rs71955229" has been changed to "rs71955229_27061"
.Warning: Duplicated SNP ID "rs71955229" has been changed to "rs71955229_27062"
.Warning: Duplicated SNP ID "rs71589472" has been changed to "rs71589472_42505"
.Warning: Duplicated SNP ID "rs563156514" has been changed to "rs563156514_49111"
.Warning: Duplicated SNP ID "rs563156514" has been changed to "rs563156514_49112"
.Warning: Duplicated SNP ID "rs539504239" has been changed to "rs539504239_79196"
.Warning: Duplicated SNP ID "rs35739849" has been changed to "rs35739849_105514"
.Warning: Duplicated SNP ID "rs148795567" has been changed to "rs148795567_123134"
.Warning: Duplicated SNP ID "rs143101359" has been changed to "rs143101359_201815"
.Reading PLINK BED file from [ALL.phase3.biallelic-only.vcf.gz.10kSNPs.bed] in SNP-major format ...
Genotype data for 2504 individuals and 220000 SNPs to be included from [ALL.phase3.biallelic-only.vcf.gz.10kSNPs.bed].
Simulation parameters:
Number of simulation replicate(s) = 3 (Default = 1)
Heritability of liability = 0.5 (Default = 0.1)
Disease prevalence = 0.1 (Default = 0.1)
Number of cases = 500
Number of controls = 500

Error: can not open the file [] to read.
An error occurs, please check the options or data

$ head ALL.phase3.biallelic-only.vcf.gz.10kSNPs.log 
*******************************************************************
* Genome-wide Complex Trait Analysis (GCTA)
* version 1.93.0 beta Mac
* (C) 2010-2019, The University of Queensland
* Please report bugs to Jian Yang <jian.yang@uq.edu.au>
*******************************************************************
Analysis started at 11:59:16 GMT on Sun Mar 22 2020.
Hostname: Brunos-MBP

Accepted options:

I'm open to using other tools if you have any recommendations. I'm not sure, but it looks like PLINK doesn't do this (it can simulate phenotypes only if it also simulates the genetic data).

Tags: gcta, plink
4.6 years ago
jian.yang.qt ▴ 30

You need to specify the causal variants using the --simu-causal-loci option.
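
As a rough sketch of what that could look like: the causal list is a plain-text file of SNP IDs, one per line, optionally with an effect size in a second column (if effect sizes are omitted, GCTA draws them from a standard normal distribution). The helper `pick_causal`, the file name `causal.snplist`, the `.sim` output prefix, and the choice of 100 evenly spaced SNPs are all illustrative assumptions, not something from the original post. IDs that occur more than once in the .bim file are skipped, since GCTA renames duplicated IDs on load and they might no longer match.

```shell
#!/bin/sh
# Sketch only: build a causal-SNP list from a PLINK .bim file, then pass it to GCTA.
# BFILE, N_CAUSAL and causal.snplist are illustrative names (assumptions).
BFILE="${BFILE:-ALL.phase3.biallelic-only.vcf.gz.10kSNPs}"
N_CAUSAL="${N_CAUSAL:-100}"

# pick_causal BIM N: print up to N evenly spaced SNP IDs (column 2 of the .bim),
# skipping missing IDs (".") and any ID that appears more than once.
pick_causal() {
    awk -v n="$2" '
        NR == FNR { count[$2]++; next }                  # first pass: count each ID
        count[$2] == 1 && $2 != "." { ids[++m] = $2 }    # second pass: keep unique IDs
        END {
            if (n > m) n = m
            for (i = 1; i <= n; i++) print ids[int((i - 1) * m / n) + 1]
        }' "$1" "$1"
}

# Run the simulation only when the PLINK files are actually present.
if [ -f "$BFILE.bim" ]; then
    pick_causal "$BFILE.bim" "$N_CAUSAL" > causal.snplist
    ./gcta64 --bfile "$BFILE" \
        --simu-cc 500 500 \
        --simu-causal-loci causal.snplist \
        --simu-hsq 0.5 --simu-k 0.1 --simu-rep 3 \
        --out "$BFILE.sim"
fi
```

Two things to watch for. First, using a separate `--out` prefix keeps GCTA from overwriting the PLINK `.log` file, which is what happened in the listing above. Second, as far as I understand the GCTA documentation, the number of cases is limited to roughly n × K because the simulation works on the real genotypes; with 2504 samples and `--simu-k 0.1` that is about 250, so `--simu-cc 500 500` may need to be lowered. If the run succeeds, the simulated phenotypes should end up in a `.phen` file and the causal effects in a `.par` file under the chosen prefix.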
