I received genome-wide association study (GWAS) data from a colleague who has reportedly done all the imputation and quality control according to the consortium's standards. Genotyping was done on the Illumina 660 chip and imputed to HapMap (3.2 million SNPs total).
The data came to me as a matrix of 11,000 samples (rows) by 3.2 million SNPs (columns). There's a header row naming each SNP, and genotypes are coded as the number of minor alleles (or the expected allele dosage for imputed SNPs).
Here are a few rows and columns to show what it looks like:
rs1793851  rs9929479  rs929483
2.0        0          1
1.6        0          1
2.0        NA         0
2.0        0          1
1.6        0          0
2.0        1          NA
1.0        0          0
1.9        0          2
I've always used PLINK for GWAS data management, QC, and analysis because it handles data at this scale efficiently. However, this kind of matrix can't be imported directly into PLINK or converted into a pedigree-format file. (PLINK does handle imputed data, and so does SNPTEST, but both of these require genotype probabilities, and I only have the expected allele dosage.)
I did write some R code to read the data in chunks and run some simple summary and association statistics, but this approach is clunky and suboptimal for several reasons:
- The dataset first has to be split up (I used a Perl wrapper around UNIX cut to do this; the splitting step is sketched after this list). After splitting the dataset into several hundred files, each holding all samples and a subset of SNPs, computing sample-level measures (sample call rate, relatedness, ethnic outliers) is going to be a real coding nightmare.
- Subsetting analyses is going to be difficult (nothing as convenient as PLINK's --exclude, --extract, --keep, --remove, --cluster, etc.).
- PLINK joins SNP annotation info (from the map file) onto your results; without it, linking QC and analysis results to genomic position, minor allele, etc., will require lots of SQL joins.
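For reference, the splitting step looked roughly like this, rewritten here as a plain shell sketch rather than my actual Perl wrapper (file names and chunk size are placeholders):

    # carve the 3.2M SNP columns into chunks of 10,000 columns, keeping all samples
    ncols=3200000   # total SNP columns
    chunk=10000     # SNPs per output file
    i=0
    start=1
    while [ "$start" -le "$ncols" ]; do
        end=$(( start + chunk - 1 ))
        i=$(( i + 1 ))
        # every cut invocation re-reads the whole ~100 GB matrix
        cut -d ' ' -f "${start}-${end}" dosages.txt > "chunk_${i}.txt"
        start=$(( end + 1 ))
    done

Each cut call makes a full pass over the ~100 GB file, so this alone is several hundred passes over the data.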
Ideally I don't want to rewrite software for GWAS data management, QC, and analysis. I've considered (1) analyzing only genotyped SNPs, or (2) rounding the allele dosages to the nearest integer so I can use PLINK, but both of these approaches discard useful data.
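To illustrate option 2: the rounding can be done in a single awk pass. The sketch below also applies a hard-call threshold (a common convention I'd add on top of plain rounding, not something PLINK requires), setting dosages that fall too far from an integer to missing rather than pretending they are confident genotype calls. The file name and the 0.2 threshold are placeholders:

    awk 'NR == 1 { print; next }      # keep the SNP header row as-is
    {
        for (i = 1; i <= NF; i++) {
            if ($i != "NA") {
                r = int($i + 0.5)     # nearest integer; dosages are >= 0
                $i = ($i - r < -0.2 || $i - r > 0.2) ? "NA" : r
            }
        }
        print
    }' dosages.txt > hardcalls.txt

Under this rule a dosage like 1.6 becomes missing, which is exactly the kind of information loss I'm worried about.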
Does anyone have suggestions on how I should start to QC and analyze these data without reinventing the wheel or rewriting PLINK? Is there other software that could take this kind of input? Keep in mind, my dataset is nearly 100 GB.
Thanks in advance.
Have you tried Beagle and PRESTO? http://faculty.washington.edu/browning/beagle/beagle.html
I looked through all the Beagle utilities and didn't see much that would help.
Hi Stephen, have you successfully managed this kind of data? I have a similar matrix (markers as rows, sample genotypes as columns). Is there a way to convert it to PLINK format? Thanks
Just a hint, Shirley: this thread is 5.7 years old, so I don't know whether you'll get an answer from Stephen.
Can you post a sample file? If it has individuals as rows and markers as columns, you can read it into PLINK directly, choosing the import options to match your input file. If instead you have markers as rows and individuals as columns, you'll have to transpose the matrix first
and then read the transposed output file with the same kind of options.
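The exact options didn't survive in the post above, so here is a hedged sketch of one workable route: if the matrix has markers as rows (SNP id in the first column, one 0/1/2 genotype per sample after it), you can rewrite it as PLINK's transposed text format (.tped/.tfam) and load it with --tfile. The dummy alleles, dummy positions, and file names below are placeholder assumptions:

    awk '{
        printf "0 %s 0 0", $1              # chr (unknown), SNP id, cM, bp
        for (i = 2; i <= NF; i++) {
            if      ($i == 0) printf " A A"
            else if ($i == 1) printf " A B"
            else if ($i == 2) printf " B B"
            else              printf " 0 0" # NA or anything else -> missing
        }
        printf "\n"
    }' matrix.txt > study.tped

    # one .tfam line per sample: FID IID PAT MAT SEX PHENO
    n=$(( $(head -n 1 matrix.txt | wc -w) - 1 ))
    for s in $(seq 1 "$n"); do
        echo "FAM${s} IND${s} 0 0 0 -9"
    done > study.tfam

    plink --tfile study --make-bed --out study

Since the matrix carries no real allele labels, the A/B alleles are arbitrary: allele identities in the output are meaningless, but genotype counts, call rates, and association statistics are unaffected.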
Hope that helps!
Hello everyone
I have received imputed data (from IMPUTE), split by chromosome into chr1 to chr22 PLINK filesets (.bed, .bim, and .fam).
Next, I would like to perform post-imputation QC and association analysis.
I have experience working with a single fileset, but this is the first time I have data split across chr1 to chr22 files.
Please let me know if there is a protocol or tool I can follow to perform the post-imputation QC and association analysis.
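For concreteness, this is the sort of workflow I imagine (assuming PLINK 1.9; the chr1 ... chr22 fileset names, the covariate file, and the QC thresholds are placeholders rather than an accepted standard):

    # 1. merge the 22 per-chromosome filesets into one
    for c in $(seq 1 22); do echo "chr${c}"; done > merge_list.txt
    plink --merge-list merge_list.txt --make-bed --out all_chr

    # 2. basic post-imputation QC on the merged set
    plink --bfile all_chr --maf 0.01 --geno 0.05 --hwe 1e-6 \
          --make-bed --out all_chr_qc

    # 3. association analysis (logistic for a case/control phenotype)
    plink --bfile all_chr_qc --logistic --covar covariates.txt --out assoc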
Thanks in advance, M