I am working on a project to find associations between SNPs and certain phenotypes using data sets from the dbGaP database. I found some interesting data sets, downloaded them from dbGaP, and then decrypted and extracted them.
This resulted in some folders with idat format files, gtc.txt files, and some phenotype files in xml format.
I would like to analyze this data in R with packages like SNPassoc, snpMatrix, or GenABEL.
The problem is that these R packages seem to expect a tab-delimited plain text table as input, with one row per sample containing the sample ID, phenotype data, SNP genotypes, etc. This format is very different from the idat, gtc.txt, and xml formats that I found in the dbGaP data.
Is there an R package or any script/program that can take all the dbGaP data (idat, gtc.txt, and phenotype info in xml) and generate summary tables like the ones required by these R packages?
Here are some examples of the files found in the dbGaP data that I have extracted:
gtc.txt:
SNP Name  GC Score    Allele1 - Top  Allele2 - Top  Allele1 - AB  Allele2 - AB  X                     Y                    Raw X  Raw Y
200003    0.9226053   A              A              A             A             0.934740661177471     0.0394069163635861   7614   1009
200006    0.80280876  G              G              B             B             0.03840060068975691   1.5842950219375036   788    19290
200047    0.7352572   A              A              A             A             0.42971193949905434   0.03922872128858323  3636   953
200050    0.789192    G              G              B             B             0.020351741593668694  1.0929231320570174   545    9315
200052    0.9563731   T              T              B             B             0.01696443095800867   0.9911898858364148   945    12561
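To make the target format concrete, here is a rough Python sketch of the kind of conversion I have in mind for the gtc.txt files. It assumes each gtc.txt holds one sample, is tab-delimited, and starts with the header row shown above; none of that is confirmed for every dbGaP study. The function names and the sample ID `sample_1` are made up for illustration.

```python
import csv
import io

def read_genotypes(handle):
    """Return {SNP name: genotype} from one per-sample gtc.txt export."""
    reader = csv.DictReader(handle, delimiter="\t")
    genotypes = {}
    for row in reader:
        # Join the two Top-strand alleles into one genotype string, e.g. "A/A".
        genotypes[row["SNP Name"]] = row["Allele1 - Top"] + "/" + row["Allele2 - Top"]
    return genotypes

def write_wide_table(per_sample, out):
    """Write one row per sample and one column per SNP, tab-delimited."""
    snps = sorted({snp for calls in per_sample.values() for snp in calls})
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["sample_id"] + snps)
    for sample_id, calls in sorted(per_sample.items()):
        # "NA" marks SNPs that are absent from a sample's file.
        writer.writerow([sample_id] + [calls.get(snp, "NA") for snp in snps])

# Two rows copied from the excerpt above, rebuilt as tab-delimited text.
header = ["SNP Name", "GC Score", "Allele1 - Top", "Allele2 - Top",
          "Allele1 - AB", "Allele2 - AB", "X", "Y", "Raw X", "Raw Y"]
rows = [
    ["200003", "0.9226053", "A", "A", "A", "A",
     "0.934740661177471", "0.0394069163635861", "7614", "1009"],
    ["200006", "0.80280876", "G", "G", "B", "B",
     "0.03840060068975691", "1.5842950219375036", "788", "19290"],
]
example = "\n".join("\t".join(r) for r in [header] + rows) + "\n"

calls = read_genotypes(io.StringIO(example))
out = io.StringIO()
write_wide_table({"sample_1": calls}, out)  # "sample_1" is a hypothetical ID
print(out.getvalue())
```

In practice the sample ID would have to come from somewhere else, for example the file name or a sample mapping file, and the phenotype columns would still need to be merged in by sample ID.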
phenotype xml:
<?xml-stylesheet type="text/xsl" href="varreports_v3.xsl"?><data_table name="MEC_XXXXXX_Subject" dataset_id="XXXXXX" study_name="A Multiethnic GWAS of XXXXXX" study_id="phs000306.v4" participant_set="1" date_created="04/10/2014"><variable id="XXXXXXX.v2.p1" var_name="SUBJID" calculated_type="string" reported_type="integer"><description>XXXXX ID</description><total><subject_profile><sex><male>9454</male><female>13</female></sex></subject_profile><stats><stat n="9482" nulls="0"/></stats></total></variable><variable id="XXXXX.v2.p1.c1" var_name="SUBJID" calculated_type="string" reported_type="integer"><description>XXXX ID</description><total><subject_profile><sex><male>2467</male></sex></subject_profile><stats>
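Judging from the excerpt, this xml looks like a per-variable summary report (counts per variable) rather than per-subject phenotype values. As a minimal sketch, the variable names and descriptions could be pulled out with Python's standard `xml.etree.ElementTree`. The input below is a shortened copy of the excerpt that I closed by hand so that it parses, and `summarize_variables` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-closed copy of the excerpt above (placeholders kept as-is).
xml_text = """<data_table name="MEC_XXXXXX_Subject" dataset_id="XXXXXX">
  <variable id="XXXXXXX.v2.p1" var_name="SUBJID" calculated_type="string" reported_type="integer">
    <description>XXXXX ID</description>
    <total>
      <stats><stat n="9482" nulls="0"/></stats>
    </total>
  </variable>
</data_table>"""

def summarize_variables(text):
    """Return one dict per <variable> with its id, name, description and n."""
    root = ET.fromstring(text)
    summary = []
    for var in root.iter("variable"):
        stat = var.find("./total/stats/stat")
        summary.append({
            "id": var.get("id"),
            "var_name": var.get("var_name"),
            "description": var.findtext("description", default=""),
            # n is the number of records reported for this variable, if present.
            "n": int(stat.get("n")) if stat is not None else None,
        })
    return summary

variables = summarize_variables(xml_text)
for v in variables:
    print("\t".join(str(v[k]) for k in ("id", "var_name", "description", "n")))
```

This only lists which variables exist. It does not produce the per-subject phenotype values that the R packages need.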
idat is a binary format and can't be read as plain text.
Sorry, but I have a question.
Do I need a special account to access the dbGaP database?
Thank you so much!
You will probably need to request an account to access all of the content in the dbGaP database, because some data sets are not open to the public. In my case I had to request an account because I needed access to these closed data sets.