I'm 100% new to bioinformatics and terrible with computers (I'm a medical doctor). I'm currently working on a GWAS with data from the Human Connectome Project (HCP). I'm running into some issues, so please bear with me if my description of the problem isn't optimal.
I'm using PLINK.
I already have the phenotypes from the HCP webpage in .csv format.
This is what my .fam file looks like:
52259_82122 100004 52259 82122 1 -9
56037_85858 100206 56037 85858 1 -9
51488_81352 100307 51488 81352 2 -9
51730_81594 100408 51730 81594 1 -9
52813_82634 100610 52813 82634 1 -9
51283_52850_81149 101006 51283 81149 2 -9
51969_81833 101107 51969 81833 1 -9
51330_81195 101208 51330 81195 2 -9
52385_82248 101309 52385 82248 1 -9
52198_82061 101410 52198 82061 1 -9
This is what my phenotype .csv file looks like:
Subject,Age_in_Yrs,HasGT,ZygositySR,ZygosityGT,Family_ID,Mother_ID,Father_ID,TestRetestInterval,Race,Ethnicity,Handedness,SSAGA_Employ,SSAGA_Income,SSAGA_Educ,SSAGA_InSchool,SSAGA_Rlshp,SSAGA_MOBorn,Height,Weight,BMI,SSAGA_BMICat,SSAGA_BMICatHeaviest,Blood_Drawn,Hematocrit_1,Hematocrit_2,BPSystolic,BPDiastolic,ThyroidHormone,HbA1C,Hypothyroidism,Hypothyroidism_Onset,Hyperthyroidism,Hyperthyroidism_Onset,OtherEndocrn_Prob,OtherEndocrine_ProbOnset,Menstrual_RegCycles,Menstrual_Explain,Menstrual_AgeBegan,Menstrual_CycleLength,Menstrual_DaysSinceLast,Menstrual_AgeIrreg,Menstrual_AgeStop,Menstrual_MonthsSinceStop,Menstrual_UsingBirthControl,Menstrual_BirthControlCode,FamHist_Moth_Scz,FamHist_Fath_Scz,FamHist_Moth_Dep,FamHist_Fath_Dep,FamHist_Moth_BP,FamHist_Fath_BP,FamHist_Moth_Anx,FamHist_Fath_Anx,FamHist_Moth_DrgAlc,FamHist_Fath_DrgAlc,FamHist_Moth_Alz,FamHist_Fath_Alz,FamHist_Moth_PD,FamHist_Fath_PD,FamHist_Moth_TS,FamHist_Fath_TS,FamHist_Moth_None,FamHist_Fath_None,ASR_Anxd_Raw,ASR_Anxd_Pct,ASR_Witd_Raw,ASR_Witd_T,ASR_Soma_Raw,ASR_Soma_T,ASR_Thot_Raw,ASR_Thot_T,ASR_Attn_Raw,ASR_Attn_T,ASR_Aggr_Raw,ASR_Aggr_T,ASR_Rule_Raw,ASR_Rule_T,ASR_Intr_Raw,ASR_Intr_T,ASR_Oth_Raw,ASR_Crit_Raw,ASR_Intn_Raw,ASR_Intn_T,ASR_Extn_Raw,ASR_Extn_T,ASR_TAO_Sum,ASR_Totp_Raw,ASR_Totp_T,DSM_Depr_Raw,DSM_Depr_T,DSM_Anxi_Raw,DSM_Anxi_T,DSM_Somp_Raw,DSM_Somp_T,DSM_Avoid_Raw,DSM_Avoid_T,DSM_Adh_Raw,DSM_Adh_T,DSM_Inat_Raw,DSM_Hype_Raw,DSM_Antis_Raw,DSM_Antis_T,SSAGA_ChildhoodConduct,SSAGA_PanicDisorder,SSAGA_Agoraphobia,SSAGA_Depressive_Ep,SSAGA_Depressive_Sx,Color_Vision,Eye,EVA_Num,EVA_Denom,Correction,Breathalyzer_Over_05,Breathalyzer_Over_08,Cocaine,THC,Opiates,Amphetamines,MethAmphetamine,Oxycontin,Total_Drinks_7days,Num_Days_Drank_7days,Avg_Weekday_Drinks_7days,Avg_Weekend_Drinks_7days,Total_Beer_Wine_Cooler_7days,Avg_Weekday_Beer_Wine_Cooler_7days,Avg_Weekend_Beer_Wine_Cooler_7days,Total_Malt_Liquor_7days,Avg_Weekday_Malt_Liquor_7days,Avg_Weekend_Malt_Liquor_7days,Total_Wine_7days,Avg_Week
day_Wine_7days,Avg_Weekend_Wine_7days,Total_Hard_Liquor_7days,Avg_Weekday_Hard_Liquor_7days,Avg_Weekend_Hard_Liquor_7days,Total_Other_Alc_7days,Avg_Weekday_Other_Alc_7days,Avg_Weekend_Other_Alc_7days,SSAGA_Alc_D4_Dp_Sx,SSAGA_Alc_D4_Ab_Dx,SSAGA_Alc_D4_Ab_Sx,SSAGA_Alc_D4_Dp_Dx,SSAGA_Alc_12_Drinks_Per_Day,SSAGA_Alc_12_Frq,SSAGA_Alc_12_Frq_5plus,SSAGA_Alc_12_Frq_Drk,SSAGA_Alc_12_Max_Drinks,SSAGA_Alc_Age_1st_Use,SSAGA_Alc_Hvy_Drinks_Per_Day,SSAGA_Alc_Hvy_Frq,SSAGA_Alc_Hvy_Frq_5plus,SSAGA_Alc_Hvy_Frq_Drk,SSAGA_Alc_Hvy_Max_Drinks,Total_Any_Tobacco_7days,Times_Used_Any_Tobacco_Today,Num_Days_Used_Any_Tobacco_7days,Avg_Weekday_Any_Tobacco_7days,Avg_Weekend_Any_Tobacco_7days,Total_Cigarettes_7days,Avg_Weekday_Cigarettes_7days,Avg_Weekend_Cigarettes_7days,Total_Cigars_7days,Avg_Weekday_Cigars_7days,Avg_Weekend_Cigars_7days,Total_Pipes_7days,Avg_Weekday_Pipes_7days,Avg_Weekend_Pipes_7days,Total_Chew_7days,Avg_Weekday_Chew_7days,Avg_Weekend_Chew_7days,Total_Snuff_7days,Avg_Weekday_Snuff_7days,Avg_Weekend_Snuff_7days,Total_Other_Tobacco_7days,Avg_Weekday_Other_Tobacco_7days,Avg_Weekend_Other_Tobacco_7days,SSAGA_FTND_Score,SSAGA_HSI_Score,SSAGA_TB_Age_1st_Cig,SSAGA_TB_DSM_Difficulty_Quitting,SSAGA_TB_DSM_Tolerance,SSAGA_TB_DSM_Withdrawal,SSAGA_TB_Hvy_CPD,SSAGA_TB_Max_Cigs,SSAGA_TB_Reg_CPD,SSAGA_TB_Smoking_History,SSAGA_TB_Still_Smoking,SSAGA_TB_Yrs_Since_Quit,SSAGA_TB_Yrs_Smoked,SSAGA_Times_Used_Illicits,SSAGA_Times_Used_Cocaine,SSAGA_Times_Used_Hallucinogens,SSAGA_Times_Used_Opiates,SSAGA_Times_Used_Sedatives,SSAGA_Times_Used_Stimulants,SSAGA_Mj_Use,SSAGA_Mj_Ab_Dep,SSAGA_Mj_Age_1st_Use,SSAGA_Mj_Times_Used
101208,35,true,NotMZ,DZ,51330_81195,51330,81195,,Black or African Am.,Hispanic/Latino,100,2,8,17,0,1,1,63,133,23.56,1,1,1,37,39,115,76,0.85,5.5,0,,0,,0,,1,,15,2,27,,,,0,,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,2,50,2,51,4,56,2,51,2,50,1,50,1,51,0,50,6,3,8,46,2,38,10,20,41,6,56,4,51,1,51,1,50,1,50,1,0,1,50,0,0,1,1,0,NORMAL,B,20,16,-2.5,false,false,false,false,false,false,false,false,0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,1,0,1,,,,,,,,,,,,0,0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,0,0.0,0.0,,,,,,,,,,0,0,,,0,0,0,0,0,0,0,0,,0
From what I understand so far, I have to attach a phenotype to the .fam file, so I tried the following, using age as an example phenotype:
./plink --bfile genotypefile --pheno phenotype.csv --pheno-name Age_in_Yrs --make-bed --out filename
and this happens:
aldo@dell1:~/Desktop/PLINK$ ./plink --bfile MEGA_Chip --pheno rest.csv --pheno-name Age_In_Years --make-bed --out mergedage
PLINK v1.90b6.21 64-bit (19 Oct 2020) www.cog-genomics.org/plink/1.9/
(C) 2005-2020 Shaun Purcell, Christopher Chang GNU General Public License v3
Logging to mergedage.log.
Options in effect:
--bfile MEGA_Chip
--make-bed
--out mergedage
--pheno rest.csv
--pheno-name Age_In_Years
32117 MB RAM detected; reserving 16058 MB for main workspace.
2119803 variants loaded from .bim file.
1141 people (523 males, 618 females) loaded from .fam.
Error: Line 1 of --pheno file has fewer tokens than expected.
So I'm stuck at this error (Line 1 of --pheno file has fewer tokens than expected.). Modifying phenotype.csv is not a problem, since the file is small. However, I can't open the .ped file because it is too big (9.7 GB) and my computer just dies when I try.
Somehow, yesterday I managed to modify phenotype.csv in such a way that the error turned into "Line 1 of --(.ped, I think) file has fewer tokens than expected." I seem to have deleted or shifted columns so that they matched (FID IID).
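On the oversized .ped file: with --bfile, PLINK reads the .bed/.bim/.fam trio, so the .ped should not be involved in the command above at all. If you still want to look inside a file that is too big for an editor, you can read just the first few kilobytes. A minimal Python sketch (the function name `peek_fields` and the 4096-byte limit are my own choices, not anything from PLINK):

```python
def peek_fields(path, n_fields=6, max_bytes=4096):
    """Return the first n_fields whitespace-separated tokens of a file's first line.

    Only max_bytes are read, so this stays fast on multi-gigabyte files.
    If the line is longer than max_bytes, the last token read may be cut
    short, so keep n_fields small (the first six .ped columns are short).
    """
    with open(path, "rb") as handle:
        chunk = handle.read(max_bytes)
    first_line = chunk.split(b"\n", 1)[0]
    tokens = first_line.decode("ascii", errors="replace").split()
    return tokens[:n_fields]
```

For example, `peek_fields("MEGA_Chip.ped")` would show the six leading columns of the first sample without loading the rest (the .ped filename here is assumed from the --bfile prefix).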
Any help would be appreciated
Thanks! :)
I think you need a space- or tab-separated file as the pheno file, not a comma-separated one; see the --pheno section of the PLINK manual.
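A rough sketch of that conversion in Python, in case editing by hand gets tedious. It writes a tab-delimited file with a `FID IID` header followed by only the phenotype columns you ask for. Assumptions on my part: the `Subject` column of the HCP CSV corresponds to the second (IID) column of the .fam file, as the sample rows in the question suggest; the FID is looked up from the .fam rather than taken from the CSV; and the function names and output filename are made up for illustration.

```python
import csv


def read_fam_ids(fam_path):
    """Map IID -> FID from a PLINK .fam file (whitespace-delimited, no header)."""
    iid_to_fid = {}
    with open(fam_path) as fam:
        for line in fam:
            fields = line.split()
            if len(fields) >= 2:
                iid_to_fid[fields[1]] = fields[0]
    return iid_to_fid


def write_pheno(csv_path, fam_path, out_path, columns,
                id_column="Subject", missing="-9"):
    """Write FID, IID and the chosen phenotype columns as a tab-delimited file.

    Only pick numeric columns: PLINK splits on spaces as well as tabs, so a
    value such as "Black or African Am." would shift the columns of that row.
    Returns the number of subjects written.
    """
    iid_to_fid = read_fam_ids(fam_path)
    written = 0
    with open(csv_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        writer.writerow(["FID", "IID"] + list(columns))
        for row in reader:
            iid = row[id_column].strip()
            fid = iid_to_fid.get(iid)
            if fid is None:
                continue  # subject has no genotype record in the .fam
            # Empty cells become the missing-phenotype code.
            values = [(row[c] or "").strip() or missing for c in columns]
            writer.writerow([fid, iid] + values)
            written += 1
    return written
```

Usage would be something like `write_pheno("rest.csv", "MEGA_Chip.fam", "age_pheno.txt", ["Age_in_Yrs"])`, then passing `--pheno age_pheno.txt --pheno-name Age_in_Yrs` to PLINK. Note that the name given to --pheno-name has to match the header exactly: the log in the question passes `Age_In_Years`, while the CSV header shown is `Age_in_Yrs`.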
Thanks for the reply!
OK, I did what you suggested and changed the pheno file to be tab-delimited. After that, the error changed to "Line 1 of --fam file has fewer tokens than expected."
So I decided to also change the .fam file to be tab-delimited,
and then this happened:
Error: --pheno-name requires the --pheno file to have a header line with first two columns 'FID' and 'IID'
So I edited the .fam and pheno files so that they both had matching FID/IID as the first two columns:
.fam:
pheno:
and now I get this:
The .fam file is now a .csv because I changed it to be tab-delimited, but this shouldn't be an issue because I specified it with --fam.
So the issue now is that it's not recognizing any phenotype.
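One thing worth checking when no phenotype is picked up: PLINK assigns a phenotype only to samples whose FID and IID both match a row in the --pheno file, so if the ID pairs differ in any way, every sample keeps -9. Since the edited files aren't shown, this is only a guess at the cause. A small Python sketch to count the matches (function names are hypothetical; it assumes the pheno file has a header line and the .fam does not):

```python
def read_id_pairs(path, skip_header=False):
    """Collect (FID, IID) pairs from the first two whitespace-separated columns."""
    pairs = set()
    with open(path) as handle:
        if skip_header:
            next(handle, None)  # drop the "FID IID ..." header line
        for line in handle:
            fields = line.split()
            if len(fields) >= 2:
                pairs.add((fields[0], fields[1]))
    return pairs


def id_overlap(fam_path, pheno_path):
    """Count ID pairs shared by the .fam and pheno files, and those unique to each."""
    fam_ids = read_id_pairs(fam_path)
    pheno_ids = read_id_pairs(pheno_path, skip_header=True)
    return {
        "matched": len(fam_ids & pheno_ids),
        "fam_only": len(fam_ids - pheno_ids),
        "pheno_only": len(pheno_ids - fam_ids),
    }
```

If `matched` comes back as 0, the two files disagree on FID, IID, or the delimiter, and that would be the first thing to fix. It is also generally safer to leave the original .fam untouched and build the pheno file around its IDs.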
This is what the resulting .fam file looks like, with all the -9s for the missing phenotypes:
Thanks again for the help!!!