How to create a phenotype column (case, control, NA) based on IDs in other txt files
2.3 years ago

Hello, I am preparing a phenotype file for a GWAS.

I have a large txt file of 44k participants (all cohort participants): Column 1 = FID, Column 2 = IID, Column 3 = pseudoID. I want to create a 4th column with my phenotype of interest (1 = case, 0 = control, NA = all other participants). I also have two separate text files, one for my cases and one for my controls, each containing a single column of pseudoIDs.

(1) How do I create a header for the 4th column?

(2) How do I use the pseudoIDs in the separate case and control txt files to fill in 1 or 0 as required in the 4th column?

(3) How do I set the remaining empty rows in the 4th column to NA?

I will be using Regenie for the GWAS. Any help would be appreciated. Thank you.

GWAS • 1.1k views

You should include the first few lines from each file, and an example of what you want the outcome to look like.


44k participant file txt

FID IID Pseudo_ID
1 150023532 E78GJHI
1 150023457 E96GH25
1 150075826 E56HFT7 
1 150065943 EH87HN7
1 150034923 ENM8H53

Case txt

E78GJHI
ENM8H53

Control txt

E96GH25
EH87HN7

Expected phenotype file output

FID IID Pseudo_ID ICD_10
1 150023532 E78GJHI 1
1 150023457 E96GH25 0
1 150075826 E56HFT7 NA
1 150065943 EH87HN7 0
1 150034923 ENM8H53 1
2.3 years ago

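Append a tab and a 1 to each line of cases.txt with sed 's/$/\t1/' cases.txt (the \t escape needs GNU sed), sort the result on the first column, and save it as cases_2.txt. Do the same for the controls, appending 0 instead of 1.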

Sort the 44k file on the third column (assuming it is tab-delimited):

sort -t $'\t' -k3,3 44k.txt > sorted.txt

Use join -t $'\t' -1 3 -2 1 sorted.txt cases_2.txt to join the cases. Do the same for the controls; see join's -v option for the samples that match neither list.

Use awk 'BEGIN{printf("Header\n");} {print}' to add the header (replace Header with your column names).
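
The steps above can be sketched end to end roughly as follows. This is a minimal sketch under a few assumptions: the input is tab-delimited, the sample rows from the question are recreated in a scratch directory, and status.txt and phenotype.txt are made-up file names. Rather than running join once per list, it merges cases and controls into a single lookup file and relies on join's -a 1, -e and -o options to keep unmatched participants as NA. The header is set aside before sorting and printed back at the end.

```shell
# Scratch directory with the sample rows from the question (tab-delimited).
cd "$(mktemp -d)" || exit 1
printf 'FID\tIID\tPseudo_ID\n'   >  44k.txt
printf '1\t150023532\tE78GJHI\n' >> 44k.txt
printf '1\t150023457\tE96GH25\n' >> 44k.txt
printf '1\t150075826\tE56HFT7\n' >> 44k.txt
printf '1\t150065943\tEH87HN7\n' >> 44k.txt
printf '1\t150034923\tENM8H53\n' >> 44k.txt
printf 'E78GJHI\nENM8H53\n' > cases.txt
printf 'E96GH25\nEH87HN7\n' > controls.txt

export LC_ALL=C          # sort and join must agree on the collation order
TAB=$(printf '\t')       # portable way to get a literal tab

# Label cases with 1 and controls with 0, then sort the combined list by ID.
# status.txt is a hypothetical name for this lookup file.
{ sed "s/\$/${TAB}1/" cases.txt; sed "s/\$/${TAB}0/" controls.txt; } |
    sort -t "$TAB" -k1,1 > status.txt

# Sort only the data rows (header skipped) on the third column.
tail -n +2 44k.txt | sort -t "$TAB" -k3,3 > sorted.txt

# -a 1 keeps participants with no match, -e NA fills their missing status,
# -o restores the original column order (FID, IID, Pseudo_ID, status).
{
    printf 'FID\tIID\tPseudo_ID\tICD_10\n'
    join -t "$TAB" -1 3 -2 1 -a 1 -e NA -o 1.1,1.2,1.3,2.2 sorted.txt status.txt
} > phenotype.txt

cat phenotype.txt
```

Note that the rows come out sorted by Pseudo_ID rather than in the original order of the 44k file.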


Thank you so much for the suggestions. A couple of issues I am encountering.


When I sort the 44k tab-delimited txt file with sort -t $'\t' -k3,3 44k.txt > sorted.txt, the header ends up at the bottom. I am not sure whether that causes a problem in the following step. I have tried to fix it, without success.


When I perform join -t $'\t' -1 3 -2 1 sorted.txt cases_2.txt, the output looks like this:


E78GJHI 1      150023532      1 
ENM8H53 1      150034923      1

Is there anything I can do to fix this? Thank you.
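
Both symptoms are expected behaviour rather than errors: sort treats the header as an ordinary row, and join always prints the join field first, followed by the remaining fields of each file. One possible way around both, sketched here under assumptions, is to skip sorting altogether and look the IDs up with awk, which keeps the header on top and the rows in their original order. The sketch recreates the sample rows from this thread in a scratch directory; phenotype.txt is a made-up output name and ICD_10 is the header taken from the expected output above. It assumes the pseudoIDs contain no spaces.

```shell
# Scratch directory with the sample rows from this thread (tab-delimited).
cd "$(mktemp -d)" || exit 1
printf 'FID\tIID\tPseudo_ID\n'   >  44k.txt
printf '1\t150023532\tE78GJHI\n' >> 44k.txt
printf '1\t150023457\tE96GH25\n' >> 44k.txt
printf '1\t150075826\tE56HFT7\n' >> 44k.txt
printf '1\t150065943\tEH87HN7\n' >> 44k.txt
printf '1\t150034923\tENM8H53\n' >> 44k.txt
printf 'E78GJHI\nENM8H53\n' > cases.txt
printf 'E96GH25\nEH87HN7\n' > controls.txt

# Read the two ID lists into a lookup table, then annotate the 44k file.
# If an ID appears in both lists, the control label (read last) wins.
awk -v OFS='\t' '
    { sub(/\r$/, "") }                              # drop a Windows carriage return, if any
    FILENAME == ARGV[1] { status[$1] = 1; next }    # first file: cases
    FILENAME == ARGV[2] { status[$1] = 0; next }    # second file: controls
    FNR == 1 { print $1, $2, $3, "ICD_10"; next }   # header row of the 44k file
    { print $1, $2, $3, (($3 in status) ? status[$3] : "NA") }
' cases.txt controls.txt 44k.txt > phenotype.txt

cat phenotype.txt
```

Printing $1, $2, $3 explicitly instead of the whole line also drops any stray trailing whitespace after the pseudoID, which would otherwise stop an ID from matching its entry in the case or control list.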
