Question

Matching complementary genotypes from 2 different files

0

Entering edit mode

7.4 years ago

mittu1602 ▴ 200

I have 2 files tab separated as follows:

file1:

rs8192678 GG VO2max PPARGC1A Higher
rs2282679 AC Vitamin D deficiency

file2:

    rs8192678 GG 
    rs2282679 AC 
    rs8192678 CC
    rs2282679 AG

result: where even its complementary should also match with file1

rs8192678 GG GG VO2max PPARGC1A Higher 
rs8192678 GG CC VO2max PPARGC1A Higher
rs2282679 AC AC Vitamin D deficiency
rs2282679 AC AG Vitamin D deficiency

Thank you

genomics • 1.5k views

ADD COMMENT • link updated 7.4 years ago by Kevin Blighe 89k • written 7.4 years ago by mittu1602 ▴ 200

score 2 · Accepted Answer · 2017-12-28

2

Entering edit mode

7.4 years ago

Kevin Blighe 89k

This will match just on the rs ID (the piped sort command at the end is not necessary):

awk 'BEGIN {FS="\t"} FNR==NR {key=$1; arrayVars[key]=$0; next} {key=$1; if (arrayVars[key]) split(arrayVars[key], var, "\t"); if (arrayVars[key]) print var[1]"\t"var[2]"\t"$2"\t"var[3]}' file1 file2 | sort -r -k1
rs8192678   GG  GG  VO2max PPARGC1A Higher
rs8192678   GG  CC  VO2max PPARGC1A Higher
rs2282679   AC  AG  Vitamin D deficiency
rs2282679   AC  AC  Vitamin D deficiency

-----------------------------------

This will match on rs ID and genotype, should you require that in the future:

awk 'BEGIN {FS="\t"} FNR==NR {key=$1$2; arrayVars[key]=$0; next} {key=$1$2; if (arrayVars[key]) split(arrayVars[key], var, "\t"); if (arrayVars[key]) print var[1]"\t"var[2]"\t"$2"\t"var[3]}' file1 file2 | sort -r -k1
rs8192678   GG  GG  VO2max PPARGC1A Higher
rs2282679   AC  AC  Vitamin D deficiency

If you want an explanation, then let me know. There are undoubtedly many other ways to do this.

Kevin

ADD COMMENT • link 7.4 years ago by Kevin Blighe 89k

0

Entering edit mode

Hi Kevin, Thanks for the reply. But my aim is to look for complementary genotypes and there are high chances that file2 would also contain genotypes that are not complementary to file1 genotypes. if something can be looked in that way. but none the less, thank you so much.

ADD REPLY • link 7.4 years ago by mittu1602 ▴ 200

0

Entering edit mode

Please explain better with some sample input and output

ADD REPLY • link 7.4 years ago by Kevin Blighe 89k

0

Entering edit mode

File1:

 rs492602   **CC**  Vitamin B12 deficiency  FUT2    Higher levels of vitamin B12        
    rs492602    CT  Vitamin B12 deficiency  FUT2    Normal levels of vitamin B12        
    rs492602    TT     Vitamin B12 deficiency   FUT2    Normal levels of vitamin B12

File2:

    rs492602    **GG**  exm-rs492602
    rs492602    **CC**  exm-rs492602

Result:

rs492602**CC** **GG** Vitamin B12 deficiency    FUT2    Higher levels of vitamin B12    
rs492602**CC** **CC** Vitamin B12 deficiency    FUT2    Higher levels of vitamin B12

ADD REPLY • link 7.4 years ago by mittu1602 ▴ 200

0

Entering edit mode

Are those asterisks in the actual data or did you just put them there for this example?

ADD REPLY • link 7.4 years ago by Kevin Blighe 89k

0

Entering edit mode

for this example only, wanted to bold them to stand out!

ADD REPLY • link 7.4 years ago by mittu1602 ▴ 200

1

Entering edit mode

I see. Yes, the formatting doesn't always come out right on these forums.

You can try this:

cat file1 
rs492602    CC  Vitamin B12 deficiency  FUT2    Higher levels of vitamin B12
rs492602    CT  Vitamin B12 deficiency  FUT2    Normal levels of vitamin B12
rs492602    TT  Vitamin B12 deficiency  FUT2    Normal levels of vitamin B12

cat file2
rs492602    GG  exm-rs492602
rs492602    CC  exm-rs492602

paste <(cut -f2 file1 | tr [ATGC] [TACG]) file1 | awk 'BEGIN {FS="\t"} FNR==NR {key1=$2$3; key2=$2$1; arrayVars[key1]=$0; arrayVars[key2]=$0; next} {key=$1$2; if (arrayVars[key]) split(arrayVars[key], var, "\t"); if (arrayVars[key]) print var[2]"\t"var[3]"\t"$2"\t"var[4]"\t"var[5]"\t"var[6]}' - file2
rs492602    CC  GG  Vitamin B12 deficiency  FUT2    Higher levels of vitamin B12
rs492602    CC  CC  Vitamin B12 deficiency  FUT2    Higher levels of vitamin B12

Here I find the complementary base with the first part of the pipe with paste <(cut -f2 file1 | tr [ATGC] [TACG]) file1, and this is then piped into awk as stdin. If there are also more columns to print out, just add more var[2]"\t" to the final print command.

Hope that this works!

ADD REPLY • link 7.4 years ago by Kevin Blighe 89k

0

Entering edit mode

Thank you so much @Kevin it really worked and helped me. Much appreciated for your time over this! Can u also suggest me that if I can use it as a shell script? Happy New Year!