Question

Convert genotype to nucleotide genotype?

8.6 years ago

SheelS ▴ 40

Hi there, I have a CSV file that looks like the one below, and I would like to convert the genotype codes (A or B) to nucleotide genotypes (A, T, G, C).

Marker, alleleA, alleleB,  id1, id1, id2, id2,    # id is people
rs1,          G,       T,    A,   A,   A,   B,
rs2,          A,       G,    B,   A,   A,   A, 
rs3,          C,       T,    0,   0,   A,   B,    # 0 is missing

After converting.....

Marker, alleleA, alleleB,  id1, id1, id2, id2,    
rs1,          G,       T,    G,   G,   G,   T, # for rs1, the 'A' is replaced by G, and 'B' by T
rs2,          A,       G,    G,   A,   A,   A, 
rs3,          C,       T,    0,   0,   C,   T,

Does anyone know a smart way to do this?

Thanks in advance for any solution or advice!

R genotype python • 2.6k views

score 0 · Answer 1 · 2016-04-11

8.6 years ago

WouterDeCoster 47k

An almost functional Python script (so as not to completely spoil the fun for you): it does not handle the header line or the missing genotype '0' yet.

import csv, sys

# csv.reader needs an open file object, not the path string itself
with open(sys.argv[1]) as infile, open('converted.csv', 'w') as output:
    for line in csv.reader(infile, delimiter=','):
        genotypes = []
        pos = line[0]
        alleleA = line[1]
        alleleB = line[2]
        for samplefield in line[3:]:
            if samplefield == 'A':
                genotypes.append(alleleA)
            elif samplefield == 'B':
                genotypes.append(alleleB)
            else:
                # Header fields and missing genotypes ('0') end up here
                sys.exit('Unexpected input!')
        output.write("{},{},{},{}\n".format(pos, alleleA, alleleB, ','.join(genotypes)))

Let me know if you need some more help getting it working.

Thanks, but sorry, it looks like it does not work. I get "IndexError: list index out of range" at alleleA = line[1].

ADD REPLY • link 8.6 years ago by SheelS ▴ 40

Oh crap I made a mistake, will edit. Delimiter = ','

ADD REPLY • link 8.6 years ago by WouterDeCoster 47k

Sorry, but the issue is still there. Any ideas?

ADD REPLY • link 8.6 years ago by SheelS ▴ 40

This should do the trick now. My bad, I left some sloppy pseudocode in the almost functional script above. This script definitely isn't the most efficient if you have millions of lines, but it should do the job. Warning: Python 2.7 syntax. I could make it a few lines shorter, but readability counts and explicit is better than implicit, ya know ;-)

import csv, sys

with open(sys.argv[1]) as input, open('converted.csv', 'w') as output:
    data = csv.reader(input, delimiter=',')
    for line in data:
        if line[0].startswith('Marker'): #Catch the headerline
            output.write("{}\n".format(','.join(line)))
        else:
            genotypes = [] 
            pos = line[0]
            alleleA = line[1]
            alleleB = line[2]
            for samplefield in line[3:]:
                if samplefield == 'A':
                    genotypes.append(alleleA)
                elif samplefield == 'B':
                    genotypes.append(alleleB)
                elif samplefield == '0':
                    genotypes.append('0')
                else:
                    sys.exit('!! Unexpected input: {}'.format(','.join(line)))
            output.write("{},{},{},{}\n".format(pos, alleleA, alleleB, ','.join(genotypes)))
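For comparison, the shorter variant mentioned above could replace the if/elif chain with a dictionary lookup. This is only a minimal sketch: `convert_row` is a hypothetical helper name, and it assumes each row has already been split into clean fields (no padding spaces, no trailing empty field).

```python
def convert_row(row):
    """Map 'A'/'B' calls in row[3:] to the alleles given in row[1] and row[2]."""
    marker, allele_a, allele_b = row[:3]
    # '0' marks a missing genotype and is passed through unchanged
    lookup = {'A': allele_a, 'B': allele_b, '0': '0'}
    return [marker, allele_a, allele_b] + [lookup[call] for call in row[3:]]

print(convert_row(['rs1', 'G', 'T', 'A', 'A', 'A', 'B']))
# → ['rs1', 'G', 'T', 'G', 'G', 'G', 'T']
```

Any call other than 'A', 'B' or '0' raises a KeyError here instead of exiting with a message, so the explicit version is friendlier when the input is messy.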

I wonder where the trailing comma at the end of each line in your example comes from. Correct me if I'm wrong, but is that common for a CSV file? To remove it easily:

sed -i 's/,$//g' input.csv
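If the real file looks exactly like the example in the question, the cleanup could also be done in Python while reading. A blank line is returned by csv.reader as an empty list, so line[0] on such a row raises "IndexError: list index out of range", which might be what is happening here. The sketch below assumes Python 3; `clean_rows` is a hypothetical helper name, and the sample string is made up to mimic the question's layout. It does not deal with trailing "# ..." comments.

```python
import csv
import io

def clean_rows(handle):
    """Yield rows with padding, trailing empty fields and blank lines removed."""
    for row in csv.reader(handle, delimiter=',', skipinitialspace=True):
        fields = [field.strip() for field in row]
        # A trailing comma leaves one or more empty fields at the end of the row
        while fields and fields[-1] == '':
            fields.pop()
        if fields:  # skip blank lines, which csv.reader returns as []
            yield fields

sample = io.StringIO("rs1,          G,       T,    A,   A,   A,   B,\n\nrs2, A, G, B, A, A, A, \n")
for fields in clean_rows(sample):
    print(fields)
```

Each cleaned row can then go through the same A/B/0 conversion as in the script above, with no need to edit the input file first.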

ADD REPLY • link 8.6 years ago by WouterDeCoster 47k

Sorry, but it is still the same issue: list index out of range. Would you mind if I took your code and asked about it somewhere else? No offense intended, that way we can find out how to solve it! : )