Question

Convert the genotype format from numeric to letters

5.5 years ago

waqaskhokhar999 ▴ 160

I want to use IVAS for sQTL analysis. It accepts only allelic encoding of genotypes, so each genotype must be written as two letters from A, C, G, T.

The format of my vcf file is like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  108 139 159 265

1   73  .   C   A   40  PASS    .   GT:DP:GQ    0|0:5:40    0|0:9:40    0|0:6:38    ./.:.:.

1   83  .   T   C,A 40  PASS    .   GT:DP:GQ    1|1:5:40    1|1:9:40    0|0:8:38    ./.:.:.

I want to convert the genotypes from numeric to letter format, like this:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  108 139 159 265

1   73  .   C   A   40  PASS    .   GT:DP:GQ    CC  CC  CC  NA

1   83  .   T   C,A 40  PASS    .   GT:DP:GQ    CC  CC  TT  NA

SNP vcf • 3.2k views

updated 2.3 years ago by shaunjc • written 5.5 years ago by waqaskhokhar999

score 0 · Answer 1 · 2019-05-22

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import sys
import os
import gzip


class Letter(object):
    standard_nt = ['A', 'C', 'G', 'T']

    def __init__(self, vcf):
        self.vcf = vcf
        self.get_head()

    def _open(self):
        # Text mode ('rt') so gzipped and plain input both yield str lines.
        if os.path.splitext(self.vcf)[-1] == '.gz':
            return gzip.open(self.vcf, 'rt')
        return open(self.vcf)

    def get_head(self):
        # Keep every header line so it can be copied to the output unchanged.
        self.vcfhead = []
        with self._open() as handle:
            for h in handle:
                if h.startswith('#'):
                    self.vcfhead.append(h)
                else:
                    break

    def parse_vcf(self):
        with self._open() as handle:
            for i in handle:
                if i.startswith('#'):
                    continue
                line = i.strip().split()
                line[7] = '.'    # drop INFO
                line[8] = 'GT'   # only the genotype is kept per sample
                ref = line[3]
                alt = line[4]
                if len(ref) > 1 or len(alt) > 1:
                    # Indels and multi-allelic sites (e.g. ALT "C,A") are skipped.
                    sys.stderr.write('only accept BIALLELIC SNP, removing site %s:%s\n'
                                     % (line[0], line[1]))
                    continue
                if ref not in self.standard_nt or alt not in self.standard_nt:
                    sys.stderr.write('only accept valid nucleotide, removing site %s:%s\n'
                                     % (line[0], line[1]))
                    continue
                part1 = '\t'.join(line[0:9])
                sample_gt = []
                for gt in line[9:]:
                    gt = gt.split(':')[0]
                    if gt in ('0|0', '0/0'):
                        letter = ref * 2
                    elif gt in ('0|1', '0/1'):
                        letter = ref + alt
                    elif gt in ('1|0', '1/0'):
                        letter = alt + ref   # heterozygous, ALT listed first
                    elif gt in ('1|1', '1/1'):
                        letter = alt * 2
                    else:
                        letter = 'NA'        # missing or unrecognized genotype
                    sample_gt.append(letter)
                yield part1 + '\t' + '\t'.join(sample_gt)

    def write_out(self):
        # Write next to the input as 'letter.<name>'; compress only if the input was gzipped.
        dirname, basename = os.path.split(self.vcf)
        out_path = os.path.join(dirname, 'letter.' + basename)
        opener = gzip.open if basename.endswith('.gz') else open
        with opener(out_path, 'wt') as f:
            f.write(''.join(self.vcfhead))
            for line in self.parse_vcf():
                f.write(line + '\n')


if __name__ == '__main__':
    vcf = sys.argv[1]
    Letter(vcf).write_out()

Save the script as letter.py and run `python letter.py yours.vcf` (a gzipped .vcf.gz file works too). The result is written to a file with the 'letter.' prefix.
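The script above skips any site with more than one ALT allele, such as the second record in the question (ALT `C,A`). A minimal sketch of a more general lookup, assuming the GT indices simply point into the list REF, ALT1, ALT2, ...: the function name `gt_to_letters` and the sample columns below are made up for illustration, and the letters are concatenated, so this only makes sense for SNPs.

```python
def gt_to_letters(sample_field, ref, alt, missing='NA'):
    """Translate one sample column (e.g. '0|1:5:40') into allele letters."""
    alleles = [ref] + alt.split(',')       # index 0 = REF, 1.. = ALT alleles
    gt = sample_field.split(':')[0]        # GT is the first FORMAT key
    indices = gt.replace('|', '/').split('/')
    letters = []
    for idx in indices:
        if not idx.isdigit() or int(idx) >= len(alleles):
            return missing                 # './.', '.', or an out-of-range index
        letters.append(alleles[int(idx)])
    return ''.join(letters)


# Hypothetical sample columns for a site with REF=T and ALT=C,A
samples = ['0|1:5:40', '1|2:9:40', '2|2:8:38', './.:.:.']
print([gt_to_letters(s, 'T', 'C,A') for s in samples])
# → ['TC', 'CA', 'AA', 'NA']
```

Something like this could replace the if/elif chain in `parse_vcf`, provided the `len(alt) > 1` check is relaxed so that comma-separated ALT values are allowed.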

score 0 · Answer 2 · 2022-08-19

I have a similar problem: I have downloaded a file that seems to be in phased nucleotide format (A|A, A|G, G|G, etc.). How could I modify this script to convert it back, based on the ref and alt alleles, to phased numeric values (0|0, 0|1, 1|1)?

I feel like it would need a lot of modification to determine which nucleotide matches ref or alt and then convert accordingly?
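The reverse direction may need less work than it seems, as long as the file still carries the REF and ALT columns: each nucleotide can be looked up in the list REF, ALT1, ALT2, ... and replaced by its position. A rough sketch under that assumption (the function name `letters_to_gt` is hypothetical, and the two alleles are assumed to be separated by `|` or `/`):

```python
def letters_to_gt(genotype, ref, alt, missing='.'):
    """Translate a nucleotide genotype such as 'A|G' into allele indices ('0|1')."""
    alleles = [ref] + alt.split(',')       # index 0 = REF, 1.. = ALT alleles
    sep = '|' if '|' in genotype else '/'  # keep the original phasing separator
    indices = []
    for base in genotype.split(sep):
        # Anything that is neither REF nor a listed ALT becomes missing.
        indices.append(str(alleles.index(base)) if base in alleles else missing)
    return sep.join(indices)


# Hypothetical site with REF=A and ALT=G
for g in ['A|A', 'A|G', 'G|A', 'G|G', 'N|N']:
    print(g, '->', letters_to_gt(g, 'A', 'G'))
# A|A -> 0|0, A|G -> 0|1, G|A -> 1|0, G|G -> 1|1, N|N -> .|.
```

In the script, this lookup would take the place of the if/elif chain that currently maps `0|0`, `0|1` and `1|1` to letters.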