How to "haploidize" diploid SNPs data in a vcf file
2
0
Entering edit mode
7.8 years ago
BioinfoNovice ▴ 110

Dear all,

I have a vcf file containing SNPs data of multiple individuals where most are haploid while some are diploid. What I want to do is to "haploidize" the diploid individual, meaning that I want to to randomly take one allel out for the two for all loci of those individuals.

What kind of tools can I use to do that ? Or what is the scripting manner to adopt ?

For example,

scaffold1       25042   .       G       A       13300.6 PASS    AC=5;AF=0.179;AN=28;BaseQRankSum=-5.920e-01;ClippingRankSum=0.373;DP=1268;ExcessHet=0.7918;FS=5.925;MQ=57.37;MQRankSum=-1.031e+00;QD=32.39;ReadPosRankSum=0.943;SOR=1.255       GT:AD:DP:GQ:PL  0/0:41,0:41:90:0,90,1528        0/1:13,16:29:99:616,0,498       0:56,0:56:99:0,1800     0:66,0:66:99:0,1800     0:82,0:82:99:0,1800     0/0:19,0:19:33:0,33,495

I have the 1st, 2nd and 6th individuals are diploid while the others are haploid...what I want to do is to randomly take one of the two alleles for the diploid individuals... The resulting file could be another vcf file or under the form of tab separated file.

Any suggestion ?

Thank you very much in advance.

vcf SNP ploidy • 7.0k views
ADD COMMENT
0
Entering edit mode

if you were to work with the same problem but on the plink ped files, here's my approach after recoding alleles as 0/1/2:

import random
import itertools
infile = open("recode12.ped","r")
outfile = open("new_genotypes.ped","w")

check_11 = ('1','1')
check_22 = ('2','2')
def random_gen():
        choice_m = [check_11,check_22]
        res = random.choice(choice_m)
        return res

def parse_ped(ped_infile):
        check_12 = ('1','2')
        check_21 = ('2','1')

        for line in ped_infile:
                row = line.split()
                head = row[0:6]
                genotype_old = row[6:]
                genotype_f = row[6:]
                it = iter(genotype_f)
                genotype_t = zip(it, it)
                for n,i in enumerate(genotype_t):
                        if i == check_12 or i == check_21:
                                genotype_t[n]=random_gen()
                print genotype_t[0:10]
                genotype_t = list(itertools.chain(*genotype_t))
                print genotype_t[0:10]

                new_line = head+genotype_t
                outfile.write(" ".join(new_line)+"\n")
        outfile.close()

parse_ped(infile)
ADD REPLY
1
Entering edit mode
7.8 years ago

My solution using java/javascript nashorn. The script removes the INFO/FILTER/QUAL data and you might get some non-variant (variant with no ALT allele...).

ADD COMMENT
0
Entering edit mode

Thank you very much for sharing ! I don't know much about java but I will try to use it.

ADD REPLY
0
Entering edit mode

Hi,

I tried using the script. However, I get the error:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  BEN-SA3 BEN-ZOO1    BEN-CI5 BEN-SI5 BEN-CI2 BEN-NE2 BEN-SI1 BEN-NE1 BEN-CI7 BEN-NE3 BEN-ZOO2    BEN-NOR2    BEN-NOR1    BEN-CI6 BEN-CI8 BEN-CI3 BEN-SI3 BEN-ZOO6    BEN-ZOO3    BEN-SI2
biostar236160.js:59 ReferenceError: "Genotype" is not defined

I am not able to understand what is happening. Can you please help fix this error.

Thanks a lot

ADD REPLY
0
Entering edit mode

strangely the line

var Genotype = Java.type("htsjdk.variant.variantcontext.Genotype");

was missing. I've updated the code.

ADD REPLY
1
Entering edit mode
7.8 years ago
Len Trigg ★ 1.6k

Here is another simple approach using RTG Tools -- while it's still using javascript, it's got a more AWK'ish flavour than Pierre's solution, so it depends what you're most comfortable with:

$ cat haploidify.js
function record() {
  for (i = 0; i < SAMPLES.length; i++) {
    if (has(SAMPLES[i].GT)) {
      alleles = SAMPLES[i].GT.split(/\D/);
      SAMPLES[i].GT = alleles[Math.floor(Math.random() * alleles.length)];
    }
  }
}
$ rtg vcffilter -i input.vcf -o output.vcf --javascript haploidify.js
ADD COMMENT

Login before adding your answer.

Traffic: 2586 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6