How to "haploidize" diploid SNPs data in a vcf file
2
0
Entering edit mode
8.2 years ago
BioinfoNovice ▴ 110

Dear all,

I have a vcf file containing SNPs data of multiple individuals where most are haploid while some are diploid. What I want to do is to "haploidize" the diploid individual, meaning that I want to to randomly take one allel out for the two for all loci of those individuals.

What kind of tools can I use to do that ? Or what is the scripting manner to adopt ?

For example,

scaffold1       25042   .       G       A       13300.6 PASS    AC=5;AF=0.179;AN=28;BaseQRankSum=-5.920e-01;ClippingRankSum=0.373;DP=1268;ExcessHet=0.7918;FS=5.925;MQ=57.37;MQRankSum=-1.031e+00;QD=32.39;ReadPosRankSum=0.943;SOR=1.255       GT:AD:DP:GQ:PL  0/0:41,0:41:90:0,90,1528        0/1:13,16:29:99:616,0,498       0:56,0:56:99:0,1800     0:66,0:66:99:0,1800     0:82,0:82:99:0,1800     0/0:19,0:19:33:0,33,495

I have the 1st, 2nd and 6th individuals are diploid while the others are haploid...what I want to do is to randomly take one of the two alleles for the diploid individuals... The resulting file could be another vcf file or under the form of tab separated file.

Any suggestion ?

Thank you very much in advance.

vcf SNP ploidy • 7.4k views
ADD COMMENT
0
Entering edit mode

if you were to work with the same problem but on the plink ped files, here's my approach after recoding alleles as 0/1/2:

import random
import itertools
infile = open("recode12.ped","r")
outfile = open("new_genotypes.ped","w")

check_11 = ('1','1')
check_22 = ('2','2')
def random_gen():
        choice_m = [check_11,check_22]
        res = random.choice(choice_m)
        return res

def parse_ped(ped_infile):
        check_12 = ('1','2')
        check_21 = ('2','1')

        for line in ped_infile:
                row = line.split()
                head = row[0:6]
                genotype_old = row[6:]
                genotype_f = row[6:]
                it = iter(genotype_f)
                genotype_t = zip(it, it)
                for n,i in enumerate(genotype_t):
                        if i == check_12 or i == check_21:
                                genotype_t[n]=random_gen()
                print genotype_t[0:10]
                genotype_t = list(itertools.chain(*genotype_t))
                print genotype_t[0:10]

                new_line = head+genotype_t
                outfile.write(" ".join(new_line)+"\n")
        outfile.close()

parse_ped(infile)
ADD REPLY
1
Entering edit mode
8.2 years ago

My solution using java/javascript nashorn. The script removes the INFO/FILTER/QUAL data and you might get some non-variant (variant with no ALT allele...).

/**
Java JDK 8 is required.
The jjs command-line tool is used to invoke the java/javascript Nashorn engine.
You can use it to interpret one or several script files, or to run an interactive shell.
HTSJDK is a Java API for high-throughput sequencing data (HTS) formats.
You can download the pre-compiled libraries from maven central.
```
mkdir -p lib/com/github/samtools/htsjdk/2.5.1/ && wget -O "lib/com/github/samtools/htsjdk/2.5.1/htsjdk-2.5.1.jar" "http://central.maven.org/maven2/com/github/samtools/htsjdk/2.5.1/htsjdk-2.5.1.jar"
mkdir -p lib/commons-logging/commons-logging/1.1.1/ && wget -O "lib/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar" "http://central.maven.org/maven2/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar"
mkdir -p lib/gov/nih/nlm/ncbi/ngs-java/1.2.4/ && wget -O "lib/gov/nih/nlm/ncbi/ngs-java/1.2.4/ngs-java-1.2.4.jar" "http://central.maven.org/maven2/gov/nih/nlm/ncbi/ngs-java/1.2.4/ngs-java-1.2.4.jar"
mkdir -p lib/org/apache/commons/commons-compress/1.4.1/ && wget -O "lib/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar" "http://central.maven.org/maven2/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar"
mkdir -p lib/org/apache/commons/commons-jexl/2.1.1/ && wget -O "lib/org/apache/commons/commons-jexl/2.1.1/commons-jexl-2.1.1.jar" "http://central.maven.org/maven2/org/apache/commons/commons-jexl/2.1.1/commons-jexl-2.1.1.jar"
mkdir -p lib/org/tukaani/xz/1.5/ && wget -O "lib/org/tukaani/xz/1.5/xz-1.5.jar" "http://central.maven.org/maven2/org/tukaani/xz/1.5/xz-1.5.jar"
mkdir -p lib/org/xerial/snappy/snappy-java/1.0.3-rc3/ && wget -O "lib/org/xerial/snappy/snappy-java/1.0.3-rc3/snappy-java-1.0.3-rc3.jar" "http://central.maven.org/maven2/org/xerial/snappy/snappy-java/1.0.3-rc3/snappy-java-1.0.3-rc3.jar"
```
How to invoke this script:
```
jjs \
-cp "/path/to/lib/com/github/samtools/htsjdk/2.5.1/htsjdk-2.5.1.jar:/path/to/lib/commons-logging/commons-logging/1.1.1/commons-logging-1.1.1.jar:/path/to/lib/gov/nih/nlm/ncbi/ngs-java/1.2.4/ngs-java-1.2.4.jar:/path/to/lib/org/apache/commons/commons-compress/1.4.1/commons-compress-1.4.1.jar:/path/to/lib/org/apache/commons/commons-jexl/2.1.1/commons-jexl-2.1.1.jar:/path/to/lib/org/tukaani/xz/1.5/xz-1.5.jar:/path/to/lib/org/xerial/snappy/snappy-java/1.0.3-rc3/snappy-java-1.0.3-rc3.jar" \
biostar236160.js -- input.vcf
```
*/
/** check the number of arguments */
if( arguments.length!=1) {
java.lang.System.err.println("Usage: <vcf file>");
java.lang.System.exit(-1);
}
/** get ArrayList and HashSet for handling genotypes and alleles */
var ArrayList= Java.type("java.util.ArrayList");
var HashSet= Java.type("java.util.HashSet");
/** random number */
var Random = Java.type("java.util.Random");
var rand = new Random();
/** get the java class htsjdk.variant.vcf.VCFFileReader https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/vcf/VCFFileReader.html */
var VCFFileReader = Java.type("htsjdk.variant.vcf.VCFFileReader");
/** get a class for defining a VCF writer */
var VariantContextWriterBuilder = Java.type("htsjdk.variant.variantcontext.writer.VariantContextWriterBuilder");
var Options = Java.type("htsjdk.variant.variantcontext.writer.Options");
/** get the java class for building variants and genotypes */
var VariantContextBuilder = Java.type("htsjdk.variant.variantcontext.VariantContextBuilder");
var GenotypeBuilder = Java.type("htsjdk.variant.variantcontext.GenotypeBuilder");
var Genotype = Java.type("htsjdk.variant.variantcontext.Genotype");
/** create a File object */
var vcfFile =new java.io.File(arguments[0]);
/** create a new VCFFileReader for this file https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/vcf/VCFFileReader.html#VCFFileReader-java.io.File-boolean- */
var requireIndex = false;
var vcfReader = new VCFFileReader( vcfFile, requireIndex );
/** get an iterator over the variants in this vcf file https://samtools.github.io/htsjdk/javadoc/htsjdk/htsjdk/variant/vcf/VCFFileReader.html#iterator-- */
var iter = vcfReader.iterator();
/** create a VCF writer to stdout */
var w = new VariantContextWriterBuilder().
unsetOption(Options.INDEX_ON_THE_FLY).
setOutputVCFStream(java.lang.System.out).
build();
/** write header */
w.writeHeader(vcfReader.getFileHeader());
/** while there is something to read */
while(iter.hasNext())
{
/** get the next variant */
var variant = iter.next();
var newgenotypes = new ArrayList();
var ctxalleles = new HashSet();
ctxalleles.add(variant.getReference());
for(var i=0;i< variant.getNSamples();++i)
{
var genotype= variant.getGenotype(i);
var alleles= genotype.getAlleles();
if(alleles.isEmpty() || genotype.isNoCall()) {
genotype = Genotype.createMissing(genotype.getSampleName(),1);
}
else if(alleles.size()>1)
{
var idx=rand.nextInt(alleles.size());
var a = alleles.get(idx);
ctxalleles.add(a);
var alist = new ArrayList();
alist.add(a);
genotype = new GenotypeBuilder(genotype.getSampleName(),alist).make();
}
else
{
ctxalleles.add(alleles.get(0));
}
newgenotypes.add(genotype);
}
w.add(new VariantContextBuilder(arguments[0],
variant.getContig(),
variant.getStart(),
variant.getEnd(),
ctxalleles
).genotypes(newgenotypes).make());
}
/** close the iterator */
iter.close();
/** close the vcfReader */
vcfReader.close();
/** close output */
w.close();

output:

(...)
#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	S1	S2	S3	S4
rotavirus	51	.	A	G	.	.	.	GT	0	0	0	1
rotavirus	91	.	A	.	.	.	.	GT	0	0	0	0
rotavirus	130	.	T	.	.	.	.	GT	0	0	0	0
rotavirus	232	.	T	A	.	.	.	GT	1	0	0	0
rotavirus	267	.	C	G	.	.	.	GT	1	0	0	0
rotavirus	424	.	A	G	.	.	.	GT	0	1	0	0
rotavirus	520	.	T	.	.	.	.	GT	0	0	0	0
rotavirus	536	.	A	T	.	.	.	GT	1	0	0	0
rotavirus	562	.	A	G	.	.	.	GT	0	0	0	1
rotavirus	583	.	G	.	.	.	.	GT	0	0	0	0
rotavirus	661	.	T	A	.	.	.	GT	0	1	0	0
rotavirus	693	.	T	G	.	.	.	GT	1	0	0	0
rotavirus	738	.	T	A	.	.	.	GT	0	0	1	0
rotavirus	799	.	A	C	.	.	.	GT	0	0	1	0
rotavirus	812	.	G	T	.	.	.	GT	0	0	1	0
rotavirus	833	.	G	A	.	.	.	GT	1	0	0	0
rotavirus	916	.	A	T	.	.	.	GT	1	0	0	0
rotavirus	946	.	C	A	.	.	.	GT	0	1	0	0
rotavirus	961	.	T	.	.	.	.	GT	0	0	0	0
rotavirus	1044	.	A	T	.	.	.	GT	0	1	0	0
rotavirus	1045	.	C	G	.	.	.	GT	0	0	1	0
rotavirus	1054	.	C	G	.	.	.	GT	0	1	0	0
rotavirus	1064	.	G	A	.	.	.	GT	0	0	0	1
view raw Example.md hosted with ❤ by GitHub

ADD COMMENT
0
Entering edit mode

Thank you very much for sharing ! I don't know much about java but I will try to use it.

ADD REPLY
0
Entering edit mode

Hi,

I tried using the script. However, I get the error:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  BEN-SA3 BEN-ZOO1    BEN-CI5 BEN-SI5 BEN-CI2 BEN-NE2 BEN-SI1 BEN-NE1 BEN-CI7 BEN-NE3 BEN-ZOO2    BEN-NOR2    BEN-NOR1    BEN-CI6 BEN-CI8 BEN-CI3 BEN-SI3 BEN-ZOO6    BEN-ZOO3    BEN-SI2
biostar236160.js:59 ReferenceError: "Genotype" is not defined

I am not able to understand what is happening. Can you please help fix this error.

Thanks a lot

ADD REPLY
0
Entering edit mode

strangely the line

var Genotype = Java.type("htsjdk.variant.variantcontext.Genotype");

was missing. I've updated the code.

ADD REPLY
1
Entering edit mode
8.2 years ago
Len Trigg ★ 1.7k

Here is another simple approach using RTG Tools -- while it's still using javascript, it's got a more AWK'ish flavour than Pierre's solution, so it depends what you're most comfortable with:

$ cat haploidify.js
function record() {
  for (i = 0; i < SAMPLES.length; i++) {
    if (has(SAMPLES[i].GT)) {
      alleles = SAMPLES[i].GT.split(/\D/);
      SAMPLES[i].GT = alleles[Math.floor(Math.random() * alleles.length)];
    }
  }
}
$ rtg vcffilter -i input.vcf -o output.vcf --javascript haploidify.js
ADD COMMENT

Login before adding your answer.

Traffic: 1854 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6