Question

Back-filling missing genotypes in merged VCF

4

10.1 years ago

Katie D'Aco ★ 1.1k

Is there a good way to distinguish ./. from 0/0 in a merged vcf? For example, a tool that goes back to the bam files for missing genotypes and checks if it's homozygous reference or a NO CALL? I would imagine this would be important to do, especially in 30x WGS where there are a lot of low coverage areas that lead to no calls.

Although, I guess if you have the bam files maybe the best thing to do is joint variant calling?

vcf • 9.3k views

ADD COMMENT • link updated 23 months ago by Pierre Lindenbaum 164k • written 10.1 years ago by Katie D'Aco ★ 1.1k

1

Funnily, I wrote this tool on Friday. Give me a few minutes to push my sources...

ADD REPLY • link 10.1 years ago by Pierre Lindenbaum 164k

0

Here it is: https://github.com/lindenb/jvarkit/commit/e102079cf8a284c52782177bd12ed2edaddf1dba

ADD REPLY • link 10.1 years ago by Pierre Lindenbaum 164k

0

3.1 years ago

Julia • 0

Hello,

I have a problem.

[SEVERE][FixVcfMissingGenotypes]No BAM index available for bam.list 
[INFO][Launcher]fixvcfmissinggenotypes Exited with failure (-1)

but I have the .bam and .bai in the same directory.

5,9M Jul 9 12:13 genome1.bai
2,7G Jul 15 07:40 genome1.bam
5,9M Jul 9 12:17 genome2.bai
2,8G Jul 15 07:39 genome2.bam

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 3.1 years ago by Julia • 0

0

What was the command line?

ADD REPLY • link 3.1 years ago by Pierre Lindenbaum 164k

0

23 months ago

rafael • 0

Hello!

I'm trying to use the software FixVcfMissingGenotypes and I've been facing some issues.

command used:

java -jar /dist/fixvcfmissinggenotypes.jar -B bamzin.list Exomas_merge_DP16_3ind.vcf > teste6_ind_fixed.vcf

afterwards the program prints the following on my terminal:

[INFO][FixVcfMissingGenotypes]Count: 107,106 Elapsed: 11 seconds(68.00%) Remains: 5 seconds(32.00%) Last: chr14:20,147,053

[INFO][FixVcfMissingGenotypes]. Completed. N=159,095. That took:16 seconds

My problem is that the VCF file it creates does not update the missing genotypes (./.) to homozygous reference (0/0) at all (I checked using grep); it only sets the DP of the missing genotypes to 0.

Does anybody have an idea of what I've been doing wrong?

Thanks in advance :)

ADD COMMENT • link 23 months ago by rafael • 0

0

What is the output of

bcftools query -l Exomas_merge_DP16_3ind.vcf

and

cat bamzin.list | samtools samples

ADD REPLY • link 23 months ago by Pierre Lindenbaum 164k

0

thanks for the reply!

bcftools query -l Exomas_merge_DP16_3ind.vcf
01GO_S2
01MAR_S1
01MG_S3

and

cat ../bamzin.list | samtools samples
.       /home/rafaeltou/homozigoty/bai/01GO_S2.bam
.       /home/rafaeltou/homozigoty/bai/01MAR_S1.bam
.       /home/rafaeltou/homozigoty/bai/01MG_S3.bam

ADD REPLY • link 23 months ago by rafael • 0

0

Your BAM files are missing read groups with an SM: (sample) tag; that is the "." in the first column of the samtools samples output. This should have been specified when the reads were mapped. See https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups
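To check this by hand, the sample names declared in a BAM header can be listed and compared with the sample names in the VCF. The sketch below is only an illustration: `rg_samples` is a made-up helper name, and it reads SAM header text on stdin so it can be tried without any BAM file.

```shell
# Hypothetical helper (not part of samtools or jvarkit): print the distinct
# SM (sample) values found in the @RG lines of a SAM header read from stdin.
rg_samples() {
    awk -F'\t' '
        /^@RG/ {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SM:/) print substr($i, 4)
        }
    ' | sort -u
}

# Typical use (needs samtools and bcftools on the PATH):
#   samtools view -H sample.bam | rg_samples
#   bcftools query -l merged.vcf
# The names printed by the two commands have to match.
```

If the first command prints nothing, the BAM has no usable read group. Tools such as samtools addreplacerg or Picard AddOrReplaceReadGroups can add one after mapping; check their documentation for the exact options.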

ADD REPLY • link 23 months ago by Pierre Lindenbaum 164k

Ram · Accepted Answer · 2014-11-09

8

10.1 years ago

Pierre Lindenbaum 164k

I just wrote http://lindenb.github.io/jvarkit/FixVcfMissingGenotypes.html It takes a merged VCF and uses the original BAMs to fill in the missing genotypes: if the number of reads at the site is greater than min.depth, the missing genotype is set to hom-ref. This tool is very new/alpha; please use the GitHub issue tracker to tell me if there is a problem.
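As a rough illustration of the rule described above (this is not the tool's actual code), the decision for one sample at one site could be sketched as follows. The function name `backfill_genotype` and the threshold of 10 are made up for the example.

```shell
# Hypothetical helper, not part of jvarkit: decide what a missing genotype
# becomes, given the read depth seen in the BAM and a minimum-depth threshold.
backfill_genotype() {
    depth=$1
    min_depth=$2
    if [ "$depth" -gt "$min_depth" ]; then
        echo "0/0"   # enough reads and no variant was called: assume hom-ref
    else
        echo "./."   # too few reads to say anything: keep the no-call
    fi
}

backfill_genotype 25 10   # prints 0/0
backfill_genotype 3 10    # prints ./.
```

Under this rule a site with no reads at all stays ./., so a depth of 0 is expected on genotypes that were not back-filled.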

Usage:

$ yourtool-mergingvcf 1.vcf 2.vcf 3.vcf > merged.vcf
$ find ./ -name "*.bam" > bams.txt
$ java -jar dist/fixvcfmissinggenotypes.jar -f bams.txt merged.vcf > out.vcf
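To see whether a run changed anything, one option is to count no-call and hom-ref genotypes before and after. This is a minimal sketch, assuming an uncompressed VCF with GT as the first FORMAT key and unphased diploid genotypes; `count_genotypes` is a made-up helper name.

```shell
# Hypothetical helper: count no-call (./.) and hom-ref (0/0) genotypes in an
# uncompressed VCF read from stdin.
count_genotypes() {
    awk -F'\t' '
        /^#/ { next }                       # skip header lines
        {
            for (i = 10; i <= NF; i++) {    # sample columns start at column 10
                split($i, f, ":")           # GT is the first colon-separated field
                if (f[1] == "./.") missing++
                else if (f[1] == "0/0") homref++
            }
        }
        END { printf "missing=%d homref=%d\n", missing, homref }
    '
}

# Usage:
#   count_genotypes < merged.vcf
#   count_genotypes < out.vcf
```

If the two counts are identical, no genotype was back-filled, which usually points to a sample-name mismatch between the VCF and the BAM read groups rather than to low coverage.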

ADD COMMENT • link updated 2.8 years ago by Ram 44k • written 10.1 years ago by Pierre Lindenbaum 164k

1

Excellent! I had to think for a minute about how "correct" only using read depth is to decipher ./. from 0/0, but we also know that there was originally no variant called at that location. So it will generally be pretty accurate. Looking forward to trying this tool.

ADD REPLY • link 10.1 years ago by Katie D'Aco ★ 1.1k

0

Yes, I didn't want to create another SNP caller :-)

ADD REPLY • link 10.1 years ago by Pierre Lindenbaum 164k

0

When I wanted to compile this submodule I got the following error:

    [javac] /jvarkit/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java:70: illegal start of type
    [javac]     private Map<String,Set<File>> sample2bam=new HashMap<>();
    [javac]                                                          ^
    [javac] /jvarkit/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java:103: illegal start of type
    [javac]             Set<File> bamFiles=new HashSet<>();
    [javac]                                            ^
    [javac] /jvarkit/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java:163: illegal start of type
    [javac]                                             set=new HashSet<>();
    [javac]                                                             ^
    [javac] /jvarkit/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java:209: illegal start of type
    [javac]                             List<SamReader> samReaders= new ArrayList<>(bams.size());
    [javac]                                                                       ^
    [javac] /jvarkit/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java:270: illegal start of type
    [javac]                                     List<Allele> homozygous=new ArrayList<>(2);
                                    ^
    [javac] 5 errors

BUILD FAILED
/jvarkit/build.xml:836: The following error occurred while executing this line:
/jvarkit/build.xml:238: Compile failed; see the compiler error output for details.

Does anybody have the same problem?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by janez.jenko ▴ 10

0

What's your version of Java?

$ javac -version
javac 1.7.0_07

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Pierre Lindenbaum 164k

0

Thanks! That was the problem, I had a java version "1.6.0_32" and when using java version "1.7.0_09" compilation was successful.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by janez.jenko ▴ 10

0

Is there a plan to do this also for vcf.gz files?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by janez.jenko ▴ 10

1

It should work with vcf.gz.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Pierre Lindenbaum 164k

0

Using java 1.0.7_09 I tried to fill in the missing genotypes. I got several error messages:

[SEVERE/FixVcfMissingGenotypes] 2014-12-09 23:04:04 "For input string: ".""
java.lang.NumberFormatException: For input string: "."
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
        at java.lang.Double.parseDouble(Double.java:540)
        at htsjdk.variant.variantcontext.GenotypeLikelihoods.parseDeprecatedGLString(GenotypeLikelihoods.java:250)
        at htsjdk.variant.variantcontext.GenotypeLikelihoods.fromGLField(GenotypeLikelihoods.java:81)
        at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:724)
        at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:134)
        at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:123)
        at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:353)
        at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:285)
        at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:263)
        at com.github.lindenb.jvarkit.util.vcf.VcfIterator.next(VcfIterator.java:63)
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.doWork(FixVcfMissingGenotypes.java:219)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMain(AbstractCommandLineProgram.java:470)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMainWithExit(AbstractCommandLineProgram.java:484)
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.main(FixVcfMissingGenotypes.java:328)
[INFO/FixVcfMissingGenotypes] 2014-12-09 23:04:04 "End JOB status=-1 [Tue Dec 09 23:04:04 GMT 2014] com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes done. Elapsed time: 0.02 minutes."
[SEVERE/FixVcfMissingGenotypes] 2014-12-09 23:04:04 "##### ERROR: return status = -1################"

Are all these problems related to the version of java? As I do not have privileges to install programs on the computer where I am running this job, I cannot quickly test whether java version 1.0.7_09 would solve this problem.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by janez.jenko ▴ 10

0

No, it's a problem with your VCF file: the HTS-JDK library is not able to parse it (an old format?). It seems to contain a genotype likelihood (GL) field holding only a ".".

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Pierre Lindenbaum 164k

0

As a test, try to parse your VCF with a Picard tool like SplitVcfs: http://broadinstitute.github.io/picard/command-line-overview.html

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Pierre Lindenbaum 164k

0

Known problem: http://gatkforums.broadinstitute.org/discussion/4362/error-message-for-input-string-when-validating-variants

This is saying that you have "." in a genotype likelihood field, which I think shouldn't happen.

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by Pierre Lindenbaum 164k

1

Thank you very much Pierre, your comments are really helpful. The missing "." genotype likelihoods were generated when the VCF files from different individuals were merged together. They occur at multi-allelic sites where different samples carry different alternative alleles. A possible solution would be to keep only the biallelic sites and back-fill the missing genotypes only for those. The other option would be to fill in -10 for all the missing GL values.
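The first option, keeping only biallelic sites, can be sketched with a small filter. This is an illustration under stated assumptions: it reads an uncompressed VCF on stdin, treats a record as biallelic when the ALT column holds exactly one allele, and `keep_biallelic` is a made-up name.

```shell
# Hypothetical helper: keep header lines and records with exactly one ALT
# allele (ALT is not "." and contains no comma). Reads a VCF from stdin.
keep_biallelic() {
    awk -F'\t' '/^#/ || ($5 != "." && $5 !~ /,/)'
}

# Usage:
#   keep_biallelic < merged.vcf > merged.biallelic.vcf
# With bcftools installed, a comparable filter is:
#   bcftools view -m2 -M2 merged.vcf > merged.biallelic.vcf
```

Dropping the multi-allelic records loses data, so it is worth checking how many records are removed before relying on the filtered file.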

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 10.0 years ago by janez.jenko ▴ 10

0

I am trying to fill in missing genotypes in VCF files from BAM files.

When I use the realigned BAM files, it works.

But when I use the original BAM files, it doesn't.

I get this error message:

[INFO/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "Reading header for /home/User3/data/original/RB4.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "reading from /home/User3/data/annotated.sample23-haplo.eff.vcf"
[INFO/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "Adding 'java.io.tmpdir' directory to the list of tmp directories"
[INFO/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "Sample: PRO28-RB12"
[SEVERE/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "null"
java.lang.NullPointerException
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.doWork(FixVcfMissingGenotypes.java:209)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMain(AbstractCommandLineProgram.java:470)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMainWithExit(AbstractCommandLineProgram.java:484)
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.main(FixVcfMissingGenotypes.java:328)
[INFO/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "End JOB status=-1 [Fri Jan 23 10:31:51 EST 2015] com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes done. Elapsed time: 0.52 minutes."
[SEVERE/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "##### ERROR: return status = -1################"

Can you help me to solve this trouble?

Thanks!

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 9.8 years ago by biobigdata • 0

0

Can you see a read group with "SM:PRO28-RB12" in the output of

samtools view -H /home/User3/data/original/RB4.bam

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 9.8 years ago by Pierre Lindenbaum 164k

0

No, I can't see SM:PRO28-RB12.

PRO28-RB12 and RB4 are different samples.

$ fixvcf -f bam_lists.txt /home/User3/data/GATK/annotated.sample23-haplo.eff.vcf > haplo.test2.vcf
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:32:24 "Starting JOB at Thu Jan 22 16:32:24 EST 2015 com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes version=801e96ea74dc515bb5de8dd02f64063c0cd137aa  built=2015-01-12 15:52:14"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:32:24 "Command Line args : -f bam_lists.txt /home/User3/data/GATK/annotated.sample23-haplo.eff.vcf"


[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:54 "Reading header for /home/User3/data/original/PRO28-RB12.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:54 "Reading header for /home/User3/data/original/RB21.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:54 "Reading header for /home/User3/data/original/RB16.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:54 "Reading header for /home/User3/data/original/RB20.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:55 "Reading header for /home/User3/data/original/RB1.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:55 "Reading header for /home/User3/data/original/RB17.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:55 "Reading header for /home/User3/data/original/RB28.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:17 "Reading header for /home/User3/data/original/RB7.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "Reading header for /home/User3/data/original/RB4.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "reading from /home/jbyun/User3/data/GATK/annotated.sample23-haplo.eff.vcf"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "Adding 'java.io.tmpdir' directory to the list of tmp directories"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "Sample: PRO28-RB12"
[SEVERE/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "null"
java.lang.NullPointerException
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.doWork(FixVcfMissingGenotypes.java:209)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMain(AbstractCommandLineProgram.java:470)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMainWithExit(AbstractCommandLineProgram.java:484)
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.main(FixVcfMissingGenotypes.java:328)
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "End JOB status=-1 [Thu Jan 22 16:34:35 EST 2015] com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes done. Elapsed time: 2.19 minutes."
[SEVERE/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "##### ERROR: return status = -1################"

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 9.8 years ago by biobigdata • 0

0

When I used realigned BAM files, it worked.

Some of ./. are fixed and others not.

[INFO/FixVcfMissingGenotypes] 2015-01-22 16:36:17 "Sample: RB19"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:36:17 "Opening /home/User3/data/GATK/recal-realigned-s-GATK-RB19.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:36:30 "done sample RB19 fixed=15 not-fixed=46 total=7991 genotypes"

What do the not-fixed ones indicate?

Thanks!

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 9.8 years ago by biobigdata • 0

0

Thanks for this excellent tool! Is there a way to speed it up, like multi-thread computing?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 9.0 years ago by Guoshuai Cai • 0

0

Thanks for developing this tool. Is there a way to speed it up?

ADD REPLY • link 7.2 years ago by aadeokar • 0

0

I'm afraid not :-) You can always try to split by region and then concatenate the VCFs later. Or use brute force and set all no-calls to hom-ref: http://lindenb.github.io/jvarkit/VcfNoCallToHomRef.html
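The split-by-region idea could look roughly like the dry run below, which only prints the commands and executes nothing. Everything here is an assumption to adapt: the file names are placeholders, `print_split_plan` is a made-up helper, the jar path and the -B option are copied from commands quoted elsewhere in this thread, and the tool is assumed to read the VCF from stdin when no file is given.

```shell
# Hypothetical dry run: for each contig name read from stdin, print the
# command that would extract that contig and back-fill it, then print the
# final concatenation command. Nothing is executed.
# Note: bcftools view -r needs a bgzip-compressed, indexed VCF.
print_split_plan() {
    vcf=$1
    bam_list=$2
    parts=""
    while read -r contig; do
        echo "bcftools view -r $contig $vcf | java -jar dist/fixvcfmissinggenotypes.jar -B $bam_list > fixed.$contig.vcf"
        parts="$parts fixed.$contig.vcf"
    done
    echo "bcftools concat$parts > fixed.all.vcf"
}

# Usage:
#   printf 'chr1\nchr2\n' | print_split_plan merged.vcf.gz bams.txt
```

The per-contig commands are independent of each other, so they can be run in parallel before the final concatenation.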

ADD REPLY • link 7.2 years ago by Pierre Lindenbaum 164k

0

Hello,

I tried running this command and got this:

[SEVERE][Launcher]There was an error in the command line arguments. Please check your parameters : Expected one or zero argument but got 2 : [bams.txt, merged_filtered_sort.recode.vcf]

What can I do to resolve this?

ADD REPLY • link updated 2.8 years ago by Ram 44k • written 3.9 years ago by nitinra ▴ 50