Back-filling missing genotypes in merged VCF
3
4
Entering edit mode
10.2 years ago
Katie D'Aco ★ 1.1k

Is there a good way to distinguish ./. from 0/0 in a merged vcf? For example, a tool that goes back to the bam files for missing genotypes and checks if it's homozygous reference or a NO CALL? I would imagine this would be important to do, especially in 30x WGS where there are a lot of low coverage areas that lead to no calls.

Although, I guess if you have the bam files maybe the best thing to do is joint variant calling?

vcf • 9.6k views
ADD COMMENT
1
Entering edit mode

funnily, I wrote this tool Friday, give me a few minutes to push my sources....

ADD REPLY
8
Entering edit mode
10.2 years ago

I just wrote http://lindenb.github.io/jvarkit/FixVcfMissingGenotypes.html It takes a merged VCF and uses the original Bams to fill the missing genotypes. . If the number of reads is greater than min.depth, then the missing genotype is said hom-ref. This tool is very new/alpha, use the github issue tracker to tell me if there is a problem please.

Usage:

$ yourtool-mergingvcf 1.vcf 2.vcf 3.vcf > merged.vcf
$ find ./ -name "*.bam" > bams.txt
$  java -jar dist/fixvcfmissinggenotypes.jar -f bams.txt merged.vcf > out.vcf
ADD COMMENT
1
Entering edit mode

Excellent! I had to think for a minute about how "correct" only using read depth is to decipher ./. from 0/0, but we also know that there was originally no variant called at that location. So it will generally be pretty accurate. Looking forward to trying this tool.

ADD REPLY
0
Entering edit mode

yes , I didn't want to create another SNP caller :-)

ADD REPLY
0
Entering edit mode

When I wanted to compile this submodule I got the following error:

    [javac] /jvarkit/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java:70: illegal start of type
    [javac]     private Map<String,Set<File>> sample2bam=new HashMap<>();
    [javac]                                                          ^
    [javac] /jvarkit/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java:103: illegal start of type
    [javac]             Set<File> bamFiles=new HashSet<>();
    [javac]                                            ^
    [javac] /jvarkit/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java:163: illegal start of type
    [javac]                                             set=new HashSet<>();
    [javac]                                                             ^
    [javac] /jvarkit/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java:209: illegal start of type
    [javac]                             List<SamReader> samReaders= new ArrayList<>(bams.size());
    [javac]                                                                       ^
    [javac] /jvarkit/src/main/java/com/github/lindenb/jvarkit/tools/misc/FixVcfMissingGenotypes.java:270: illegal start of type
    [javac]                                     List<Allele> homozygous=new ArrayList<>(2);
                                    ^
    [javac] 5 errors

BUILD FAILED
/jvarkit/build.xml:836: The following error occurred while executing this line:
/jvarkit/build.xml:238: Compile failed; see the compiler error output for details.

Does anybody have the same problem?

ADD REPLY
0
Entering edit mode

What's your version of java ?

$ javac -version
javac 1.7.0_07
ADD REPLY
0
Entering edit mode

Thanks! That was the problem, I had a java version "1.6.0_32" and when using java version "1.7.0_09" compilation was successful.

ADD REPLY
0
Entering edit mode

Is there a plan to do this also for vcf.gz files?

ADD REPLY
1
Entering edit mode

it should work with vcf.gz

ADD REPLY
0
Entering edit mode

Using java 1.0.7_09 I tried to fill in the missing genotypes. I got several error messages:

[SEVERE/FixVcfMissingGenotypes] 2014-12-09 23:04:04 "For input string: ".""
java.lang.NumberFormatException: For input string: "."
        at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:1241)
        at java.lang.Double.parseDouble(Double.java:540)
        at htsjdk.variant.variantcontext.GenotypeLikelihoods.parseDeprecatedGLString(GenotypeLikelihoods.java:250)
        at htsjdk.variant.variantcontext.GenotypeLikelihoods.fromGLField(GenotypeLikelihoods.java:81)
        at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(AbstractVCFCodec.java:724)
        at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(AbstractVCFCodec.java:134)
        at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(LazyGenotypesContext.java:123)
        at htsjdk.variant.vcf.AbstractVCFCodec.parseVCFLine(AbstractVCFCodec.java:353)
        at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:285)
        at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:263)
        at com.github.lindenb.jvarkit.util.vcf.VcfIterator.next(VcfIterator.java:63)
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.doWork(FixVcfMissingGenotypes.java:219)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMain(AbstractCommandLineProgram.java:470)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMainWithExit(AbstractCommandLineProgram.java:484)
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.main(FixVcfMissingGenotypes.java:328)
[INFO/FixVcfMissingGenotypes] 2014-12-09 23:04:04 "End JOB status=-1 [Tue Dec 09 23:04:04 GMT 2014] com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes done. Elapsed time: 0.02 minutes."
[SEVERE/FixVcfMissingGenotypes] 2014-12-09 23:04:04 "##### ERROR: return status = -1################"

Are all these problems related to the version of java? As I do not have privileges to install programs on a computer where I am running this job I can not test in a sort time if java version 1.0.7_09 would solve this problem.

ADD REPLY
0
Entering edit mode

no, it's a problem with your vcf file. the HTS-JDK library is not able to parse it (an old format ?). It seems that it contains a Genotype/GL with only a ".".

ADD REPLY
0
Entering edit mode

As a test, try to parse your VCF with a picard tool like SplitVCF http://broadinstitute.github.io/picard/command-line-overview.html.

ADD REPLY
0
Entering edit mode

Known problem: http://gatkforums.broadinstitute.org/discussion/4362/error-message-for-input-string-when-validating-variants

This is saying that you have "." in a genotype likelihood field, which I think shouldn't happen.

ADD REPLY
1
Entering edit mode

Thank you very much Pierre, your comments are really helpful. Missing "." genotypes likelihood is generated after the VCF files from different individuals were merged together. They are occurring on multi allelic sites with different alternative variants in different samples. Maybe a possible solution would be to keep only the biallelic sites and back fill the missing genotypes only for them. The other option would be to fill in -10 for all the missing GL sites.

ADD REPLY
0
Entering edit mode

I try to filling missing genotypes into vcf files from bam files.

Once I used the realigned-bamfiles, it works.

But I used the original bamfiles, it doesn't work.

I got the error message;

[INFO/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "Reading header for /home/User3/data/original/RB4.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "reading from /home/User3/data/annotated.sample23-haplo.eff.vcf"
[INFO/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "Adding 'java.io.tmpdir' directory to the list of tmp directories"
[INFO/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "Sample: PRO28-RB12"
[SEVERE/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "null"
java.lang.NullPointerException
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.doWork(FixVcfMissingGenotypes.java:209)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMain(AbstractCommandLineProgram.java:470)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMainWithExit(AbstractCommandLineProgram.java:484)
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.main(FixVcfMissingGenotypes.java:328)
[INFO/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "End JOB status=-1 [Fri Jan 23 10:31:51 EST 2015] com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes done. Elapsed time: 0.52 minutes."
[SEVERE/FixVcfMissingGenotypes] 2015-01-23 10:31:51 "##### ERROR: return status = -1################"

Can you help me to solve this trouble?

Thanks!

ADD REPLY
0
Entering edit mode

Can you see the read-GROUP with "SN:PRO28-RB12"? in

samtools -H /home/User3/data/original/RB4.bam
ADD REPLY
0
Entering edit mode

No, I can't see SN:PRO28-Rb12.

PRO28-RB12 and RB4 are different samples.

$ fixvcf -f bam_lists.txt /home/User3/data/GATK/annotated.sample23-haplo.eff.vcf > haplo.test2.vcf
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:32:24 "Starting JOB at Thu Jan 22 16:32:24 EST 2015 com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes version=801e96ea74dc515bb5de8dd02f64063c0cd137aa  built=2015-01-12 15:52:14"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:32:24 "Command Line args : -f bam_lists.txt /home/User3/data/GATK/annotated.sample23-haplo.eff.vcf"


[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:54 "Reading header for /home/User3/data/original/PRO28-RB12.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:54 "Reading header for /home/User3/data/original/RB21.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:54 "Reading header for /home/User3/data/original/RB16.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:54 "Reading header for /home/User3/data/original/RB20.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:55 "Reading header for /home/User3/data/original/RB1.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:55 "Reading header for /home/User3/data/original/RB17.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:33:55 "Reading header for /home/User3/data/original/RB28.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:17 "Reading header for /home/User3/data/original/RB7.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "Reading header for /home/User3/data/original/RB4.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "reading from /home/jbyun/User3/data/GATK/annotated.sample23-haplo.eff.vcf"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "Adding 'java.io.tmpdir' directory to the list of tmp directories"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "Sample: PRO28-RB12"
[SEVERE/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "null"
java.lang.NullPointerException
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.doWork(FixVcfMissingGenotypes.java:209)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMain(AbstractCommandLineProgram.java:470)
        at com.github.lindenb.jvarkit.util.AbstractCommandLineProgram.instanceMainWithExit(AbstractCommandLineProgram.java:484)
        at com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes.main(FixVcfMissingGenotypes.java:328)
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "End JOB status=-1 [Thu Jan 22 16:34:35 EST 2015] com.github.lindenb.jvarkit.tools.misc.FixVcfMissingGenotypes done. Elapsed time: 2.19 minutes."
[SEVERE/FixVcfMissingGenotypes] 2015-01-22 16:34:35 "##### ERROR: return status = -1################"
ADD REPLY
0
Entering edit mode

When I used realigned BAM files, it worked.

Some of ./. are fixed and others not.

[INFO/FixVcfMissingGenotypes] 2015-01-22 16:36:17 "Sample: RB19"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:36:17 "Opening /home/User3/data/GATK/recal-realigned-s-GATK-RB19.bam"
[INFO/FixVcfMissingGenotypes] 2015-01-22 16:36:30 "done sample RB19 fixed=15 not-fixed=46 total=7991 genotypes"

What does not-fixed ones indicate?

Thanks!

ADD REPLY
0
Entering edit mode

Thanks for this excellent tool! Is there a way to speed it up, like multi-thread computing?

ADD REPLY
0
Entering edit mode

Thanks for developing this tool. Is there a way to speed it up..

ADD REPLY
0
Entering edit mode

I'm afraid no :-) you can always try to split by region and the concat the VCFs later. Or the brute force: set all no calls to hom-ref: http://lindenb.github.io/jvarkit/VcfNoCallToHomRef.html

ADD REPLY
0
Entering edit mode

Hello,

I tried running this command and got this:

[SEVERE][Launcher]There was an error in the command line arguments. Please check your parameters : Expected one or zero argument but got 2 : [bams.txt, merged_filtered_sort.recode.vcf]

What can I do to resolve this?

ADD REPLY
0
Entering edit mode
3.3 years ago
Julia • 0

Hello,

I have a problem.

[SEVERE][FixVcfMissingGenotypes]No BAM index available for bam.list 
[INFO][Launcher]fixvcfmissinggenotypes Exited with failure (-1)

but I have the .bam and .bai in the same directory.

5,9M Jul 9 12:13 genome1.bai
2,7G Jul 15 07:40 genome1.bam
5,9M Jul 9 12:17 genome2.bai
2,8G Jul 15 07:39 genome2.bam
ADD COMMENT
0
Entering edit mode

what was the command line ?

ADD REPLY
0
Entering edit mode
2.1 years ago
rafael • 0

Hello!

I'm trying to use the software FixVcfMissingGenotypes and i've been facing some issues.

command used:

java -jar /dist/fixvcfmissinggenotypes.jar -B bamzin.list Exomas_merge_DP16_3ind.vcf > teste6_ind_fixed.vcf

afterwards the program prints the following on my terminal:

[INFO][FixVcfMissingGenotypes]Count: 107,106 Elapsed: 11 seconds(68.00%) Remains: 5 seconds(32.00%) Last: chr14:20,147,053

[INFO][FixVcfMissingGenotypes]. Completed. N=159,095. That took:16 seconds

my problem is: the VCF file it creates does not update the missing variants (./.) to reference (0/0) at all (i checked using grep), and updates the DP on the missing variants to 0.

Does anybody have an idea of what i've been doing wrong?

Thanks in advance :)

ADD COMMENT
0
Entering edit mode

what the output of

bcftools query -l Exomas_merge_DP16_3ind.vcf 

and

cat bamzin.list | samtools samples
ADD REPLY
0
Entering edit mode

thanks for the reply!

bcftools query -l Exomas_merge_DP16_3ind.vcf
01GO_S2
01MAR_S1
01MG_S3

and

cat ../bamzin.list | samtools samples
.       /home/rafaeltou/homozigoty/bai/01GO_S2.bam
.       /home/rafaeltou/homozigoty/bai/01MAR_S1.bam
.       /home/rafaeltou/homozigoty/bai/01MG_S3.bam
ADD REPLY
0
Entering edit mode

Your samples are missing read groups with SM: . This should have been specified when the read were mapped. See https://gatk.broadinstitute.org/hc/en-us/articles/360035890671-Read-groups

ADD REPLY

Login before adding your answer.

Traffic: 3383 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6