Question

extracting depth info from vcf

0

Entering edit mode

7.0 years ago

victoria_aleks ▴ 60

Dear all, I need to know the "coverage" of sites in vcf files, but the files i am working with do not contain DP info, only AD. Most of the tools for vcf parcing use DP parameter. I understand that I can use AD to estimate depth, too. Is there any known ways of how to? Will be happy to any help:)

vcf AD depth • 6.1k views

ADD COMMENT • link 7.0 years ago by victoria_aleks ▴ 60

Pierre Lindenbaum · Answer 1 · 2017-11-16

0

Entering edit mode

7.0 years ago

Pierre Lindenbaum 164k

Use gatk variantAnnotator https://software.broadinstitute.org/gatk/gatkdocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php with Coverage https://software.broadinstitute.org/gatk/gatkdocs/3.8-0/org_broadinstitute_gatk_tools_walkers_annotator_Coverage.php

 ls *.bam > bam.list
 java -jar GenomeAnalysisTK.jar \
   -R reference.fasta \
   -T VariantAnnotator \
   -I bam.list \
   -V input.vcf \
   -o output.vcf \
   -A Coverage \
   -L input.vcf

ADD COMMENT • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

and what if i dont have neither bams nor reference fasta? I only downloaded vcf files from the source

ADD REPLY • link 7.0 years ago by victoria_aleks ▴ 60

0

Entering edit mode

and what if i dont have neither bams

you cannot get the depth without the bams...

nor reference fasta?

a reference sequence is always needed to run gatk. https://gatkforums.broadinstitute.org/gatk/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

ah, sorry, now I see why you're asking: you already have AD...

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

Ok here is a solution using vcfilterjdk http://lindenb.github.io/jvarkit/VcfFilterJdk.html

First I use awk to insert the new VCF header for DP . Then I loop over the genotypes and I create a new variant.

 curl -s "http://www.icbi.at/svnsimplex/CommonBioCommands/tags/simplex-1.0/CommonBioCommands/testdata/vcf/AMS1_converted_filtered_short_chr1.vcf" |\
 awk '/^#CHROM/ {printf("##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"\">\n");} {print;}' | \
java -jar dist/vcffilterjdk.jar  -e 'List<Genotype> gl = variant.getGenotypes().stream().map(G->{int ad[]=G.getAD();if(ad==null || ad.length==0) return G; int c= Arrays.stream(ad).sum(); return new GenotypeBuilder(G).DP(c).make();}).collect(Collectors.toList());return new VariantContextBuilder(variant).genotypes(gl).make();'

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

thank you very much. unfortunately vcffilterjdk constantly gives errors, in my hands :) Is it possible just write a home-made script to extract AD values for each line in vcf-file? I read that its better not to mess with vcf parcing and rather use the ready software. but right now it seems easier just code something by our own...

ADD REPLY • link 7.0 years ago by victoria_aleks ▴ 60

0

Entering edit mode

unfortunately vcffilterjdk constantly gives errors, in my hands

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

:) basically my question was: does it worth it to write a script myself or better not to mess? anyway, the error is: 2 errors

[SEVERE][InMemoryCompiler]compile.call() failed for custom class VcfFilterJdkCustom887993132 java.lang.RuntimeException: compile.call() failed for custom class VcfFilterJdkCustom887993132 at com.github.lindenb.jvarkit.lang.InMemoryCompiler.compileClass(InMemoryCompiler.java:230) at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk$CtxWriterFactory.initialize(VcfFilterJdk.java:509) at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.doWork(VcfFilterJdk.java:593) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1183) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1341) at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.main(VcfFilterJdk.java:612) [SEVERE][VcfFilterJdk]Cannot compile custom class VcfFilterJdkCustom887993132 java.lang.RuntimeException: Cannot compile custom class VcfFilterJdkCustom887993132 at com.github.lindenb.jvarkit.lang.InMemoryCompiler.compileClass(InMemoryCompiler.java:234) at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk$CtxWriterFactory.initialize(VcfFilterJdk.java:509) at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.doWork(VcfFilterJdk.java:593) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1183) at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1341) at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.main(VcfFilterJdk.java:612) Caused by: java.lang.RuntimeException: compile.call() failed for custom class VcfFilterJdkCustom887993132 at com.github.lindenb.jvarkit.lang.InMemoryCompiler.compileClass(InMemoryCompiler.java:230) ... 5 more [SEVERE][VcfFilterJdk]Cannot initialize [INFO][Launcher]vcffilterjdk Exited with failure (-1) thank you:)

ADD REPLY • link updated 7.0 years ago by Pierre Lindenbaum 164k • written 7.0 years ago by victoria_aleks ▴ 60

0

Entering edit mode

does it worth it to write a script myself or better not to mess?

I do understand that.

But it's a pity for the developer (aka 'me') to see that his tool is not used just because 'it doesn't work'. :-)

It's hard to debug without the context. what is exactly your command line please ?

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

i see:) honestly i repeated exactly what you wrote. except, first i ran awk command (i checked the resulting file, and DP appeared there. only: i do not understand why do I need incorporate DP in the header if I want to use AD anyway?). then I set this awk-mdified file as the input for vcffilterjdk:

vcffilterjdk awk.test -e 'List<Genotype> gl = variant.getGenotypes().stream().map(G->{int ad[]=G.getAD();if(ad==null || ad.length==0) return G; int c= Arrays.stream(ad).sum(); return new GenotypeBuilder(G).DP(c).make();}).collect(Collectors.toList());return new VariantContextBuilder(variant).genotypes(gl).make();'

I also tried to use -f instead -e; the error was:

[SEVERE][VcfFilterJdk]List<Genotype> (No such file or directory)
java.io.FileNotFoundException: List<Genotype> (No such file or directory)
    at java.io.FileInputStream.open0(Native Method)
    at java.io.FileInputStream.open(FileInputStream.java:195)
    at java.io.FileInputStream.<init>(FileInputStream.java:138)
    at htsjdk.samtools.util.IOUtil.slurp(IOUtil.java:999)
    at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk$CtxWriterFactory.initialize(VcfFilterJdk.java:436)
    at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.doWork(VcfFilterJdk.java:593)
    at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:1183)
    at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:1341)
    at com.github.lindenb.jvarkit.tools.vcffilterjs.VcfFilterJdk.main(VcfFilterJdk.java:612)
[SEVERE][VcfFilterJdk]Cannot initialize
[INFO][Launcher]vcffilterjdk Exited with failure (-1)

(in case, i tried to pipe awk into vcffilterjdk, as you have shown - didnt help)

I hope it gives some useful info

ADD REPLY • link updated 7.0 years ago by Pierre Lindenbaum 164k • written 7.0 years ago by victoria_aleks ▴ 60

1

Entering edit mode

why do I need incorporate DP in the header if I want to use AD anyway?

because my script produces a new attribute DP and it must be declared in the VCF header if it wasn't declared before.

 new GenotypeBuilder(G).DP(c).make()

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

you say

vcffilterjdk awk.test -e '...'

it should be something like

   java -jar dist/vcffilterjdk.jar   -e '...'  file.vcf

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

its a "wrapper", as our IT guy likes to use short words:) so, "java -jar ..." is included in "vcffilterjdk"

ADD REPLY • link 7.0 years ago by victoria_aleks ▴ 60

0

Entering edit mode

but I never use it and your messages show me that there is an error in the script (thanks for pointing this out!) the script ends with

$*

while I think I should I written it as (with double quotes)

"$@"

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

you mean that you want to correct the script?

ADD REPLY • link 7.0 years ago by victoria_aleks ▴ 60

0

Entering edit mode

I've just fixed this https://github.com/lindenb/jvarkit/commit/7f811475a7648d24289702f49d53c89fb53761c9#diff-b67911656ef5d18c4ae36cb6741b7965R133

but anyway, I've never said you should use this wrapper, use the java -jar ... syntax please.

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

1

Entering edit mode

now it worked:) as I understand the script summated AD for ref and alt and wrote it into DP parameter. but now i try to use vcftools to extract this DP (for example, --depth or --geno-depth) and it reports on 'missing entries'. and i still cannot use gatk, i think, as i have only vcf files, no fasta or bam

is there something that can be done here?:)

ADD REPLY • link 7.0 years ago by victoria_aleks ▴ 60

0

Entering edit mode

I'm glad it's working now !

the script summated AD for ref and alt and wrote it into DP parameter. but now i try to use vcftools to extract this DP (for example)

piping my example into 'bcftools view' produces no error on my side

you can use another program bioalcidaejdk similar to vcffilterjdk:

java -jar dist/bioalcidaejdk.jar -e 'stream().forEach(V->{V.getGenotypes().stream().forEach(G->{println(V.getContig()+"\t"+V.getStart()+"\t"+G.getSampleName()+"\t"+G.getAlleles()+"\t"+G.getDP());});});' input.vcf

ADD REPLY • link 7.0 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

thank you a lot:) it all finally seem to work now!

ADD REPLY • link 7.0 years ago by victoria_aleks ▴ 60

0

Entering edit mode

but also: can I extract exactly AD values, not only DP?? I want to know how many reads contain ref and how many contain alt nucleotides.