bcftools filter by protein prediction
3
0
Entering edit mode
3.5 years ago
storm1907 ▴ 30

Hello, I have VEP annotated vcf files with following content:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  file
chr1    183937  .       G       A       58.9    PASS    CSQ=||||||||||||MODIFIER|FO538757.1|ENSG00000279928|ENST00000624431|unprocessed_pseudogene||4/4|||||;AC=1;AN=2  GT:GQ:DP:AD:VAF:PL      0/1:51:26:15,11:0.423077:58,0,51
chr1    601436  .       C       T       4.9     PASS    CSQ=||||||||||||MODIFIER|AL669831.3|ENSG00000230021|ENST00000634337|processed_transcript|4/5||404||||,||||||||||||MODIFIER|AL669831.3|ENSG00000230021|ENST00000634833|processed_transcript|3/6||317||||;AC=1;AN=2    GT:GQ:DP:AD:VAF:PL      0/1:5:26:19,7:0.269231:3,0,17 

I would like to filter out protein coding variants, but get following errors:

bcftools view -f "protein_coding" file > out
[E::bcf_write] Broken VCF record, the number of columns at chrX:152737049 does not match the number of samples (0 vs 1)
[main_vcfview] Error: cannot write to (null)

bcftools filter -i 'BIOTYPE="protein_coding"' file > aaa 
[filter.c:2491 filters_init1] Error: the tag "BIOTYPE" is not defined in the VCF header

How should I filter such variants, if the field is in CSQ field between pipes?

Thank you!

bcftools • 4.1k views
ADD COMMENT
1
Entering edit mode

Hello @storm1907, you asked many of questions in this community, which is totally fine. Though, almost none of the answers to these questions have received any upvotes or toggled an answer as accepted. Please take the time to acknowledge the effort the users have invested by upvoting helpful answers and comments. If an answer solved the issue please accept it.

ADD REPLY
0
Entering edit mode
3.5 years ago
Ram 44k

Please read the bcftools manual. filter -i is not the way to go here, nor is view -f. You need view -i. Experiment with bcftools and regular expressions in the EXPRESSION argument to -i. You will need to compare the content in INFO/CSQ and pick those records that have protein_coding in them.

ADD COMMENT
0
Entering edit mode

Thank you! However, bcftools does not recognise INFO field

bcftools view -i '%FILTER="PASS"' file > out
bcftools view -i '%INFO\n' file > out
[filter.c:2491 filters_init1] Error: the tag "%INFO" is not defined in the VCF header
ADD REPLY
0
Entering edit mode

Did you read the manual? You're using the field specification format from bcftools query in filter expressions.

ADD REPLY
0
Entering edit mode

From the manual:

Shell expansion:

Note that expressions must often be quoted because some characters have special meaning in the shell. An example of expression enclosed in single quotes which cause that the whole expression is passed to the program as intended:

bcftools view -i '%ID!="." & MAF[0]<0.01'

when doing the same with INFO field:

bcftools view -i '%INFO!="."' file > out
[filter.c:2491 filters_init1] Error: the tag "%INFO" is not defined in the VCF header
ADD REPLY
0
Entering edit mode

Oops, looks like the manual has a faulty example. Don't use %INFO, use INFO/CSQ. See any example except the one you used.

ADD REPLY
0
Entering edit mode
3.5 years ago
sbstevenlee ▴ 480

Lately I have been working with VEP-annotated VCF files a lot too, and if you are familiar with Python, I just wanted to let you know about the pyvcf and pyvep submodules I wrote. Also, your question motivated me to write a new method pyvep.filter_biotype in the 0.11.0 version:

from fuc import pyvcf, pyvep
vf = pyvcf.VcfFrame.from_file('master.vep.vcf')
filtered_vf = pyvep.filter_biotype(vf, 'protein_coding')
filtered_vf.to_file('master.vep.filtered.vcf')

The fuc package is in active development, so if you want more features regarding VEP-annotated VCF files, let me know in the comment section.

ADD COMMENT
0
Entering edit mode

Hello, I tried to visit the given link, but it does not exist. I am not very familiar with Python, and also I do work on Linux server

ADD REPLY
0
Entering edit mode

Sorry for the confusion, the link has been fixed. And I totally get your situation. I will think about devising a command line-based solution.

ADD REPLY
0
Entering edit mode

Hi, fyi, I implemented a new command called vep_query to my fuc package (0.12.0-dev branch). Below it its help message:

$ fuc vcf_vep -h
usage: fuc vcf_vep [-h] [--opposite] [--as_zero] vcf expr

This command will filter a VCF file annotated by Ensemble VEP. It essentially wraps the `pandas.DataFrame.query` method. For details on query expression, please visit the method's documentation page (https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.query.html#pandas-dataframe-query).

examples:
  $ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' > out.vcf
  $ fuc vcf_vep in.vcf 'SYMBOL != "TP53"' > out.vcf
  $ fuc vcf_vep in.vcf 'SYMBOL == "TP53"' --opposite > out.vcf
  $ fuc vcf_vep in.vcf 'Consequence in ["splice_donor_variant", "stop_gained"]' > out.vcf
  $ fuc vcf_vep in.vcf '(SYMBOL == "TP53") and (Consequence.str.contains("stop_gained"))' > out.vcf
  $ fuc vcf_vep in.vcf 'gnomAD_AF < 0.001' > out.vcf
  $ fuc vcf_vep in.vcf 'gnomAD_AF < 0.001' --as_zero > out.vcf

positional arguments:
  vcf         Ensemble VEP-annotated VCF file
  expr        query expression to evaluate

optional arguments:
  -h, --help  show this help message and exit
  --opposite  use this flag to return records that don’t meet the said criteria
  --as_zero   use this flag to treat missing values as zero instead of NaN

For your case, the command would be:

$ fuc vcf_vep in.vcf 'BIOTYPE == "protein_coding"' > out.vcf
ADD REPLY
0
Entering edit mode
3.5 years ago

use bctools +split-vep: https://samtools.github.io/bcftools/howtos/plugin.split-vep.html

I also wrote http://lindenb.github.io/jvarkit/VcfFilterSequenceOntology.html

java -jar dist/vcffilterso.jar  -A "SO:0001907"  in.vcf
ADD COMMENT
0
Entering edit mode

I can not use +split-vep plugin, as I get following issue:

The field "Consequence" is not present in INFO/CSQ: "Consequence annotations from Ensembl VEP. Format: 'Allele
ADD REPLY
0
Entering edit mode
I also tried SO tool, but did not work:

  java -jar /mnt/home/tools/jvarkit/dist/vcffilterso.jar -A "SO:0000010"   file > out
    [INFO][VcfFilterSequenceOntology]Will be using :SO:0000729(intein_containing) SO:0000010(protein_coding)
    [WARN][SnpEffPredictionParser]no INFO[EFF] or no description. This VCF was probably NOT annotated with SnpEff (old version). But it's not a problem if this tool doesn't need to access SnpEff Annotations.
    [WARN][AnnPredictionParser]no INFO[ANN] or no description This VCF was probably NOT annotated with SnpEff(ANN version) . But it's not a problem if this tool doesn't need to access SnpEff Annotations.
    [SEVERE][OnePassVcfLauncher]Line 80965: there aren't enough columns for line chrX       152737049       .       A       G       64.1    PASS    CSQ=||||||||||||LOW|MAGEA12|ENSG0000021340 (we expected 9 tokens, and saw 8 )
    htsjdk.tribble.TribbleException: Line 80965: there aren't enough columns for line chrX  152737049       .       A       G       64.1    PASS    CSQ=||||||||||||LOW|MAGEA12|ENSG0000021340 (we expected 9 tokens, and saw 8 )
            at htsjdk.variant.vcf.AbstractVCFCodec.decodeLine(AbstractVCFCodec.java:381)
            at htsjdk.variant.vcf.AbstractVCFCodec.decode(AbstractVCFCodec.java:328)
            at htsjdk.variant.vcf.VCFIteratorBuilder$VCFReaderIterator.advance(VCFIteratorBuilder.java:187)
            at htsjdk.variant.vcf.VCFIteratorBuilder$VCFReaderIterator.advance(VCFIteratorBuilder.java:162)
            at htsjdk.samtools.util.AbstractIterator.next(AbstractIterator.java:57)
            at com.github.lindenb.jvarkit.jcommander.OnePassVcfLauncher$VCFIter.next(OnePassVcfLauncher.java:78)
            at com.github.lindenb.jvarkit.jcommander.OnePassVcfLauncher$VCFIter.next(OnePassVcfLauncher.java:54)
            at com.github.lindenb.jvarkit.tools.vcffilterso.VcfFilterSequenceOntology.doVcfToVcf(VcfFilterSequenceOntology.java:470)
            at com.github.lindenb.jvarkit.jcommander.OnePassVcfLauncher.doWork(OnePassVcfLauncher.java:145)
            at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMain(Launcher.java:796)
            at com.github.lindenb.jvarkit.util.jcommander.Launcher.instanceMainWithExit(Launcher.java:959)
            at com.github.lindenb.jvarkit.tools.vcffilterso.VcfFilterSequenceOntology.main(VcfFilterSequenceOntology.java:741)
    [INFO][VcfFilterSequenceOntology]. Completed. N=80,699. That took:2 seconds
    [INFO][Launcher]vcffilterso Exited with failure (-1)
cat out
        ##contig=<ID=chrUn_GL000216v2,length=176608>
        ##contig=<ID=chrUn_GL000218v1,length=161147>
        ##contig=<ID=chrEBV,length=171823>
        ##fileDate=20201220
        ##reference=file:///home/dnanexus/GRCh38.no_alt_analysis_set.fasta
        ##source=strelka
        ##source_version=2.9.10
        ##startTime=Sun Dec 20 08:42:13 2020
        ##vcffilterso.meta=compilation:20210609092328 githash:037c703 htsjdk:2.24.1 date:20210609093555 cmd:-A SO:0000010 file
        #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  file
ADD REPLY
0
Entering edit mode

it's not a problem with the tool: your vcf is malformed : there are some FORMAT+sample(s) columns in the header but there is no genotype in the variant line.

ADD REPLY
0
Entering edit mode

OK, thank you! I will try to reannotate samples then

ADD REPLY

Login before adding your answer.

Traffic: 2031 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6