Question

bcftools split-vep: how to split the INFO column and assign the Ensembl VEP field names as column headers

2.5 years ago

amy__ ▴ 250

Hello,

I have seen a few answers to this but none seem to do what I would like.

I have a VEP-annotated VCF file whose header contains the Ensembl VEP lines, like this:

##VEP="v107" time="2022-09-12 19:16:50" cache="/home/c.c21087028/.vep/homo_sapiens/107_GRCh38" ensembl-io=107.a473894 ensembl-funcgen=107.0fbd7d5 ensembl=107.5f39899 ensembl-variation=107.db634f2 1000genomes="phase3" COSMIC="95" ClinVar="202201" HGMD-PUBLIC="20204" assembly="GRCh38.p13" dbSNP="154" gencode="GENCODE 41" genebuild="2014-07" gnomADe="r2.1.1" gnomADg="v3.1.2" polyphen="2.2.2" regbuild="1.0" sift="sift5.2.2"
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|DISTANCE|STRAND|FLAGS|VARIANT_CLASS|SYMBOL_SOURCE|HGNC_ID|CANONICAL|MANE_SELECT|MANE_PLUS_CLINICAL|TSL|APPRIS|CCDS|ENSP|SWISSPROT|TREMBL|UNIPARC|UNIPROT_ISOFORM|GENE_PHENO|SIFT|PolyPhen|DOMAINS|miRNA|HGVS_OFFSET|AF|AFR_AF|AMR_AF|EAS_AF|EUR_AF|SAS_AF|gnomADe_AF|gnomADe_AFR_AF|gnomADe_AMR_AF|gnomADe_ASJ_AF|gnomADe_EAS_AF|gnomADe_FIN_AF|gnomADe_NFE_AF|gnomADe_OTH_AF|gnomADe_SAS_AF|gnomADg_AF|gnomADg_AFR_AF|gnomADg_AMI_AF|gnomADg_AMR_AF|gnomADg_ASJ_AF|gnomADg_EAS_AF|gnomADg_FIN_AF|gnomADg_MID_AF|gnomADg_NFE_AF|gnomADg_OTH_AF|gnomADg_SAS_AF|MAX_AF|MAX_AF_POPS|CLIN_SIG|SOMATIC|PHENO|PUBMED|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE|TRANSCRIPTION_FACTORS|CADD_PHRED|CADD_RAW">
##CADD_PHRED=PHRED-like scaled CADD score
##CADD_RAW=Raw CADD score

I was wondering whether it is possible to split the CSQ annotations in the INFO field into separate columns, and to label each column with the matching field name from the header above.

I have tried this:

echo -e "CHROM\tPOS\tREF\tALT\t$(bcftools +split-vep -l input.vcf | cut -f 2 | tr '\n' '\t' | sed 's/\t$//')" > output.tsv
bcftools +split-vep -f '%CHROM\t%POS\t%REF\t%ALT\t%CSQ\n' -d -A tab input.vcf >> output.tsv

but it does not add the header line, and some of the fields listed above are missing from the output.

Thanks, I hope this makes sense. Amy

bcftools ensembl-vep vep • 2.4k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 2.5 years ago by amy__ ▴ 250

score 3 · Accepted Answer · 2022-09-14

2.5 years ago

dariober 15k

It would help to show an example of the expected output. If it helps, I recently used this command to convert a VEP-annotated VCF to TSV:

bcftools +split-vep -d -f '%CHROM %POS %ID %REF %ALT %QUAL %TYPE [%AD{0}] [%AD{1}] [%ALT_AF] [%SUM_ALT_AF] %SYMBOL %Gene %Feature %BIOTYPE %Consequence %IMPACT %Amino_acids %Codons\n' input.vcf > out.tsv

The full command, which also adds the header line and produces the output in "long" format, is here.
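
For readers without access to the linked command, here is a minimal sketch of one way to add a header line. It is not the linked command. It derives the column names from the same format string that is passed to `-f`, assuming the format uses `\t` separators. The field list and the file names `input.vcf` and `out.tsv` are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: build the header from the -f format string so that the header
# and the data columns stay in the same order.
# The field list below is illustrative, not the one from the linked command.
fmt='%CHROM\t%POS\t%REF\t%ALT\t%SYMBOL\t%Gene\t%Consequence\t%IMPACT\n'

# Drop the trailing "\n" and strip "%", "[" and "]" to leave bare column names.
header=$(printf '%s' "$fmt" | sed -e 's/\\n$//' -e 's/[][%]//g')

# %b expands the literal "\t" sequences into real tab characters.
printf '%b\n' "$header" > out.tsv

# Append the data rows only when bcftools and the input file are available.
if command -v bcftools >/dev/null 2>&1 && [ -f input.vcf ]; then
    bcftools +split-vep -d -f "$fmt" input.vcf >> out.tsv
fi
```

Recent bcftools releases also document a `-H`/`--print-header` option for split-vep; `bcftools +split-vep -h` shows whether the installed version has it.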

ADD COMMENT • link 2.5 years ago by dariober 15k

That worked great, thanks!

I thought I'd also add another answer I found that worked too:

bcftools +split-vep input.vcf -f '%ID\t%CHROM\t%POS\t%REF\t%ALT\t%CSQ\n' -d -A tab > output.tsv

This one doesn't add the header line, though; I might just do that as a second step with another bash command.
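
A rough sketch of what that second step could look like: pull the pipe-separated field list out of the CSQ header line and turn it into tab-separated column names. The toy header below is a shortened stand-in for the real one. The sketch assumes the CSQ Description ends with `">` and that `-A tab` emits the subfields in header order. The file name `header.tsv` is made up for illustration.

```shell
#!/usr/bin/env bash
# Toy stand-in for the VCF header; a real CSQ line lists many more fields.
# With real data the header could come from: bcftools view -h input.vcf
hdr_file=$(mktemp)
cat > "$hdr_file" <<'EOF'
##fileformat=VCFv4.2
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene">
EOF

# Keep only the list after "Format: ", drop the closing '">', and
# convert the "|" separators into tabs.
csq_cols=$(grep '^##INFO=<ID=CSQ,' "$hdr_file" \
    | sed -e 's/.*Format: //' -e 's/">$//' \
    | tr '|' '\t')

# The fixed columns must match the order used in the -f expression.
printf 'ID\tCHROM\tPOS\tREF\tALT\t%s\n' "$csq_cols" > header.tsv
rm -f "$hdr_file"

# With real data, the header and the split-vep rows can then be joined, e.g.:
#   cat header.tsv output.tsv > output_with_header.tsv
```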

Thanks!! Amy

ADD REPLY • link 2.5 years ago by amy__ ▴ 250

If anyone wants to know how to keep the FORMAT column and also split it into separate columns, you can use:

bcftools +split-vep input.vcf -f '%CHROM\t%ID\t%POS\t%REF\t%ALT\t%CSQ[\t%GT][\t%GQ][\t%DP][\t%MIN_DP][\t%AD][\t%VAF][\t%PL][\t%MED_DP]\n' -d -A tab > output.tsv
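
A hedged sketch of a matching header for the per-sample columns. It assumes split-vep follows the `bcftools query` convention that each bracketed block prints its tag for every sample before the next block starts. The sample names, the shortened CSQ field list, the subset of FORMAT tags, and the file name `format_header.tsv` are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sample names; with real data: samples=$(bcftools query -l input.vcf)
samples="sampleA sampleB"
# FORMAT tags in the same order as the bracketed blocks in -f (subset for brevity).
tags="GT GQ DP"
# Shortened, hypothetical list of CSQ subfields standing in for the full VEP list.
csq_cols='Allele\tConsequence\tIMPACT'

# Loop over tags first and samples second, mirroring [\t%GT][\t%GQ][\t%DP].
fmt_cols=""
for tag in $tags; do
    for s in $samples; do
        fmt_cols="${fmt_cols}\t${s}:${tag}"
    done
done

# %b expands the literal "\t" sequences into real tab characters.
printf '%b\n' "CHROM\tID\tPOS\tREF\tALT\t${csq_cols}${fmt_cols}" > format_header.tsv
```

Multi-valued tags such as AD and PL stay comma-separated inside a single column, so they still need only one header entry per sample.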

ADD REPLY • link 2.5 years ago by amy__ ▴ 250