Extract vcf file header and save it in Key-Value csv
2
0
Entering edit mode
6.0 years ago
tinyskinn • 0

Hi everyone,

If I have a vcf file with header like this:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA0000

How can I extract (with bcftools) and save/print these key-value pairs to a csv csv file with python. My expected output would look something like:

fileformat,VCFv4.0
fileDate,20090805

etc...

Thanks for the anticipated response

genome SNP next-gen-sequencing • 10k views
ADD COMMENT
1
Entering edit mode

Hello tinyskinn ,

what is your final goal? When working with python on vcf files, you should take one of the existing modules like pysam or cyvcf2.

fin swimmer

ADD REPLY
0
Entering edit mode

Hello Fin, my final goal is to be able to have an annotated metadata as in enter link description here. And this would require me to automate the extraction of the header Key-Value info. I am trying to implement this in enter link description here. And thank you for the suggestions.

ADD REPLY
1
Entering edit mode

I don't understand how your expected output is comma separated for the first line, but separated by a = for the second line.

Also, what have you tried to solve this issue? Why do you want to use bcftools and python, exactly? Mentioning these suggest that you at least have an idea which tools/language can be used.

ADD REPLY
0
Entering edit mode

regarding your first question, I have corrected the error. Thanks! to your second question, I actually do not have to use bcftools or python. From my search, they were the ones that came up. And my goal is explained my above answer to @finswimmer

ADD REPLY
1
Entering edit mode
6.0 years ago

use classic shell for that (not tested though):

cat file.vcf | grep '^##' | sed 's/=/\,/g' | sed 's/#//g' > header.txt
ADD COMMENT
0
Entering edit mode

It returned the original header including the ''##'. How about if the vcf is also .vcf.gz ?

ADD REPLY
0
Entering edit mode

I added an additional sed. But I didn't try it though.

ADD REPLY
0
Entering edit mode

For .gz use gzip -d -c instead of cat

ADD REPLY
0
Entering edit mode

Or just zgrep instead of gzip+grep.

ADD REPLY
0
Entering edit mode

I'd use grep '^#' instead to capture the header for the table

ADD REPLY
0
Entering edit mode
2.3 years ago
tweep • 0

This will do :

tabix -H sample.gvcf.gz 
ADD COMMENT

Login before adding your answer.

Traffic: 1700 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6