Hello! I want to make a VCF file with a header line like "#CHROM POS REF ALT". Is it possible to create such a VCF file from scratch with Python?
Thanks in advance.
Assuming you're using Python, you can start with a template like so (make sure your chromosome lengths are correct for your assembly):
##fileformat=VCFv4.1
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chrM,length=16571>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
#CHROM POS ID REF ALT QUAL FILTER INFO Path
%CHROMOSOME %POSITION %ID %REF %ALT . . .
Then read this file in line by line. Keep the lines in a list, take the last line as the row "template", and remove it from the list. Parse the template for its placeholders, then read in your variant data as a list of tuples. Pair each tuple with the placeholders and substitute the values. Stub example:
output = variant_template  # the last line of the template file, as a string
# Placeholders to replace. You could also derive these by splitting the template row.
PLACEHOLDERS = ["%" + name for name in "CHROMOSOME,POSITION,REF,ALT".split(",")]
# Pair placeholders with the variant data (assumed to be in the same order).
for placeholder, value in zip(PLACEHOLDERS, variant_tuple):
    output = output.replace(placeholder, str(value))  # str() so an integer position works
output = output.replace("%ID", ".")  # no ID in the data, so use the missing-value marker
Append output to the file. Do this for each variant.
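Putting the steps above together, a minimal sketch might look like the following. The function name `fill_template`, the shortened in-memory template, and the sample variants are hypothetical and only for illustration. The columns are assumed to be separated by tabs, since VCF data lines are tab-delimited.

```python
# Hypothetical, shortened template; a real one would list every contig.
TEMPLATE = "\n".join([
    "##fileformat=VCFv4.1",
    "##contig=<ID=chr1,length=249250621>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "%CHROMOSOME\t%POSITION\t%ID\t%REF\t%ALT\t.\t.\t.",
])

PLACEHOLDERS = ["%CHROMOSOME", "%POSITION", "%REF", "%ALT"]


def fill_template(template_text, variants):
    """Return VCF text: the header lines plus one filled-in row per variant."""
    lines = template_text.splitlines()
    header, row_template = lines[:-1], lines[-1]  # last line is the row template
    rows = []
    for variant in variants:  # each variant is (chrom, pos, ref, alt)
        row = row_template
        for placeholder, value in zip(PLACEHOLDERS, variant):
            row = row.replace(placeholder, str(value))
        rows.append(row.replace("%ID", "."))  # no ID available: missing value
    return "\n".join(header + rows) + "\n"


# Made-up example variants.
variants = [("chr1", 12345, "A", "G"), ("chr1", 67890, "CT", "C")]
vcf_text = fill_template(TEMPLATE, variants)
```

Writing `vcf_text` to a file opened in text mode then gives you the finished VCF.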
A VCF file is a plain text file that follows the rules of the VCF specification. As long as you respect those rules, you can create the file however you want.
Be careful: many tools are satisfied if the VCF contains just the one header line holding the column names: #CHROM POS ID REF ALT QUAL FILTER INFO
This is not enough for a truly valid VCF. For that, the header must also include:
##fileformat=VCFv4.3
##contig=<ID=chr1,length=249250621>
A definition of each field used in the INFO or FORMAT column, e.g.:
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
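As a rough illustration of these header requirements, the sketch below builds a small VCF as text using only the Python standard library. The helper name `build_vcf`, the sample records, and the DP values are made up for the example; the contig length is the one quoted above.

```python
def build_vcf(contigs, records):
    """Build VCF text with the meta-information lines described above.

    contigs: dict of contig name -> length
    records: iterable of (chrom, pos, ref, alt, depth) tuples
    """
    lines = ["##fileformat=VCFv4.3"]
    for name, length in contigs.items():
        lines.append(f"##contig=<ID={name},length={length}>")
    # Declare every INFO key that the data lines use.
    lines.append('##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">')
    # Column names and data fields are separated by tabs, not spaces.
    lines.append("\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]))
    for chrom, pos, ref, alt, depth in records:
        fields = [chrom, str(pos), ".", ref, alt, ".", ".", f"DP={depth}"]
        lines.append("\t".join(fields))
    return "\n".join(lines) + "\n"


vcf_text = build_vcf(
    {"chr1": 249250621},
    [("chr1", 12345, "A", "G", 30), ("chr1", 67890, "CT", "C", 12)],
)
```

The result can be written to disk with an ordinary `open(..., "w")` call. No FORMAT line is declared here because the example has no sample columns.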
If you take this into account right from the start, you will avoid problems when using different VCF tools later. bcftools in particular is very strict about the header values.
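To catch one common header problem early, you could run a quick pre-check such as the sketch below, which reports INFO keys that appear in data lines without a matching ##INFO definition. The function name `undeclared_info_keys` and the example text are hypothetical. This is only a rough sanity check under simple assumptions (tab-delimited lines, at least eight columns), not a reimplementation of what bcftools verifies.

```python
import re


def undeclared_info_keys(vcf_text):
    """Return INFO keys used in data lines but not declared in the header."""
    declared, used = set(), set()
    for line in vcf_text.splitlines():
        if line.startswith("##INFO="):
            match = re.match(r"##INFO=<ID=([^,>]+)", line)
            if match:
                declared.add(match.group(1))
        elif line and not line.startswith("#"):
            info = line.split("\t")[7]  # INFO is the eighth column
            if info != ".":
                # Keys may be flags (no "=") or key=value pairs.
                used.update(item.split("=", 1)[0] for item in info.split(";"))
    return used - declared


# Made-up example: AF is used in the data line but never declared.
example = "\n".join([
    "##fileformat=VCFv4.3",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t12345\t.\tA\tG\t.\t.\tDP=30;AF=0.5",
])
missing = undeclared_info_keys(example)
```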
When working with Python, you could consider using one of the available modules for reading and writing VCF files, such as pysam or cyvcf.
You can do it easily with the pandas package.
Certainly. But what advantage are you going to gain by doing that? Are you trying to simulate data?
I want to create a header that allows me to save some data that is not in the template CSV. So I want either to create a new CSV in which I can save those fields, or to modify a template in order to add them.
Yes, it is possible in Python, Perl, Java, etc. ;)
Please elaborate on what exactly you are trying to do.
I want to create a header that allows me to save some data that is not in the template CSV. So I want either to create a new CSV in which I can save those fields, or to modify a template in order to add them.