Question

Filtering VCF with python

0

Entering edit mode

6.8 years ago

miss • 0

Hi, I extracted informations that I need from my vcf file and one line of it looks like this:

X 126165205 . T G DCAF12L2 synonymous_SNV p.P240P 0.4172 0.4089 0.543046 1/1 25 0 25 1/1 20 0 20

the line represents respectively: chromosome, position, ID, ref, alt, gene name, exonic function, AAchange (just last value), gnomAD_genome_ALL, ExAC_ALL, 1000g2015aug_all, GT, DP, RD, AD first in normal then tumor. Now I need to make a python script which will take as an input this vcf file and a list of mutations (exonic function) that i have in file. Based on which mutation I want it will give me just that line of file. Then I need to put threshold which will be lower then a maximum number that I have among gnomAD_genome_ALL, ExAC_ALL, 1000g2015aug_all. Sum (depth) of RD and AD, ratio of RD and AD and block (tumor or normal). Script has to be general so I can change variables.. Thank you in advance

vcf python • 6.7k views

ADD COMMENT • link updated 6.8 years ago by Kevin Blighe 88k • written 6.8 years ago by miss • 0

1

Entering edit mode

what have you written so far ?

ADD REPLY • link 6.8 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

You can rely on commented python scripts made in C: parsing vcf file and A: VCF file help to make your own VCF parser.

ADD REPLY • link 6.8 years ago by Bastien Hervé 6.0k

0

Entering edit mode

This sounds like homework, but PyVCF will help you parse VCF files and get the info you need.

ADD REPLY • link 6.8 years ago by jared.andrews07 ★ 18k

score 3 · Answer 1 · 2018-03-11

You may take inspiration from this Python script that I wrote last year. This just splits multi-allelic sites in the same way as bcftools norm -m-any. Take a look at what it does and then edit it to suit your own needs.

Execute it as

python Split.py [Input VCF] [Output VCF]

.

#Name:          Kevin Blighe
#Email:         NA
#Date:          20th July 2017
#Function(s):   Splits multiallelels

import sys
import os

#Check the number of command line arguments
if not len(sys.argv)==3:
    print "\nError:\tincorrect number of command-line arguments"
    print "Syntax:\tSplit.py [Input VCF] [Output VCF]\n"
    sys.exit()

if sys.argv[1]==sys.argv[2]:
    #boolOverwrite = raw_input("\nInput file is the same as the output file - overwrite input file (sim/nao)?\n")
    print "Error:\tInput file is the same as the output file - choose a different output file\n"
    sys.exit()

#File input
fileInput = open(sys.argv[1], "r")

#File output for split multi-alleles
fileOutput = open(sys.argv[2], "w")

#Loop through each line in the input file, and split multiallelic sites
print ("Splitting multi-allelic sites...")
for strLine in fileInput:
    #Strip the endline character from each input line
    strLine = strLine.rstrip("\n")

    #The '#' character in VCF format indicates that the line is a header. Ignore these and just output to the new file
    if strLine.startswith("#"):
        fileOutput.write(strLine + "\n")
    else:
        #Split the tab-delimited line into an array
        strArray = [splits for splits in strLine.split("\t") if splits is not ""]

        #Check first if it's multiallelic
        #Multi-allelic variants will have 2 calls in the VAR field, separated by a comma (',')
        if "," in strArray[4]:
            strVars = [splits for splits in strArray[4].split(",") if splits is not ""]
            iNumMultialleles = len(strVars)

            for i in range(0, (iNumMultialleles)):
                strArray[4] = strVars[i]
                fileOutputTemp.write("\t".join(strArray) + "\n")
        else:
            fileOutput.write("\t".join(strArray) + "\n")
print ("Done.")

#Close the files
fileInput.close()
fileOutput.close()