Filtering VCF with python
1
0
Entering edit mode
6.7 years ago
miss • 0

Hi, I extracted informations that I need from my vcf file and one line of it looks like this:

X 126165205 . T G DCAF12L2 synonymous_SNV p.P240P 0.4172 0.4089 0.543046 1/1 25 0 25 1/1 20 0 20

the line represents respectively: chromosome, position, ID, ref, alt, gene name, exonic function, AAchange (just last value), gnomAD_genome_ALL, ExAC_ALL, 1000g2015aug_all, GT, DP, RD, AD first in normal then tumor. Now I need to make a python script which will take as an input this vcf file and a list of mutations (exonic function) that i have in file. Based on which mutation I want it will give me just that line of file. Then I need to put threshold which will be lower then a maximum number that I have among gnomAD_genome_ALL, ExAC_ALL, 1000g2015aug_all. Sum (depth) of RD and AD, ratio of RD and AD and block (tumor or normal). Script has to be general so I can change variables.. Thank you in advance

vcf python • 6.6k views
ADD COMMENT
1
Entering edit mode

what have you written so far ?

ADD REPLY
0
Entering edit mode

You can rely on commented python scripts made in C: parsing vcf file and A: VCF file help to make your own VCF parser.

ADD REPLY
0
Entering edit mode

This sounds like homework, but PyVCF will help you parse VCF files and get the info you need.

ADD REPLY
3
Entering edit mode
6.7 years ago

You may take inspiration from this Python script that I wrote last year. This just splits multi-allelic sites in the same way as bcftools norm -m-any. Take a look at what it does and then edit it to suit your own needs.

Execute it as

python Split.py [Input VCF] [Output VCF]

.

#Name:          Kevin Blighe
#Email:         NA
#Date:          20th July 2017
#Function(s):   Splits multiallelels

import sys
import os

#Check the number of command line arguments
if not len(sys.argv)==3:
    print "\nError:\tincorrect number of command-line arguments"
    print "Syntax:\tSplit.py [Input VCF] [Output VCF]\n"
    sys.exit()

if sys.argv[1]==sys.argv[2]:
    #boolOverwrite = raw_input("\nInput file is the same as the output file - overwrite input file (sim/nao)?\n")
    print "Error:\tInput file is the same as the output file - choose a different output file\n"
    sys.exit()

#File input
fileInput = open(sys.argv[1], "r")

#File output for split multi-alleles
fileOutput = open(sys.argv[2], "w")

#Loop through each line in the input file, and split multiallelic sites
print ("Splitting multi-allelic sites...")
for strLine in fileInput:
    #Strip the endline character from each input line
    strLine = strLine.rstrip("\n")

    #The '#' character in VCF format indicates that the line is a header. Ignore these and just output to the new file
    if strLine.startswith("#"):
        fileOutput.write(strLine + "\n")
    else:
        #Split the tab-delimited line into an array
        strArray = [splits for splits in strLine.split("\t") if splits is not ""]

        #Check first if it's multiallelic
        #Multi-allelic variants will have 2 calls in the VAR field, separated by a comma (',')
        if "," in strArray[4]:
            strVars = [splits for splits in strArray[4].split(",") if splits is not ""]
            iNumMultialleles = len(strVars)

            for i in range(0, (iNumMultialleles)):
                strArray[4] = strVars[i]
                fileOutputTemp.write("\t".join(strArray) + "\n")
        else:
            fileOutput.write("\t".join(strArray) + "\n")
print ("Done.")

#Close the files
fileInput.close()
fileOutput.close()
ADD COMMENT

Login before adding your answer.

Traffic: 1960 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6