parsing vcf file
5
0
Entering edit mode
6.8 years ago
miss • 0

I have VCF file and part of it looks like this:

;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;

I need to extract Gene.refGene=NONE,DDX11L1 which is between semicolons, I also need to extract ExonicFunc.refGene=. and AAChange.refGene=. which are all also between semicolons.

I tried to do it like this:

import sys
import re


def parse_vcf(vcf_file):
    pattern=re.compile(r'"([^;]*)"' , 'Gene.refGene')
    f=open(vcf_file , 'r')
    for line in f:
        if pattern.search(line):
            continue
    return

if __name__ == '__main__':
    vcf=sys.argv[1]
    parse_vcf(vcf)

but it is not working. thank you for your suggestions.``

vcf python • 6.9k views
ADD COMMENT
6
Entering edit mode
6.8 years ago
import sys
import re

def parse_vcf(vcf_file):
    ###Open your file
    with open(vcf_file, 'r') as vcf_f:
        for line in vcf_f:
            ###Skip metadata lines
            if line[0] != '#':
                ###Split the line by "tab" to keep info field (for me it's the 8th column, so choose the index = 7, I don't remember if it's always the 8th, change this if needed)
                info_field_line = line.split("\t")[7]
                ###Split the info line by ";"
                info_field_line_array = info_field_line.split(";")
                ###For each line of your VCF, create a dictionnary with this array key : info, value : value of this info
                dict_info={}
                for i in info_field_line_array:
                    ###Just looking for line with "=" character (as key = value)
                    if "=" in i:
                        ###Left from equal sign is key (Gene.refGene, ExonicFunc.refGene...)
                        key = i.split("=")[0]
                        ###Right from equal sign is value (RBL1,synonymous_SNV...)
                        value = i.split("=")[1]
                        ###Put them in a dictionnary
                        dict_info[key]=value
                ###After dictionnary creation, you can have an access to all your features (key), just by their name as follow
                my_GeneRefGene = dict_info['Gene.refGene']
                my_ExonicFuncRefGene = dict_info['ExonicFunc.refGene']
                my_AAChangeRefGene = dict_info['AAChange.refGene']

                print(my_GeneRefGene)
                print(my_ExonicFuncRefGene)
                print(my_AAChangeRefGene)
                ###This is the result for the first line, you can save the data in array to use it later or process line by line as you wish

if __name__ == '__main__':
    vcf=sys.argv[1]
    parse_vcf(vcf)
ADD COMMENT
0
Entering edit mode

Thank you so much it is perfect! And I understand, thank you for comments also!!

ADD REPLY
3
Entering edit mode
6.8 years ago

First:

but it is not working

isn't a good error description.

Second:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

:)

I would do something like this:

  1. split each vcf line by the column delimiter (tab)
  2. take the column that contains the information you like to extract and split it by ";"
  3. take the result of (2) and split it by "=". Store it in a dictionary. So you can just return result["Gene.refGene"] or any other value you like for the line.

fin swimmer

ADD COMMENT
0
Entering edit mode

Thank you very much and sorry for bad error description

ADD REPLY
1
Entering edit mode
6.8 years ago
ATpoint 85k

You want to extract from column 8, so the INFO column right? If so, first get vcftools, and then use vcf-query:

vcf-query -f '%INFO/Gene.refGene\n' in.vcf

Same goes for the other information you want. \n is the delimiter in the output, and can be changed into any delim you want.

ADD COMMENT
0
Entering edit mode

Thank you so much, it is working but sadly I have to do it with python script..

ADD REPLY
1
Entering edit mode

Then maybe have a look at PyVCF

ADD REPLY
1
Entering edit mode
6.8 years ago
erwan.scaon ▴ 950

If you don't stick with Python, I'll suggest awk :

echo ';Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;' > test.txt;
awk -F ";" '{print $2, $4, $5}' test.txt;

Gene.refGene=NONE,DDX11L1 ExonicFunc.refGene=. AAChange.refGene=.

ADD COMMENT
0
Entering edit mode

Thank you for suggestion, that way it would be much easier :)

ADD REPLY
0
Entering edit mode
2.7 years ago

The task can be solved with the high-perf-bio toolkit.

Upload your VCFs to DB, specifying only the source directory:

python create.py -S /path/to/biostars_299866_VCFs

Create a file with query in a separate directory (for example, /path/to/biostars_299866_queries/parse_INFO.txt). Because in your case there are dots in the field names, and this conflicts with MongoDB's dot notation, the query will unfortunately be cumbersome:

{'$expr': {'$and': [{'$eq': [{'$getField': {'field': 'Gene.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, ['NONE', 'DDX11L1']]}, {'$eq': [{'$getField': {'field': 'ExonicFunc.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, '.']}, {'$eq': [{'$getField': {'field': 'AAChange.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, '.']}]}}

If you want to run the three queries separately, split them up into different lines:

{'$expr': {'$eq': [{'$getField': {'field': 'Gene.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, ['NONE', 'DDX11L1']]}}

{'$expr': {'$eq': [{'$getField': {'field': 'ExonicFunc.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, '.']}}

{'$expr': {'$eq': [{'$getField': {'field': 'AAChange.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, '.']}}

Perform the query(-ies):

python query.py -S /path/to/biostars_299866_queries -D biostars_299866_VCFs -T /path/to/out
ADD COMMENT

Login before adding your answer.

Traffic: 2307 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6