Question

parsing vcf file

0

Entering edit mode

6.8 years ago

miss • 0

I have VCF file and part of it looks like this:

;Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;

I need to extract Gene.refGene=NONE,DDX11L1 which is between semicolons, I also need to extract ExonicFunc.refGene=. and AAChange.refGene=. which are all also between semicolons.

I tried to do it like this:

import sys
import re


def parse_vcf(vcf_file):
    pattern=re.compile(r'"([^;]*)"' , 'Gene.refGene')
    f=open(vcf_file , 'r')
    for line in f:
        if pattern.search(line):
            continue
    return

if __name__ == '__main__':
    vcf=sys.argv[1]
    parse_vcf(vcf)

but it is not working. thank you for your suggestions.``

vcf python • 7.0k views

ADD COMMENT • link updated 2.7 years ago by platon.work • 0 • written 6.8 years ago by miss • 0

1

Entering edit mode

6.8 years ago

ATpoint 85k

You want to extract from column 8, so the INFO column right? If so, first get vcftools, and then use vcf-query:

vcf-query -f '%INFO/Gene.refGene\n' in.vcf

Same goes for the other information you want. \n is the delimiter in the output, and can be changed into any delim you want.

ADD COMMENT • link 6.8 years ago by ATpoint 85k

0

Entering edit mode

Thank you so much, it is working but sadly I have to do it with python script..

ADD REPLY • link 6.8 years ago by miss • 0

1

Entering edit mode

Then maybe have a look at PyVCF

ADD REPLY • link 6.8 years ago by ATpoint 85k

1

Entering edit mode

6.8 years ago

erwan.scaon ▴ 950

If you don't stick with Python, I'll suggest awk :

echo ';Gene.refGene=NONE,DDX11L1;GeneDetail.refGene=dist\x3dNONE\x3bdist\x3d1826;ExonicFunc.refGene=.;AAChange.refGene=.;' > test.txt;
awk -F ";" '{print $2, $4, $5}' test.txt;

Gene.refGene=NONE,DDX11L1 ExonicFunc.refGene=. AAChange.refGene=.

ADD COMMENT • link 6.8 years ago by erwan.scaon ▴ 950

0

Entering edit mode

Thank you for suggestion, that way it would be much easier :)

ADD REPLY • link 6.8 years ago by miss • 0

0

Entering edit mode

2.7 years ago

platon.work • 0

The task can be solved with the high-perf-bio toolkit.

Upload your VCFs to DB, specifying only the source directory:

python create.py -S /path/to/biostars_299866_VCFs

Create a file with query in a separate directory (for example, /path/to/biostars_299866_queries/parse_INFO.txt). Because in your case there are dots in the field names, and this conflicts with MongoDB's dot notation, the query will unfortunately be cumbersome:

{'$expr': {'$and': [{'$eq': [{'$getField': {'field': 'Gene.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, ['NONE', 'DDX11L1']]}, {'$eq': [{'$getField': {'field': 'ExonicFunc.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, '.']}, {'$eq': [{'$getField': {'field': 'AAChange.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, '.']}]}}

If you want to run the three queries separately, split them up into different lines:

{'$expr': {'$eq': [{'$getField': {'field': 'Gene.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, ['NONE', 'DDX11L1']]}}

{'$expr': {'$eq': [{'$getField': {'field': 'ExonicFunc.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, '.']}}

{'$expr': {'$eq': [{'$getField': {'field': 'AAChange.refGene', 'input': {'$arrayElemAt': ['$INFO', 0]}}}, '.']}}

Perform the query(-ies):

python query.py -S /path/to/biostars_299866_queries -D biostars_299866_VCFs -T /path/to/out

ADD COMMENT • link 2.7 years ago by platon.work • 0

score 6 · Accepted Answer · 2018-02-21

import sys
import re

def parse_vcf(vcf_file):
    ###Open your file
    with open(vcf_file, 'r') as vcf_f:
        for line in vcf_f:
            ###Skip metadata lines
            if line[0] != '#':
                ###Split the line by "tab" to keep info field (for me it's the 8th column, so choose the index = 7, I don't remember if it's always the 8th, change this if needed)
                info_field_line = line.split("\t")[7]
                ###Split the info line by ";"
                info_field_line_array = info_field_line.split(";")
                ###For each line of your VCF, create a dictionnary with this array key : info, value : value of this info
                dict_info={}
                for i in info_field_line_array:
                    ###Just looking for line with "=" character (as key = value)
                    if "=" in i:
                        ###Left from equal sign is key (Gene.refGene, ExonicFunc.refGene...)
                        key = i.split("=")[0]
                        ###Right from equal sign is value (RBL1,synonymous_SNV...)
                        value = i.split("=")[1]
                        ###Put them in a dictionnary
                        dict_info[key]=value
                ###After dictionnary creation, you can have an access to all your features (key), just by their name as follow
                my_GeneRefGene = dict_info['Gene.refGene']
                my_ExonicFuncRefGene = dict_info['ExonicFunc.refGene']
                my_AAChangeRefGene = dict_info['AAChange.refGene']

                print(my_GeneRefGene)
                print(my_ExonicFuncRefGene)
                print(my_AAChangeRefGene)
                ###This is the result for the first line, you can save the data in array to use it later or process line by line as you wish

if __name__ == '__main__':
    vcf=sys.argv[1]
    parse_vcf(vcf)

score 3 · Accepted Answer · 2018-02-21

3

Entering edit mode

6.8 years ago

finswimmer 16k

First:

but it is not working

isn't a good error description.

Second:

Some people, when confronted with a problem, think “I know, I'll use regular expressions.” Now they have two problems.

:)

I would do something like this:

split each vcf line by the column delimiter (tab)
take the column that contains the information you like to extract and split it by ";"
take the result of (2) and split it by "=". Store it in a dictionary. So you can just return result["Gene.refGene"] or any other value you like for the line.

fin swimmer

ADD COMMENT • link 6.8 years ago by finswimmer 16k

0

Entering edit mode

Thank you very much and sorry for bad error description

ADD REPLY • link 6.8 years ago by miss • 0