Turning VCF INFO Column into a dataframe
4.8 years ago
basay3 • 0

Is there a quick tool to turn the data within the INFO column into a data frame? I have been using scikit-allel for much of my data extraction from the .vcf file, but I cannot figure out how to do the same for the INFO column. Currently I am manually writing recursive functions to pull the data, and I can't help but feel there is something already out there. Thanks!

VCF Scikit-Allel SNP next-gen-sequencing • 5.6k views

If you are familiar with R, use vcfR; it extracts the INFO column into a clean data frame.


Try vcftotsv and extract the columns you need. @https://www.biostars.org/u/65669/

4.8 years ago
liorglic ★ 1.5k

In Python you can use the PyVCF package. It parses VCF files and returns the INFO data of each record as a dict, so it should be fairly simple to convert to a pandas data frame.
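
A minimal sketch of that dict-to-DataFrame step, assuming each record exposes CHROM, POS and an INFO dict as PyVCF records do. The helper name info_to_frame and the stand-in records are hypothetical and only there so the example runs without a VCF file; the commented lines show how real PyVCF records would presumably be passed in.

```python
from types import SimpleNamespace

import pandas as pd


def info_to_frame(records):
    """Build a DataFrame with one row per record and one column per INFO key.

    `records` is any iterable of objects exposing CHROM, POS and an INFO dict.
    """
    rows = []
    for rec in records:
        row = {"CHROM": rec.CHROM, "POS": rec.POS}
        row.update(rec.INFO)  # INFO keys become columns; absent keys end up as NaN
        rows.append(row)
    return pd.DataFrame(rows)


# With PyVCF installed, the records would come from something like:
#   import vcf
#   info_df = info_to_frame(vcf.Reader(filename="your.vcf"))
# Stand-in records (made-up values) so the sketch runs on its own:
demo_records = [
    SimpleNamespace(CHROM="chr1", POS=100, INFO={"DP": 24, "AF1": 0.5}),
    SimpleNamespace(CHROM="chr1", POS=250, INFO={"DP": 2, "INDEL": True}),
]
demo_df = info_to_frame(demo_records)
```

Keys that are missing from a record simply show up as NaN in that row, so records with different INFO fields can share one frame.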

2.7 years ago

If you're looking for a concise solution, this worked for me:

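import pandas as pd  # vcf_df is assumed to be a DataFrame with an INFO column
# Build one dict-like string per row, then evaluate it (eval is only safe on trusted files)
info_strings = '{"' + vcf_df.INFO.str.split(';').str.join('","').str.replace('=','":"').str.replace("\"\",", "") + '"}'
info_df = pd.json_normalize(info_strings.apply(eval))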

The first line builds a pd.Series in which each element is the string representation of a dictionary. To make each INFO entry look like a dictionary, it splits on semicolons, rejoins with commas, and converts "=" to ":". To handle multi-valued fields such as PBR=1,3,6,7, every dictionary value is kept as a string, which is what the extra quotes around the separators are for (convert to ints, floats, or lists later). The .str.replace("\"\",", "") call removes empty entries that would otherwise appear as "",. Lastly, curly braces are added on the outside.

The result of the first line should look something like this:

0       {"DP":"24","VDB":"3.760647e-03","RPB":"-4.6349...
1       {"DP":"2","VDB":"5.960000e-02","AF1":"1","AC1"...
2       {"DP":"2","VDB":"7.200000e-02","AF1":"1","AC1"...

The second line evaluates each string, producing a Series of dictionary objects. Since those dictionaries are JSON-like, pd.json_normalize() can then make a new column for each unique key.
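
For files you do not fully trust, or files containing flag fields with no "=" (a bare key such as INDEL does not form a valid key/value pair in the string-building approach), the same split-on-semicolon idea can be sketched without eval. The function name parse_info and the sample strings are hypothetical, and storing flags as True is just one possible convention.

```python
def parse_info(info):
    """Turn one INFO string such as 'DP=24;AF1=1;INDEL' into a dict of strings."""
    parsed = {}
    for field in info.split(";"):
        if not field or field == ".":
            continue  # skip empty fields and the VCF missing-value marker
        key, sep, value = field.partition("=")
        # A key without '=' is a flag (e.g. INDEL); record it as True.
        parsed[key] = value if sep else True
    return parsed


# Made-up INFO strings for illustration.
info_column = ["DP=24;VDB=3.760647e-03;PBR=1,3,6,7", "DP=2;AF1=1;INDEL", "."]
info_dicts = [parse_info(s) for s in info_column]
```

With a DataFrame, something like pd.DataFrame(vcf_df.INFO.apply(parse_info).tolist()) should then give one column per key, with values still stored as strings.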

Don't forget to convert columns of interest to the correct dtype with something similar to this:

info_df.AF1 = info_df.AF1.astype(float)
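
When several columns need converting, or some rows lack a value, a slightly more defensive variant may help. This is only a sketch on made-up values: the column names DP, AF1 and PBR are taken from the examples in this answer, and which fields are single-valued versus comma-separated is an assumption you would check against your own VCF header.

```python
import pandas as pd

# Made-up INFO values, all strings, as the approach above produces them.
info_df = pd.DataFrame({
    "DP": ["24", "2", "2"],
    "AF1": ["0.5", "1", None],
    "PBR": ["1,3,6,7", None, "2,4"],
})

# Single-valued numeric fields: unparseable or missing entries become NaN.
for col in ["DP", "AF1"]:
    info_df[col] = pd.to_numeric(info_df[col], errors="coerce")

# Comma-separated fields: split into lists of floats, leaving missing values alone.
info_df["PBR"] = info_df["PBR"].apply(
    lambda s: [float(x) for x in s.split(",")] if isinstance(s, str) else s
)
```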
2.7 years ago

This is what I usually do.

First you have to remove the meta-information header of the VCF. Each of those lines begins with '##', so you can strip them from the terminal with:

grep -v '^##' your.vcf > no_header.vcf

Then you can read the file in R (for example in RStudio):

data_frame <- read.csv('no_header.vcf', header=TRUE, sep='\t')
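
For readers staying in Python, the same two steps (drop the '##' lines, read the rest as a tab-separated table) can be sketched with pandas. The function name read_vcf_body and the tiny demo file are hypothetical, and the sketch assumes an uncompressed VCF.

```python
import io
import os
import tempfile

import pandas as pd


def read_vcf_body(path):
    """Read the tab-separated body of an uncompressed VCF, skipping '##' lines."""
    with open(path) as handle:
        # Keep the '#CHROM ...' column-header line; drop only '##' meta lines.
        body = [line for line in handle if not line.startswith("##")]
    return pd.read_csv(io.StringIO("".join(body)), sep="\t")


# Tiny made-up VCF so the sketch runs on its own.
demo_text = (
    "##fileformat=VCFv4.2\n"
    "##INFO=<ID=DP,Number=1,Type=Integer,Description=\"Depth\">\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=24;AF1=0.5\n"
    "chr1\t250\t.\tC\tT\t30\tPASS\tDP=2;AF1=1\n"
)
with tempfile.NamedTemporaryFile("w", suffix=".vcf", delete=False) as tmp:
    tmp.write(demo_text)
vcf_df = read_vcf_body(tmp.name)
os.remove(tmp.name)
```

Either way, the INFO column is still a single string per row at this point and needs to be split further.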

Hope it helps.

Good luck :)
