Question

Turning VCF INFO Column into a dataframe

5.2 years ago

basay3 • 0

Is there a quick tool to turn the data within the INFO column into a data frame? I have been using scikit-allel for most of my data extraction from the .vcf file, but I cannot figure out how to do the same for the INFO column. Currently I am manually writing recursive functions to pull the data, and I can't help but feel there is something already out there. Thanks!

VCF Scikit-Allel SNP next-gen-sequencing • 6.1k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 5.2 years ago by basay3 • 0

If you are familiar with R, use vcfR, which extracts the INFO column into a clean data frame.

ADD REPLY • link 5.2 years ago by venu 7.1k

Try vcftotsv and extract the columns you need. @https://www.biostars.org/u/65669/

ADD REPLY • link 5.2 years ago by cpad0112 21k

score 4 · Answer 1 · 2020-03-21

5.2 years ago

liorglic ★ 1.5k

In Python you can use the PyVCF package. It parses VCF files and returns each record's INFO data as a dict, so converting it to a pandas DataFrame should be fairly simple.
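As a rough sketch of that conversion (not checked against a particular PyVCF release): PyVCF's vcf.Reader yields records that expose CHROM, POS and an INFO dict, so each record can be flattened into one row. The helper name records_to_rows and the stand-in records below are hypothetical; with PyVCF installed you would pass vcf.Reader(filename="your.vcf") instead of the demo list.

```python
from types import SimpleNamespace


def records_to_rows(records):
    """Flatten records with CHROM, POS and an INFO dict into plain dicts."""
    rows = []
    for rec in records:
        row = {"CHROM": rec.CHROM, "POS": rec.POS}
        row.update(rec.INFO)  # one key per INFO field
        rows.append(row)
    return rows


# Stand-in records so the sketch runs without PyVCF or a VCF file.
# Assumption: real PyVCF records carry the same three attributes.
demo_records = [
    SimpleNamespace(CHROM="1", POS=100, INFO={"DP": 24, "AF1": 1.0}),
    SimpleNamespace(CHROM="1", POS=250, INFO={"DP": 2, "INDEL": True}),
]
rows = records_to_rows(demo_records)
```

Passing the result to pd.DataFrame(rows) should then give one column per INFO key, with NaN wherever a record lacks that key.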

ADD COMMENT • link 5.2 years ago by liorglic ★ 1.5k

score 0 · Answer 2 · 2022-04-06

If you're looking for a concise solution, this worked for me:

info_strings = '{"' + vcf_df.INFO.str.split(';').str.join('","').str.replace('=', '":"', regex=False).str.replace("\"\",", "", regex=False) + '"}'
info_df = pd.json_normalize(info_strings.apply(json.loads))  # needs `import json`; assumes pandas is imported as pd

The first line builds a pd.Series in which each element is the string representation of a dictionary. To make an INFO entry look like a dictionary, it splits on semicolons, rejoins with commas, and converts each "=" to ":". To cope with multi-valued entries such as PBR=1,3,6,7, every dictionary value is kept as a string, which is why extra quotes are inserted around the separators (convert to ints, floats, or lists later). The .str.replace("\"\",", "") step removes empty entries, which show up as "", in the intermediate string. Lastly, curly braces are added on the outside.

The result of the first line should look something like this:

0       {"DP":"24","VDB":"3.760647e-03","RPB":"-4.6349...
1       {"DP":"2","VDB":"5.960000e-02","AF1":"1","AC1"...
2       {"DP":"2","VDB":"7.200000e-02","AF1":"1","AC1"...

The second line parses each string into a dictionary object. Since the dictionaries are valid JSON, pd.json_normalize() can then create one column for each unique key found across them.

Don't forget to convert columns of interest to the correct dtype with something similar to this:

info_df.AF1 = info_df.AF1.astype(float)
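One caveat: the string-rewriting approach assumes every INFO entry has the form KEY=VALUE. Flag entries (a key with no "=", which the VCF format allows) would not produce a valid dictionary. Here is a minimal sketch of a variant that splits each INFO string directly; the function name parse_info and the sample strings are made up for illustration.

```python
def parse_info(info):
    """Turn one INFO string such as 'DP=24;INDEL;PBR=1,3' into a dict."""
    fields = {}
    if info in ("", "."):  # '.' marks an empty INFO column
        return fields
    for entry in info.split(";"):
        if not entry:  # skip empty entries, e.g. from ';;'
            continue
        key, sep, value = entry.partition("=")
        # Entries without '=' are flags; record them as True.
        fields[key] = value if sep else True
    return fields


# Hypothetical INFO strings; values stay strings, as in the answer above.
samples = ["DP=24;VDB=3.760647e-03", "DP=2;AF1=1;INDEL", "."]
rows = [parse_info(s) for s in samples]
```

With a data frame like the one above, pd.DataFrame(vcf_df.INFO.map(parse_info).tolist()) should give the same kind of table, and the dtype conversion step still applies.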

score 0 · Answer 3 · 2022-04-06

This is what I usually do.

First, strip the meta-information header from the VCF. Each of those header lines begins with '##', so you can remove them from the terminal with:

grep -v '^##' your.vcf > no_header.vcf

Then you can read the file in RStudio:

data_frame <- as.data.frame(read.csv('no_header.vcf', header=TRUE, sep='\t'))
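For anyone staying in Python, the same two steps (drop the '##' lines, then read the rest as a tab-separated table) can be sketched with the standard library alone. The function name read_vcf_rows and the tiny in-memory VCF are hypothetical; note that this leaves INFO as a single string column, just like the R version.

```python
import csv
import io


def read_vcf_rows(handle):
    """Skip '##' meta lines and read the remainder as tab-separated rows."""
    body = (line for line in handle if not line.startswith("##"))
    # The '#CHROM ...' line is kept and becomes the column header.
    return list(csv.DictReader(body, delimiter="\t"))


# A made-up two-line VCF held in memory so the sketch runs without a file.
demo_vcf = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t100\t.\tA\tG\t50\tPASS\tDP=24;AF1=1\n"
)
rows = read_vcf_rows(demo_vcf)
```

For a real file you would pass open('your.vcf') instead of the in-memory sample, and pd.DataFrame(rows) turns the result into a data frame.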

Hope it helps.

Good luck :)