If you're looking for a concise solution, this worked for me:
info_strings = '{"' + vcf_df.INFO.str.split(';').str.join('","').str.replace('=','":"').str.replace("\"\",", "") + '"}'
info_df = pd.json_normalize(info_strings.apply(eval))
The first line builds a pd.Series in which each element is the string representation of a dictionary. To make an INFO entry look like a dictionary, you split on semicolons, rejoin with commas, then convert "=" to ":". To handle PBR=1,3,6,7 and other multi-valued entries in the VCF, every dictionary value is converted to a string; that is what the extra quotes around the separators are for (convert to ints, floats, or lists later). The .str.replace("\"\",", "") bit handles possible missing values in the VCF, which show up as "", in the joined string. Lastly, curly braces are added on the outside.
The result of the first line should look something like this:
0 {"DP":"24","VDB":"3.760647e-03","RPB":"-4.6349...
1 {"DP":"2","VDB":"5.960000e-02","AF1":"1","AC1"...
2 {"DP":"2","VDB":"7.200000e-02","AF1":"1","AC1"...
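If it helps to see the string surgery one step at a time, here is a rough plain-Python sketch of the same steps on a single INFO value. The INFO string is made up for illustration (the empty field between the two semicolons stands in for a missing entry), and the variable names are mine, not part of the answer's code.

```python
import json

# Hypothetical INFO value; the empty field between ";;" mimics a missing entry.
info = "DP=24;;VDB=3.760647e-03;PBR=1,3,6,7"

fields = info.split(";")             # split on semicolons
joined = '","'.join(fields)          # rejoin with quoted commas
keyed = joined.replace("=", '":"')   # turn key=value into key":"value
cleaned = keyed.replace('"",', "")   # drop the empty entry
as_text = '{"' + cleaned + '"}'      # add the outer braces and quotes

print(as_text)
# {"DP":"24","VDB":"3.760647e-03","PBR":"1,3,6,7"}

# The result is valid JSON, so it can be parsed into a dict of strings.
record = json.loads(as_text)
print(record["PBR"])
# 1,3,6,7
```

One thing to watch for: a flag with no "=" (INDEL, for example) never becomes a key:value pair, so a string containing one will not parse as a dictionary. You would need to handle flags separately before this step.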
The second line evaluates each string, producing a Series of dictionary objects. pd.json_normalize() then turns that Series into a DataFrame with one column for each unique key; a key that is missing from a record becomes NaN in that row.
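Since the strings built above are also valid JSON, json.loads should work as a drop-in for eval and avoids executing the text as code. A minimal sketch, assuming two made-up records in the same shape as the output shown earlier:

```python
import json
import pandas as pd

# Hypothetical strings in the shape produced by the first line.
info_strings = pd.Series([
    '{"DP":"24","VDB":"3.760647e-03"}',
    '{"DP":"2","VDB":"5.960000e-02","AF1":"1"}',
])

# Parse each string into a dict, then hand a plain list of dicts to pandas.
records = info_strings.apply(json.loads).tolist()
info_df = pd.json_normalize(records)

print(sorted(info_df.columns))
# ['AF1', 'DP', 'VDB']

# AF1 is absent from the first record, so that cell is missing.
print(info_df["AF1"].isna().tolist())
# [True, False]
```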
Don't forget to convert columns of interest to the correct dtype with something similar to this:
info_df.AF1 = info_df.AF1.astype(float)
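For columns with missing values or comma-separated lists, something along these lines might be a starting point. The frame below is invented for illustration; the column names just follow the examples above.

```python
import pandas as pd

# Hypothetical all-string frame, as json_normalize would return here.
info_df = pd.DataFrame({
    "DP": ["24", "2"],
    "AF1": ["1", None],
    "PBR": ["1,3,6,7", None],
})

# No missing values in DP, so a plain integer cast works.
info_df["DP"] = info_df["DP"].astype(int)

# to_numeric keeps missing entries as NaN instead of raising.
info_df["AF1"] = pd.to_numeric(info_df["AF1"])

# Multi-valued entries become lists of strings; missing entries stay missing.
info_df["PBR"] = info_df["PBR"].str.split(",")

print(info_df["PBR"].iloc[0])
# ['1', '3', '6', '7']
```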
If you are familiar with R, use vcfR; it extracts the INFO column into a clean data frame.
Try vcftotsv and extract the columns you need. @https://www.biostars.org/u/65669/