I have a test vcf file and I want to read and extract AD and DP values from the sample column of 2 different populations. I am trying this:
import os
import pandas as pd
def read_vcf(path1):
with open(path1, 'r') as f:
lines = [l for l in f if not l.startswith('##')]
return pd.read_csv(
io.StringIO(''.join(lines)),
dtype={'#CHROM': str, 'POS': int, 'ID': str, 'REF': str, 'ALT': str,
'QUAL': str, 'FILTER': str, 'INFO': str},
sep='\t'
).rename(columns={'#CHROM': 'CHROM'})
path1 = "C://Users//USER//Desktop//anas/VCFs_1/test_1.vcf"
file =read_vcf(path1)
pop1 = file[["NEN_001","NEN_003","NEN_200","NEN_300","LAB_004","LAB_300","LAB_400","LAB_500"]]
pop2 = file[["ROT_004","ROT_006", "ROT_007","ROT_013" ,"SKN_001" ,"SKN_002","SKN_005","SKN_008"]]
First, I need to extract AD and then DP from pop1 and then I need to divide AD by DP to get AF1 values, after that, I need to do the same process to calculate AF2. After getting both AF1 and AF2 I need to use the AF1 - AF2 formula to calculate the allele frequency difference.
Can anyone guide me?
I would actually just use the pysam package instead of writing your own parser. The VCF format is very complicated and it is not recommended to parse the information yourself. It seems like sbstevenlee 's package also uses pysam in the background to parse the VCF files so that's a good thing in regards to his package.
You could also use bcftools' query command to get a tab delimited output and parse that with python:
Note that this is just the first 3 rows