Right now I am running HISAT2 on the Homo sapiens hg38 SNP db from ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38_snp.tar.gz, which will produce 88 individual *.sam files (I have 88 samples) that I will then use to create vcf files.
Anyway, I want to get these vcf files into a form that I can use for some of my downstream pipelines. My question is: how can I get these vcf files into an (n = samples, m = SNPs) dimensional data matrix (preferably in Python or vcftools, but I am open to other tools or to writing my own method)? I have seen the term "genotyping matrix" in my Google searches; is this what I am trying to create? Apologies if this question is naive. I planned to create my own using pandas in Python but did not want to reinvent the wheel if one already exists.
I'm using Python 3.6.1 on OSX.
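To make the target concrete, here is a rough sketch of the kind of matrix I have in mind. It is not a tested pipeline. It assumes the 88 per-sample vcf files have already been merged into a single multi-sample vcf (for example with bcftools merge), that GT is the first FORMAT field, and that each genotype is encoded as the count of non-reference alleles (0, 1, or 2 for a diploid call, None when missing). The function names `gt_to_dosage` and `vcf_to_matrix` and the tiny in-memory vcf are made up for illustration.

```python
import io


def gt_to_dosage(sample_field):
    """Turn a sample field such as '0/1' or '1|1:12' into a non-reference
    allele count. Returns None when any allele is missing ('.')."""
    gt = sample_field.split(":")[0]           # assumes GT is the first FORMAT key
    alleles = gt.replace("|", "/").split("/")
    if any(a in (".", "") for a in alleles):
        return None
    # Any allele index other than 0 counts as non-reference (assumption:
    # multi-allelic sites are collapsed to "ref vs. not ref").
    return sum(1 for a in alleles if a != "0")


def vcf_to_matrix(handle):
    """Read a multi-sample vcf from a text handle and return
    (samples, snp_ids, matrix), where matrix[i][j] is the dosage of
    SNP j in sample i, i.e. shape (n samples, m SNPs)."""
    samples, snp_ids, per_snp = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue                           # skip blank and meta lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]               # sample names follow FORMAT
            continue
        chrom, pos, vid = fields[0], fields[1], fields[2]
        snp_ids.append(vid if vid != "." else f"{chrom}:{pos}")
        per_snp.append([gt_to_dosage(f) for f in fields[9:]])
    # per_snp is (m SNPs, n samples); transpose to (n samples, m SNPs).
    matrix = [list(row) for row in zip(*per_snp)]
    return samples, snp_ids, matrix


# Hypothetical two-sample, two-SNP vcf used only to show the shape.
example = io.StringIO(
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/0\t0/1\n"
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t1/1:12\t./.:0\n"
)
samples, snp_ids, matrix = vcf_to_matrix(example)
```

If pandas is available, `pandas.DataFrame(matrix, index=samples, columns=snp_ids)` would wrap the result in a labelled (samples x SNPs) table.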
see Extracting Genotype Information From Vcf