Hello, I need to collect a set of ~330,000 SNPs from a set of extended-VCF files (available here and here). The only information I have on these SNPs are there dbSNP rs#s
I've tried to do this using vcftools, but vcftools f'ed up the extended format aspects of the VCF files, and thus the output was useless.
I have been told by one of the individuals who generated these original files that I should NOT use vcftools or GATK, and I SHOULD use a simple VCF parser (SAMtools library/tabix implementation, e.g. pysam). I am fairly noob, and I don't know how to do this or most of what this means. I know some Perl, Python, and bash in case it is relevant. Can anyone help me figure out what I should be doing here? Thanks!
-Deven
By collect a set of snps do you mean grep the line? Do you have genomic positions corresponding to the SNPs in your list, or just the rsID (could be problematic depending on your dbsnp build)
If this is sequencing data, I would avoid plink. It just breaks too much...
Pretty sure I can help once I know this. Good luck.
I just have the rsIDs. The VCF files fortunately do have the rsIDs listed for most or all variants that have one.
In regards to "By collect a set of snps do you mean grep the line?" essentially yes.
Step 1 is to cull out the lines I don't need and keep the lines I do need and keep these lines unmodified (i.e., I need the VCF header lines and the lines for the 330k SNPs)
Step 2 is to filter down the SNPs based on any possible stand mis-identifications or post-mortem modifications due to ancient DNA
Steps 3 and 4 and beyond, I don't want to start thinking about yet.
OK - Just a few more things.
Are you certain the rsIDs are on the same build? The answer to that is almost certainly no, but I have to ask.
If you can be sure the rsIDs will correspond, the problem is simple. If not, we need to do it by genomic position.
All the rsIDs in my list are extracted from the same SNP array (Illumina HumanCNV370-Quad). The VCF file sets are using the Hg19 genome build as a reference. I don't know what dbSNP build the SNP chip's list comes from, and I don't know what build the rsID labels in the VCF files come from either.
How often do rsIDs move? The notation for genomic position might change, but my understanding (which very well could be wrong) is once a polymorphism at a given nucleotide is given an rsID, that rsID cannot change unless it turns out one site has been given more than one rsID