Fair warning: I am pretty much a noob when it comes to nuclear NGS data.
Background: I am a molecular anthropology grad student. A little over a year ago, I got back Illumina HiSeq 2000 data from 90 mitochondrial-enriched libraries. Another grad student in the lab got back the same kind of data from 92 NRY-enriched libraries, with some sample overlap. I now have 171 BAM files that I have aligned to hg19.
I am now trying to see what I can do with the "junk" data (i.e., the autosomal + X data). I want to see if there is enough good data to do some variant calling. I have SNP data from 64 samples at ~330,000 rsIDs (the data is from an old set of HumanCNV370-Quads from 2008, and I don't have the genomic coordinates). There is some overlap between the genotyped individuals and the sequenced individuals. I was wondering if anyone can give me some help, advice, or suggestions.
I need to convert the BAM files to VCFs, get rsIDs into the VCF files, filter the VCF files based on quality and type (I am only interested in SNPs, not indels, microsat variation, etc.), and then see how much overlap there is between the called variants and the actual SNP data.
I have an idea of how to convert the BAM files to VCF files, but beyond that I am lost.
Thanks!
-Deven
Yes. GKNO is a wrapper for the variant calling programs. You can check out the different variant callers while you wait to get the pipeline set up. Two of the more popular tools are samtools and GATK. You can also have a look at this paper, which will help you get acquainted with the methods.
So I've read through the article and supplemental, and I am still sufficiently confused. I know (or have an idea of) what tools are out there; what I am trying to figure out is how to actually use them.
I have already used samtools, bcftools, and vcftools to get my BAM files into VCF files following this guide (http://ged.msu.edu/angus/tutorials-2012/snp_tutorial.html), and I was able to figure out how to filter out the non-autosomal sites (i.e., X, Y, & mt).
My concerns now are: how do I filter out all non-SNPs, how do I filter out low-quality sites, and how do I get the rsIDs for whatever is left?
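For what it's worth, here is a minimal Python sketch of those three steps on plain-text VCF lines, plus the overlap check against the array rsIDs. It is not a substitute for the dedicated tools, just a way to see the logic. Everything named here is an assumption for illustration: the function names, the QUAL cutoff of 30, the toy records, and the placeholder rsIDs are all made up. It also assumes you can build a (chromosome, position) to rsID lookup yourself (for example from a dbSNP VCF on the same genome build), and that chromosome names match between that lookup and your calls.

```python
import io

BASES = {"A", "C", "G", "T"}

def is_snp(ref, alt):
    """True when REF and every ALT allele are single bases (so no indels)."""
    return ref.upper() in BASES and all(a.upper() in BASES for a in alt.split(","))

def filter_and_annotate(vcf_lines, rsid_by_site, min_qual=30.0):
    """Pass header lines through; keep SNP records with QUAL >= min_qual.

    rsid_by_site is a hypothetical dict mapping (chrom, pos) -> rsID.
    When a kept site is in the dict, its ID column is overwritten.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            yield line
            continue
        fields = line.split("\t")
        chrom, pos, _old_id, ref, alt, qual = fields[:6]
        if not is_snp(ref, alt):
            continue  # drops indels and other multi-base variants
        if qual == "." or float(qual) < min_qual:
            continue  # drops sites with missing or low QUAL
        rsid = rsid_by_site.get((chrom, int(pos)))
        if rsid is not None:
            fields[2] = rsid
        yield "\t".join(fields)

def called_rsids(vcf_lines):
    """Collect the rsIDs found in the ID column of non-header records."""
    ids = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        vcf_id = line.split("\t")[2]
        if vcf_id.startswith("rs"):
            ids.add(vcf_id)
    return ids

# Toy data only: positions, qualities, and rsIDs are invented placeholders.
example_vcf = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1000\t.\tA\tG\t50\t.\tDP=12\n"
    "chr1\t2000\t.\tAT\tA\t80\t.\tDP=20\n"   # indel, removed
    "chr1\t3000\t.\tC\tT\t10\t.\tDP=3\n"     # low QUAL, removed
    "chr2\t4000\t.\tG\tA,T\t45\t.\tDP=9\n"   # multi-allelic SNP, kept
)
rsid_lookup = {("chr1", 1000): "rs0000001", ("chr2", 4000): "rs0000002"}
array_rsids = {"rs0000001", "rs0000003"}  # stand-in for the array's rsID list

kept = list(filter_and_annotate(io.StringIO(example_vcf), rsid_lookup))
overlap = called_rsids(kept) & array_rsids

print("\n".join(kept))
print("sites shared with the array:", sorted(overlap))
```

The QUAL cutoff is arbitrary here; with low-coverage off-target reads you would probably also want to look at depth before trusting any call. Matching by rsID only tells you which sites overlap, not whether the genotypes agree, so a concordance check on the shared individuals would be the next step.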