I have three SNP data sets: hapmapCHBb37.txt, obtained by SNP calling from the original HapMap .CEL files; kgCHBb37.txt, from the 1000 Genomes VCF files; and sampleCHBb37.txt, from our own whole-genome sequencing data (15x coverage, 50 samples). There is some discordance among these three data sets. Taking the HapMap data as the reference, kgCHBb37.txt contains some allele errors (A/G vs. C/G) and some reversed alleles (A/G, C/T). So I extracted the samples shared between hapmapCHBb37.txt and kgCHBb37.txt and corrected the alleles (for A/T and C/G SNPs I had to use the allele frequency), which gave me an allele correction file, correct.snp. I checked correct.snp against the original hapmapCHBb37.txt and kgCHBb37.txt, and the result is good. But when I use correct.snp to correct the alleles in sampleCHB_b37.txt, the result is very bad. I'm wondering:
1> What is the best strategy for integrating SNPs from whole-genome sequencing and Affy 6.0 array data?
2> Maybe the quality control of sampleCHB_b37 is not good. I use Bowtie (-k 2 -v 2) to align the FASTQ reads to the reference genome (hg19.fasta) and produce SAM files, then convert SAM to BAM, sort the BAM, convert the BAM to BCF (D2000), and convert the BCF to VCF with samtools. Is this pipeline reliable for calling SNPs from FASTQ?
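To make the allele correction step concrete, here is a minimal sketch of how one SNP's allele pair could be classified against the HapMap reference. It is not the actual procedure behind correct.snp. It assumes alleles are written as strings like "A/G", that "reversed" means the alleles are reported on the opposite strand, and that only single-base alleles occur; the function name `classify_alleles` is hypothetical.

```python
# Hypothetical helper: compare a test allele pair with a reference allele pair.
# Assumes single-base alleles written as "A/G"; indels or N are not handled.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def classify_alleles(ref_pair, test_pair):
    """Return 'match', 'strand_flip', 'ambiguous', or 'mismatch'."""
    ref = frozenset(ref_pair.upper().split("/"))
    test = frozenset(test_pair.upper().split("/"))
    # Alleles of the test SNP as they would read on the opposite strand.
    flipped = frozenset(COMPLEMENT[a] for a in test)

    if ref == test and ref == flipped:
        # A/T and C/G SNPs are their own complement, so the strand cannot be
        # inferred from the alleles alone (allele frequency is needed).
        return "ambiguous"
    if ref == test:
        return "match"
    if ref == flipped:
        return "strand_flip"
    return "mismatch"


if __name__ == "__main__":
    for ref, test in [("A/G", "A/G"), ("A/G", "C/T"), ("A/G", "C/G"), ("A/T", "T/A")]:
        print(ref, test, classify_alleles(ref, test))
```

SNPs that come out as "mismatch" cannot be fixed by flipping the strand, and "ambiguous" SNPs are the A/T and C/G cases where I had to fall back on allele frequency.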
Each person has their own SNPs, so the results from different samples' sequencing data will differ. I have a lot of genotype data, and when I compared it with HapMap there was some discordance, too. So do not worry.
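One way to put a number on the discordance is to compute the genotype concordance over the samples and SNPs that two data sets share. The following is only a minimal sketch: it assumes genotypes are already loaded into dictionaries keyed by (sample, SNP ID) with unphased genotype strings such as "AG", and the function name `genotype_concordance` is hypothetical.

```python
# Hypothetical helper: fraction of identical genotype calls between two sets.
# Assumes each dict maps (sample_id, snp_id) -> genotype string like "AG",
# with None marking a missing call.
def genotype_concordance(calls_a, calls_b):
    """Return (number of genotypes compared, fraction that agree)."""
    shared_keys = calls_a.keys() & calls_b.keys()
    compared = 0
    agree = 0
    for key in shared_keys:
        geno_a, geno_b = calls_a[key], calls_b[key]
        if geno_a is None or geno_b is None:
            continue  # skip missing calls
        compared += 1
        # Genotypes are unphased, so "AG" and "GA" count as the same call.
        if sorted(geno_a.upper()) == sorted(geno_b.upper()):
            agree += 1
    rate = agree / compared if compared else float("nan")
    return compared, rate


if __name__ == "__main__":
    array_calls = {("s1", "rs1"): "AG", ("s1", "rs2"): "CC", ("s2", "rs1"): "GA"}
    seq_calls = {("s1", "rs1"): "GA", ("s1", "rs2"): "CT", ("s2", "rs1"): None}
    print(genotype_concordance(array_calls, seq_calls))
```

Only the same sample genotyped on both platforms is comparable this way; differences between different individuals are expected and are not errors.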