I am new to this field and I need to identify SNPs. I aligned my sequences to the reference using bwa and then used samtools to call variants, so I now have VCF files as my output. I viewed the VCF files in the IGV browser and saw that they contain a lot of noisy data. Can anybody help me with how to take this work further?
What species are you working with? If you install snpEff and an annotation is available for your species, you can annotate the VCF files and view the resulting report, which gives you some graphs and tables to help review your SNP data. You can then use the companion program SnpSift to filter on quality score thresholds you choose from the report, or to keep only homozygous SNPs, etc.
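To make the filtering idea concrete, here is a minimal Python sketch of the same kind of filter (minimum QUAL, optionally homozygous only). It is not SnpSift itself, just a plain-text illustration. The function names, the default threshold of 30 and the assumption of a single-sample VCF with a GT entry in the FORMAT column are all mine, so adjust them to what your report shows.

```python
def parse_genotype(format_field, sample_field):
    """Return the GT string (e.g. '0/1') from a FORMAT/sample column pair."""
    keys = format_field.split(":")
    values = sample_field.split(":")
    if "GT" not in keys:
        return None
    return values[keys.index("GT")]


def is_homozygous_alt(genotype):
    """True for diploid calls such as 1/1 or 2|2, False for 0/0, 0/1, ./."""
    if genotype is None:
        return False
    alleles = genotype.replace("|", "/").split("/")
    return (
        len(alleles) == 2
        and alleles[0] == alleles[1]
        and alleles[0] not in (".", "0")
    )


def filter_vcf_lines(lines, min_qual=30.0, homozygous_only=False):
    """Keep header lines plus records passing the QUAL / genotype checks.

    min_qual is a hypothetical default; pick it from your own data.
    Only the first sample column is inspected (single-sample assumption).
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            kept.append(line)  # keep all header lines unchanged
            continue
        fields = line.split("\t")
        qual = fields[5]
        if qual == "." or float(qual) < min_qual:
            continue
        if homozygous_only:
            if len(fields) < 10:
                continue  # no genotype columns, cannot decide
            if not is_homozygous_alt(parse_genotype(fields[8], fields[9])):
                continue
        kept.append(line)
    return kept
```

You would call it with something like `filter_vcf_lines(open("calls.vcf"), min_qual=50, homozygous_only=True)`, where `calls.vcf` is a placeholder file name.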
If you are working with human samples, you could try uploading your VCF files to GeneTalk (Analyze Human Sequence Variants). Your file will be annotated in the background during preprocessing, and you can then filter it by effect on protein level, mode of inheritance, genotype frequency, or annotations existing in databases (dbSNP, HGMD, 1kGP...).
Each time you filter a file, a new file is generated with fewer variants and with the filtering settings recorded in the header. If you want (and if you have the sharing consent), you can share the data with a colleague who is registered at GeneTalk and collaborate. Unless you share it, the data is stored only in your account and only you have access to it.
So far, the GeneTalk platform is freely accessible.
What species are you working with? My suggestion is to call variants with GATK, using the extended options such as base quality recalibration and realignment. There is always some probability of getting false positives because of errors introduced in the background processing. We cannot remove those errors completely, but we can reduce their proportion. Afterwards you can apply the suggestions given by Alexej and Rob.
Using a realigner before samtools or GATK is recommended to reduce false positives. However, GATK SNP calling and base recalibration depend on a file of known, accurate SNPs, and at the moment such files are available for only a limited number of species. If none is available for yours, consider whether GATK is worth using, even though it is the gold standard for human, where these files are available. I believe you can create a known file by filtering a samtools SNP VCF file for SNPs of high quality, i.e. 214+, and use this as the known file in GATK if no known file is available.
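As a rough illustration of building such a known file, the Python sketch below keeps only single-base substitutions with QUAL at or above 214 and writes them, with the original header, to a new VCF. The function names and file paths are hypothetical, the 214 cutoff is simply the figure mentioned above, and I am assuming the input is a plain-text (uncompressed) VCF. GATK will probably also expect the result to be sorted and indexed, which this sketch does not do.

```python
def is_snp(ref, alt):
    """True if REF and every ALT allele are single A/C/G/T bases (no indels)."""
    bases = {"A", "C", "G", "T"}
    return ref in bases and all(allele in bases for allele in alt.split(","))


def high_confidence_snps(lines, min_qual=214.0):
    """Yield header lines and SNP records whose QUAL is >= min_qual."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            yield line  # header is passed through unchanged
            continue
        chrom, pos, var_id, ref, alt, qual = line.split("\t")[:6]
        if qual == ".":
            continue  # no quality recorded, cannot be trusted as "known"
        if float(qual) >= min_qual and is_snp(ref.upper(), alt.upper()):
            yield line


def write_known_sites(in_path, out_path, min_qual=214.0):
    """Write the high-confidence subset of in_path to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in high_confidence_snps(src, min_qual):
            dst.write(line + "\n")
```

For example, `write_known_sites("samtools_calls.vcf", "known_snps.vcf")`, with both file names being placeholders for your own files.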
The species is Arabidopsis, a plant genome.