Hello Everyone,
I was trying FreeBayes on a small BAM file and wanted to compare the quality of its output (vcf1) against the output of GATK HaplotypeCaller (vcf2), but I found a huge difference between them: for instance, vcf1 was 4.4M and vcf2 was 140M.
When I tried to visualize both files, I found that vcf2 has many more variants discovered than vcf1.
Does anyone know why there is such a difference? Does it have anything to do with GATK using known sites like dbSNP while FreeBayes doesn't?
I want to know because I was wondering whether FreeBayes can be used instead of GATK for calling variants without affecting quality.
FreeBayes command line:
freebayes -f human_g1k_v37.fasta out.bam > ~/out.vcf
GATK command line:
java -jar gatk3.jar \
-T HaplotypeCaller \
-R human_g1k_v37.fasta \
-D dbsnp_vcf.vcf \
-o out.vcf \
-pairHMM VECTOR_LOGLESS_CACHING \
--emitRefConfidence GVCF \
--variant_index_type LINEAR \
--variant_index_parameter 128000 \
-A DepthPerAlleleBySample \
-stand_call_conf 30 \
-stand_emit_conf 10
Thanks in advance,
Shazly
Is 140M the number of SNPs?
I think it is the file size, based on the following sentence.
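Since file size and record count are easy to mix up, it may help to count records directly instead of comparing sizes. Below is a minimal sketch, not a definitive method. The helper names count_records and count_variants are made up for illustration. It assumes the GATK output is a gVCF (the command above passes --emitRefConfidence GVCF), in which reference-only rows carry <NON_REF> as their only ALT allele, and it treats any row with another ALT value as a variant record.

```shell
# Hypothetical helpers for comparing two VCF files by record count.

# Count every non-header line (all records, including gVCF reference blocks).
count_records() {
    grep -vc '^#' "$1"
}

# Count only records that report a real alternate allele:
# skip header lines, rows with ALT ".", and rows whose ALT column
# (field 5) is exactly <NON_REF> (assumed to be gVCF reference rows).
count_variants() {
    awk -F'\t' '!/^#/ && $5 != "." && $5 != "<NON_REF>" { n++ } END { print n + 0 }' "$1"
}
```

For example, running count_records and count_variants on each of the two VCF files would show how much of the size gap comes from reference-only rows rather than from extra variant calls.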