Hello all,
I am a relatively new user of GATK, so my question may be somewhat basic. Can someone explain why the tags in the INFO field of GVCF files (as output by HaplotypeCaller) differ from those in standard VCF files (as output by GenotypeGVCFs)? I wouldn't be surprised if the answer is partly contained in my question, but I'm struggling to find any information online about how these annotations are derived in each case.
The reason I ask is that I am currently attempting to use GATK to call SNPs and indels from RNAseq data from pathogen-infected plants. I am at the stage where I need to apply hard filtering (I am working on a non-model organism with no prior data, so VQSR is not an option), and I am confused by the following. The GATK best practices for variant calling from RNAseq data seem to dictate that I run VariantFiltration directly after HaplotypeCaller (i.e. without using GenotypeGVCFs to generate a standard VCF file). However, the guidance on the GATK website for such filtering discusses filtering on many annotations that are not present in GVCF files, such as FisherStrand (FS) and StrandOddsRatio (SOR). Since these tags are not present in my GVCF files, I'm assuming that running a command like the one below would not actually do anything to the data?
gatk VariantFiltration \
    -R reference.fasta \
    -V sorted_dupsmarked.g.vcf \
    -O sorted_dupsmarked_filtered.g.vcf \
    --filter-name "FS60" \
    --filter-expression "FS > 60" \
    --filter-name "SOR" \
    --filter-expression "SOR > 3"
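To make the "tags are not present" point checkable, here is a minimal sketch of how one might count the records that actually carry a given INFO key. The function name count_records_with_tag and the two demo records are hypothetical and made up for illustration; they are not real HaplotypeCaller output. It assumes an uncompressed, tab-delimited VCF or GVCF.

```shell
# Hypothetical helper: count data records whose INFO column (column 8)
# contains the given key. $1 = uncompressed VCF/GVCF path, $2 = INFO key.
count_records_with_tag() {
    awk -F'\t' -v key="$2" '
        /^#/ { next }                       # skip header lines
        {
            n = split($8, fields, ";")      # INFO entries are ";"-separated
            for (i = 1; i <= n; i++) {
                split(fields[i], kv, "=")   # key=value, or a bare flag
                if (kv[1] == key) { hits++; break }
            }
        }
        END { print hits + 0 }              # print 0 when nothing matched
    ' "$1"
}

# Made-up demo file (assumption: not real data), one variant-like record
# and one reference-block-like record.
demo=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$demo"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo"
printf 'chr1\t100\t.\tA\tG\t50\t.\tDP=10;FS=1.2\n' >> "$demo"
printf 'chr1\t200\t.\tC\t<NON_REF>\t.\t.\tDP=8;END=250\n' >> "$demo"

# Report how many records carry each annotation of interest.
for tag in FS SOR DP; do
    echo "$tag: $(count_records_with_tag "$demo" "$tag") record(s)"
done
```

Running the same function on sorted_dupsmarked.g.vcf with FS and SOR would show whether any record has those keys to filter on. For a gzipped file, the input would need to be decompressed first.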
Furthermore, examples of methods in papers by other researchers frequently filter by these parameters, usually after using HaplotypeCaller and GenotypeGVCFs to generate their multi-sample VCF. So I wonder, are they doing something wrong by using GenotypeGVCFs on GVCFs derived from RNAseq data?
What am I missing here?
Thanks in advance for any attempts to help.