Hi there,
I have a problem, and I'm hoping someone might be able to point me in the right direction? I'd be very grateful.
I have a number of non-model plant RNA samples which we have sequenced deeply. We would like to conduct SNP discovery of these samples. Our ultimate goal with this genotypic data is to search for variants (both SNPs and indels) that fall within genes from a cohort of gene families, which may be responsible for phenotypic variation that we have assessed among our sample plants. But I've hit a point where i'm not 100% sure how to proceed.
I have used STAR aligner with default parameters to align my sample RNA files to the reference genome of the species I am dealing with, and then followed GATK best practices (https://gatk.broadinstitute.org/hc/en-us/articles/360035531192-RNAseq-short-variant-discovery-SNPs-Indels-) to produce a .VCF file for each of my samples. There are three things that give me pause, 1) I am not 100% percent sure I was correct in merging the .VCF files after calling SNPs, rather than merging .BAM files and then calling SNPs. 2) I'm not sure how to answer the question of what my filters should be re MAF, read-depth, base call quality, missing data, snp proximity to each other and proximity to indels, etc. It's been suggested to me that I should produce plots to make these decisions, but I'm not sure how to do that or what i should be looking for in such plots. 3) Lastly I am wondering if there is a way to deal with the problem of specific allele expression skewing the results, my assumption so far is that other studies seem to have accounted for this by merely using a low bar for the MAF filter (such as >= 0.1), thus ensuring that as many true SNPs as possible are captured.
Any insight into any of these questions, or a point in the direction of where I might find it, would be much appreciated.
Thanks for reading.