I have to do SNP calling on 40 or so samples, a few of which come from public sources. For some of these, the raw reads are not available.
So I thought of building a dummy paired-end FASTQ dataset by chopping that sample's reference sequence into pieces with a sliding window, so that the overlapping pieces add some coverage.
Any thoughts on this?
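A minimal sketch of the sliding-window idea, in Python. The function names, read length, fragment length, step size and the constant quality character are all assumptions chosen for illustration, not values from this thread. Depth comes out at roughly 2 * read_len / step, and the qualities are fake, so any quality-based metric for this sample would be meaningless.

```python
from typing import Iterator, Tuple

# Complement table for reverse-complementing the second read of each pair.
_COMP = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMP)[::-1]


def window_pairs(
    name: str,
    seq: str,
    read_len: int = 100,   # assumed read length
    frag_len: int = 300,   # assumed fragment (insert) length
    step: int = 25,        # assumed window step; smaller step = more coverage
) -> Iterator[Tuple[str, str, str]]:
    """Slide a fragment-sized window along seq and yield (id, read1, read2)."""
    for start in range(0, len(seq) - frag_len + 1, step):
        frag = seq[start:start + frag_len]
        read1 = frag[:read_len]              # forward read from the fragment start
        read2 = revcomp(frag[-read_len:])    # reverse read from the fragment end
        yield f"{name}_{start}", read1, read2


def fastq_record(read_id: str, seq: str, qual_char: str = "I") -> str:
    """Format one FASTQ record with a constant, made-up base quality."""
    return f"@{read_id}\n{seq}\n+\n{qual_char * len(seq)}\n"
```

Writing `fastq_record(rid, r1)` to one file and `fastq_record(rid, r2)` to a second file for every tuple from `window_pairs` would give the paired dataset. Sequence ends get less coverage than the middle with this scheme.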
I would remove all monomorphic calls for this sample and apply the default filters, such as removing SNPs in repeat regions and SNPs near indels.
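One possible reading of "monomorphic" is a site where every called sample carries the same set of alleles. Under that assumption, the check on a single VCF data line could be sketched like this. The function names are hypothetical, and the code relies on GT being the first FORMAT key, which the VCF specification requires when GT is present.

```python
import re


def called_alleles(sample_field: str) -> frozenset:
    """Return the set of allele indices in a sample's GT, ignoring missing ones."""
    gt = sample_field.split(":")[0]  # GT is assumed to be the first FORMAT key
    return frozenset(a for a in re.split(r"[/|]", gt) if a != ".")


def is_monomorphic(vcf_line: str) -> bool:
    """True if all samples with a call share the same allele set at this site."""
    fields = vcf_line.rstrip("\n").split("\t")
    genotypes = {called_alleles(s) for s in fields[9:]}  # sample columns start at 10
    genotypes.discard(frozenset())  # drop samples with no call at all
    return len(genotypes) <= 1
```

Note that a site where every sample is heterozygous for the same two alleles also counts as monomorphic under this definition, which may or may not be what is wanted.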
An alternative would be to compare the reference used for mapping with this sample's reference using MUMmer. But then I would have to integrate the calls into the VCF, and the SNP calling metrics would be absent for this sample.
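To give an idea of what integrating such calls into the VCF might involve, here is a rough Python sketch that appends one sample column to an existing VCF data line. It assumes the differences from the genome comparison have already been parsed into a dict `diffs` mapping `(chrom, pos)` to the base seen in the raw-less sample; that dict, the function names and the diploid homozygous genotype are all assumptions. It handles SNPs only, and it treats a position that is absent from `diffs` as matching the reference, which is wrong wherever the two genomes did not align.

```python
def genotype_from_comparison(chrom, pos, ref, alts, diffs, ploidy=2):
    """Build a homozygous GT string from a (chrom, pos) -> base lookup."""
    base = diffs.get((chrom, pos))
    if base is None or base == ref:
        allele = "0"                          # no difference reported: assume reference
    elif base in alts:
        allele = str(alts.index(base) + 1)    # ALT alleles are numbered from 1
    else:
        allele = "."                          # base is not among the called alleles
    return "/".join([allele] * ploidy)


def append_sample(vcf_line, diffs):
    """Return the VCF data line with one extra sample column appended."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos = fields[0], int(fields[1])
    ref, alts = fields[3], fields[4].split(",")
    gt = genotype_from_comparison(chrom, pos, ref, alts, diffs)
    n_keys = len(fields[8].split(":"))        # number of FORMAT keys, GT assumed first
    # Every per-sample metric other than GT is unknown, so it is written as missing.
    fields.append(":".join([gt] + ["."] * (n_keys - 1)))
    return "\t".join(fields)
```

The `#CHROM` header line would need the new sample name appended as well, and all metrics other than GT stay missing for this sample.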
I don't like either option very much, but I don't really see an alternative.
Thanks for any suggestions.
What's the goal of your analysis? Do you want dummy SNPs for the samples where you don't have raw data? Your question and approach are not clear.
Well, we want to know the genotype call for the raw-less sample(s), provided that the calls from the other samples are confident enough to accept the SNP. We will then filter downstream for assay design, taking the calls for all samples into account. So I would probably not use the raw-less calls for quality filtering.
The filtering will be done on the VCF, so I need the calls in there.
Removing the monomorphs will discard some true positives, but we're not really interested in those, so that's not a big problem, except that if they fall in the flanking sequence they could hamper the assay.
I guess I have to weigh which is more work: creating the dummy FASTQ or adding the calls from the direct reference comparison to the VCF.