Hi all,
I have a WGS dataset in which indels were also called. I would now like to turn it into a reference dataset for the local population I am working on. But before using it as a reference panel for imputing exome data for samples from this population, I need to phase the WGS dataset.
The problem is that SHAPEIT and MaCH can't handle indels in the input files, so the indels usually have to be removed before phasing. My question is: how do I phase the WGS dataset while preserving the indels in it?
Any help is greatly appreciated!
Thanks for your reply. But if I remove them, how would I be able to impute them in my genotyped samples? That way I would lose the indels that were specifically called earlier. And since I wouldn't be able to impute them in my samples, I couldn't use them in any downstream analysis either.
Nucleic acid molecules do not comprise any gap residues. Thus a sequence representing a real molecule must not have gaps. Make a copy of your gapped sequence before you delete the gaps.
Piet, I assume that your explanation doesn't concern diploid organisms? Because for phasing, you are effectively comparing two sequences, one per allele. It's indeed a bit annoying that you have to remove indels for phasing, but I understand it's an additional complexity the makers of the tool would like to avoid. What I would consider (and this is a bad workaround, which will likely come back and hurt you unexpectedly) is to replace every indel variant with an artificial SNP variant. You keep track of the positions where you did this nasty trick and put the real indels back afterwards. I'm not sure what your phasing algorithm is based on. Doesn't hurt to try? It might just screw up everything, but in that case we'll be smarter the next time a similar problem arises.
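For illustration, the substitution trick described above could be sketched roughly like this in Python. This is only a sketch under assumptions, not a tested pipeline: the function names `mask_indels` and `restore_indels` and the placeholder alleles A/C are made up for this example, it works on plain uncompressed VCF text lines, it only touches biallelic records, and it assumes the phasing tool leaves CHROM, POS, REF and ALT unchanged.

```python
def mask_indels(vcf_lines, placeholder=("A", "C")):
    """Swap REF/ALT of biallelic indels for a placeholder SNP.

    Returns the masked lines plus a lookup of the original alleles,
    keyed by (CHROM, POS). Names and placeholder are hypothetical.
    """
    masked, lookup = [], {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):          # header lines pass through
            masked.append(line)
            continue
        fields = line.split("\t")
        chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
        # Simple indel test: one ALT allele whose length differs from REF.
        if "," not in alt and len(ref) != len(alt):
            key = (chrom, pos)
            if key in lookup:
                raise ValueError(f"two indels at {chrom}:{pos}")
            lookup[key] = (ref, alt)
            fields[3], fields[4] = placeholder
        masked.append("\t".join(fields))
    return masked, lookup


def restore_indels(phased_lines, lookup, placeholder=("A", "C")):
    """Put the original indel alleles back after phasing."""
    restored = []
    for line in phased_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            restored.append(line)
            continue
        fields = line.split("\t")
        key = (fields[0], fields[1])
        # Only restore records that still carry the placeholder alleles.
        if key in lookup and (fields[3], fields[4]) == placeholder:
            fields[3], fields[4] = lookup[key]
        restored.append("\t".join(fields))
    return restored
```

You would run `mask_indels` on the input records, phase the masked file, then run `restore_indels` on the phased output with the saved lookup. One known weakness: a real A>C SNP at the same position as a masked indel cannot be told apart from the placeholder, so such sites would need separate handling.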
That's an idea, WouterDeCoster! I know it may interfere with shapeit2's haplotype estimation. I will try this and see if I get something meaningful when I phase them. But it's strange that shapeit2 still refuses to take indels, even though the latest 1kg reference dataset has indels in it.