How to phase a WGS dataset preserving the indels within it?
8.5 years ago
Shab86 ▴ 310

Hi all,

I have a WGS dataset in which indels were also called. I would like to turn it into a reference panel for the local population I am working on. Before using it as a reference for imputing exome data from samples of this population, I need to phase the WGS dataset.

The problem is that SHAPEIT and MaCH can't handle indels in the input files, so these usually have to be removed before phasing. How can I phase the WGS dataset while preserving the indels in it?

Any help is greatly appreciated!

SNP sequencing genome indel phasing • 2.5k views
8.5 years ago
piet ★ 1.9k

Insertions and deletions are a concept that only matters when you compare two sequences. Technically, you cannot have indels within a single sequence string. You may have an annotation on a sequence telling you that a particular region is an insertion (e.g. not present in other sequences from the same species).

You may be referring to gaps in your sequence. Gaps are usually written as '-'. You should remove them if an application does not allow them. For example, it is forbidden to submit sequences with gaps to GenBank.


Thanks for your reply. But if I remove them, how would I be able to impute them in my genotyped samples? I would lose the indels that were specifically called earlier. And since I wouldn't be able to impute them in my samples, I couldn't use them in any downstream analysis either.


Nucleic acid molecules do not comprise any gap residues. Thus a sequence representing a real molecule must not have gaps. Make a copy of your gapped sequence before you delete the gaps.


Piet, I assume your explanation doesn't apply to diploid organisms? For phasing you are effectively comparing two sequences, namely both alleles. It is indeed a bit annoying that you have to remove indels before phasing, but I understand it's an additional complexity the makers of the tool would like to avoid. What I would consider (and it is a bad workaround that will likely come back and hurt you unexpectedly) is to replace every indel variant with an artificial SNP variant. You keep track of the positions where you did this nasty trick and put the real indels back afterwards. I'm not sure what your phasing algorithm is based on, but it doesn't hurt to try. It might just screw up everything, but in that case we'll be smarter the next time a similar problem arises.
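To make the trick concrete, here is a rough sketch of the mask-and-restore step, not a tested pipeline. It assumes a plain-text VCF with biallelic records, and that the phasing tool passes the ID column through unchanged and does not care which placeholder alleles it sees. The function names and the `MASKED_INDEL_` tag are made up for illustration. Multiallelic records are left untouched and would need separate handling.

```python
def mask_indels(vcf_lines):
    """Replace biallelic indels with placeholder A>C SNPs.

    Returns the masked lines and a lookup from the placeholder ID
    to the original (ID, REF, ALT), so the records can be restored.
    """
    masked, lookup = [], {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            masked.append(line)  # header lines pass through unchanged
            continue
        fields = line.split("\t")
        ref, alt = fields[3], fields[4]
        # Biallelic indel: single ALT whose length differs from REF.
        if "," not in alt and len(ref) != len(alt):
            tag = f"MASKED_INDEL_{len(lookup)}"  # hypothetical unique ID
            lookup[tag] = (fields[2], ref, alt)
            fields[2], fields[3], fields[4] = tag, "A", "C"
        masked.append("\t".join(fields))
    return masked, lookup


def restore_indels(phased_lines, lookup):
    """Put the original ID, REF and ALT back on masked records."""
    restored = []
    for line in phased_lines:
        line = line.rstrip("\n")
        if line.startswith("#"):
            restored.append(line)
            continue
        fields = line.split("\t")
        if fields[2] in lookup:
            fields[2], fields[3], fields[4] = lookup[fields[2]]
        restored.append("\t".join(fields))
    return restored
```

You would run `mask_indels` on the unphased VCF, phase the masked file, and then run `restore_indels` on the phased output with the same lookup. Keying on a unique ID rather than on position avoids clashes when a SNP and an indel share the same coordinate. Whether the resulting haplotypes are trustworthy around the masked sites is a separate question that this sketch does not answer.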


That's an idea, WouterDeCoster! I suspect it may interfere with SHAPEIT2's haplotype estimation, but I will try it and see whether the phased output is meaningful. It is strange, though, that SHAPEIT2 still refuses indels when even the latest 1000 Genomes reference dataset includes them.
