Post-GWAS: Determining whether SNPs represent an indel, given only SNPs
0
1
Entering edit mode
5.8 years ago
michael.nagle ▴ 100

A number of SNPs appear to be associated with a trait of interest. I wish to look into all of these SNPs and get an idea of how they affect genes and whether they correspond to insertions or deletion mutations (indels).

I do not have access to genome sequence data for the GWAS population, only SNP files (28M SNPs for ~900 individuals with ~350Mb genome), which I have in a number of formats including .bed, .ped, .tped, and .vcf. The available dataset only includes the SNP identity, which is A, G, C, T or missing. This is from high-coverage genome data and missing SNPs presumably correspond to indels.

Given the information in these files, is there a way for me to find out if SNPs correspond to indels or point mutations? I could just identify SNPs that are missing in some genotypes and assume there's an indel at each specific marker position, but I hope to actually be able to figure out how long the indel is and where it starts and stops, and to do this on a high-throughput scale automatically for hundreds of SNPs.

Is there an existing GUI, command line or R interface for determining if SNPs correspond to indels, or better yet, reconstructing genomes from SNP files so that I can align different alleles and see clearly how SNPs can correspond to indels?

Edit: This is for a non-model plant organism.
Edit 2: Please find below a sample of my .vcf enter image description here

GWAS • 2.4k views
ADD COMMENT
0
Entering edit mode

Do you have rsIDs of the variants?

ADD REPLY
0
Entering edit mode

No, I'm working with a non-model plant organism.

ADD REPLY
0
Entering edit mode

Can you show us a sample of your VCF? The information should be in there. If not, it's not VCF and you need to send it back to whoever made it for you.

ADD REPLY
0
Entering edit mode

I've edited my post to include a sample of the VCF file. Thanks.

ADD REPLY
0
Entering edit mode

Is your VCF data now in Excel format, or in a R object? - not much good there. To view indels, and if you still have the original VCF, you simply need to do:

bcftools view --types indels my.vcf
ADD REPLY
0
Entering edit mode

My VCF is in VCF format, but for the purpose of this post I opened it in a spreadsheet program and took a screenshot so that it would be human-readable.

This bcftools command would work if indels were called in this VCF. As you can see in the screenshot, only SNPs have been called. My question is about determining where indels are, given SNP data. It is common in GWAS for all variation to be represented one base at a time, as SNP calls, without indels being called separately.

My VCF shows gaps at many positions for certain genotypes, followed by series of SNPs, and from this there should be a way to call indels.

ADD REPLY
1
Entering edit mode

In VCF indels should be represented like GA A for deletions or G GA for insertions. If you do not have this representation in your VCF then it is not correct VCF. You need to send this back to the source and ask for the correct version.

ADD REPLY
0
Entering edit mode

I see - thanks. I am not aware of anything that takes a VCF and calls indels from it. Obviously, it would be better to go back to the BAM so that you can take a look at the reads and make indel calls from that stage.

Are you sure that these gaps are not due to other reasons, such as QC and / or probe failure?

ADD REPLY
0
Entering edit mode

I doubt all the gaps are indels. I'll try getting the sequence data. Thanks.

ADD REPLY

Login before adding your answer.

Traffic: 1705 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6