A number of SNPs appear to be associated with a trait of interest. I wish to look into all of these SNPs and get an idea of how they affect genes and whether they correspond to insertions or deletion mutations (indels).
I do not have access to genome sequence data for the GWAS population, only SNP files (28M SNPs for ~900 individuals with ~350Mb genome), which I have in a number of formats including .bed, .ped, .tped, and .vcf. The available dataset only includes the SNP identity, which is A, G, C, T or missing. This is from high-coverage genome data and missing SNPs presumably correspond to indels.
Given the information in these files, is there a way for me to find out if SNPs correspond to indels or point mutations? I could just identify SNPs that are missing in some genotypes and assume there's an indel at each specific marker position, but I hope to actually be able to figure out how long the indel is and where it starts and stops, and to do this on a high-throughput scale automatically for hundreds of SNPs.
Is there an existing GUI, command line or R interface for determining if SNPs correspond to indels, or better yet, reconstructing genomes from SNP files so that I can align different alleles and see clearly how SNPs can correspond to indels?
Edit: This is for a non-model plant organism.
Edit 2: Please find below a sample of my .vcf
Do you have rsIDs of the variants?
No, I'm working with a non-model plant organism.
Can you show us a sample of your VCF? The information should be in there. If not, it's not VCF and you need to send it back to whoever made it for you.
I've edited my post to include a sample of the VCF file. Thanks.
Is your VCF data now in Excel format, or in a R object? - not much good there. To view indels, and if you still have the original VCF, you simply need to do:
My VCF is in VCF format, but for the purpose of this post I opened it in a spreadsheet program and took a screenshot so that it would be human-readable.
This bcftools command would work if indels were called in this VCF. As you can see in the screenshot, only SNPs have been called. My question is about determining where indels are, given SNP data. It is common in GWAS for all variation to be represented one base at a time, as SNP calls, without indels being called separately.
My VCF shows gaps at many positions for certain genotypes, followed by series of SNPs, and from this there should be a way to call indels.
In VCF indels should be represented like GA A for deletions or G GA for insertions. If you do not have this representation in your VCF then it is not correct VCF. You need to send this back to the source and ask for the correct version.
I see - thanks. I am not aware of anything that takes a VCF and calls indels from it. Obviously, it would be better to go back to the BAM so that you can take a look at the reads and make indel calls from that stage.
Are you sure that these gaps are not due to other reasons, such as QC and / or probe failure?
I doubt all the gaps are indels. I'll try getting the sequence data. Thanks.