Hi,
I have a list of INDELS in their rsid format, and I am trying to convert this list from the rsids to a format like the one coming out of the imputation from MACH/Minimac, i.e. chr:pos:ALLELES.
I have tried using biomaRt to find the chr, positions, and alleles corresponding to the rsids like this:
library(biomaRt)
snpmart = useMart("snp", dataset="hsapiens_snp")
getBM(c("refsnp_id","allele","chr_name","chrom_start"), values="rs200623867", filters="snp_filter", mart=snpmart)
But I get this for an INSERTION:
refsnp_id allele chr_name chrom_start
rs146107628 -/T 10 100002842
That I would like to convert to this format:
10:100002841:C_CT I R
While for a DELETION:
rs200623867 G/- 10 100003302
That I would like to convert to this:
10:100003301:AG_A D R
So it looks like I am missing the information about the other allele when using biomaRt.
Is there maybe a better approach to completing this conversion in R?
Thank you!
Simone
Thank you for your reply! But even if I go one base before, I still have trouble finding the "other" allele using biomaRt. For example, how would I know to convert -/T at 10:100002842 to 10:100002841:C_CT if I don't know that the other allele is C and I only get T from biomaRt?
Well, if you fetch the base one position upstream, that base is C, so -/T with the upstream base C becomes C_CT.
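To make the rule concrete, here is a minimal sketch in base R of how the ID could be assembled once the upstream base is known. The function name make_indel_id and its arguments are hypothetical, and it assumes the convention seen in the two examples from the question: the position is chrom_start - 1 and the upstream (anchor) base is prefixed to both alleles. It only handles biallelic records such as -/T or G/-.

```r
# Hypothetical helper (not part of biomaRt): build chr:pos:REF_ALT from a
# biomaRt-style indel record plus the reference base one position upstream.
make_indel_id <- function(chr, chrom_start, allele, anchor) {
  parts <- strsplit(allele, "/", fixed = TRUE)[[1]]
  stopifnot(length(parts) == 2)          # biallelic records only
  parts[parts == "-"] <- ""              # "-" means the empty allele
  ref <- paste0(anchor, parts[1])        # anchor base + first allele
  alt <- paste0(anchor, parts[2])        # anchor base + second allele
  pos <- as.integer(chrom_start) - 1L    # integer avoids 1e+08-style output
  paste0(chr, ":", pos, ":", ref, "_", alt)
}

make_indel_id(10, 100002842, "-/T", "C")  # "10:100002841:C_CT"
make_indel_id(10, 100003302, "G/-", "A")  # "10:100003301:AG_A"
```

The trailing columns in the target format (I/D and R) are not produced here; the I or D flag could be derived from which side of the slash holds the "-".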
Thanks Emily, this seems very clever. Do you know if there is a way using biomaRt to request the upstream position based on rsids in one go, or do you think I first need to find the chr:pos for each of the variants listed in the rsids object, and then in a second step find the alleles at their chr:pos-1?
I am just trying to understand the quickest way to convert these IDs inside an R script. Maybe biomaRt isn't the best way to go if it requires many steps, but I don't know of another package that could help with this...
Thanks again for your advice!
In the variant mart, there's a section called Sequences where you can specify the upstream sequences plus the alleles.
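From R, that Sequences section could be queried roughly as sketched below. This is an untested sketch resting on several assumptions: that the sequence attribute is named "snp", that the flank filters are named "upstream_flank" and "downstream_flank", that checkFilters = FALSE is needed for them to be accepted, and that the returned sequence is laid out as upstream%alleles%downstream. Please check these against listAttributes() and listFilters() for your mart. The helper names fetch_flanked and upstream_anchor are hypothetical. Positions are left out of the query on purpose, since BioMart may refuse to mix sequence attributes with attributes from other pages, so they would come from a separate getBM call.

```r
# Hypothetical wrapper: request one flanking base on each side of each variant.
# Attribute and filter names are assumptions; verify them for your mart.
fetch_flanked <- function(rsids, mart) {
  biomaRt::getBM(
    attributes   = c("refsnp_id", "snp"),
    filters      = c("snp_filter", "upstream_flank", "downstream_flank"),
    values       = list(rsids, 1, 1),
    mart         = mart,
    checkFilters = FALSE
  )
}

# Hypothetical parser: take the last upstream base from a string assumed to
# look like "upstream%alleles%downstream".
upstream_anchor <- function(flanked) {
  upstream <- vapply(strsplit(flanked, "%", fixed = TRUE), `[`, character(1), 1)
  substr(upstream, nchar(upstream), nchar(upstream))
}

upstream_anchor("C%-/T%A")  # made-up input string, for illustration only
```

If the layout holds, the anchor base from upstream_anchor together with the alleles and chrom_start from the original query is enough to build the chr:pos:ALLELES ID without a second position-based lookup.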