Hi Biostars,
I would like to learn how to convert genome positions (e.g., Chr6: 467841) into other useful identifiers and annotations. For example, I use vcftools to extract only SNPs in the ".012" format, which also writes the site locations (i.e., genome positions) to a ".012.pos" file. I use the following command:
vcftools --vcf xxx.vcf --out SNP --remove-indels --012
This creates "SNP.012", which contains only 0, 1, and 2 values, and "SNP.012.pos", which contains the site locations, like:
Chr1 2673
Chr1 2695
Chr1 2696
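To make the input concrete, here is a minimal sketch (in Python, standard library only) of reading such a position listing into (chromosome, position) pairs. The function name `read_positions` is made up for illustration, and the parser assumes each line holds a chromosome name and an integer position separated by whitespace, as in the lines above.

```python
import io

def read_positions(handle):
    """Parse lines like 'Chr1 2673' into (chromosome, position) tuples."""
    sites = []
    for line in handle:
        fields = line.split()  # splits on tabs or spaces
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        sites.append((fields[0], int(fields[1])))
    return sites

# The three example sites from the post, supplied as an in-memory file.
example = io.StringIO("Chr1 2673\nChr1 2695\nChr1 2696\n")
sites = read_positions(example)
```

For a real run, the in-memory example would be replaced by `open("SNP.012.pos")`.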
I would like to match these site locations (i.e., genome positions) to variant identifiers and genome annotations. I have had some success loading a gff3 file (e.g., a genome annotation downloaded from NCBI) and doing left/right joins in R, but this seems somewhat ad hoc. I tried Bioconductor packages (GenomicRanges, GenomicFeatures, biomaRt) but couldn't find an efficient, established best practice. FYI, I prefer working in R/Bioconductor.
Thanks!
I had to analyze the genotype matrix ("012" format) in R and identify "important" SNPs. I feel there must be a straightforward way of going from a site location (genome position) to variant identifiers, gene IDs, and/or known annotations. In other words, given a list of site locations (like Chr1 2673), what's the best way of getting annotations from RefSeq, Ensembl, and similar sources (downloaded in gff3 or gtf format, or accessed via an API)? Any help would be appreciated!
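To make the question concrete, the kind of lookup I mean can be sketched as follows (Python, standard library only). This is only an illustration under stated assumptions, not a recommended pipeline: the helper names `load_genes` and `annotate` are hypothetical, the two gene records are toy data invented for the example, and the code assumes standard nine-column, tab-separated GFF3 with 1-based inclusive coordinates and an `ID` attribute on each gene.

```python
import io
from collections import defaultdict

def load_genes(gff_handle, feature_type="gene"):
    """Collect (start, end, ID) per chromosome for one GFF3 feature type."""
    genes = defaultdict(list)
    for line in gff_handle:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments, and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature_type:
            continue
        # Column 9 holds semicolon-separated key=value attributes.
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        genes[cols[0]].append((int(cols[3]), int(cols[4]), attrs.get("ID", "")))
    return genes

def annotate(sites, genes):
    """Attach the IDs of all genes whose interval contains each site."""
    annotated = []
    for chrom, pos in sites:
        hits = [gid for start, end, gid in genes.get(chrom, [])
                if start <= pos <= end]  # GFF3 intervals are inclusive
        annotated.append((chrom, pos, hits))
    return annotated

# Toy annotation (invented for illustration, not real gene models).
toy_gff = io.StringIO(
    "##gff-version 3\n"
    "Chr1\ttoy\tgene\t2000\t2690\t.\t+\t.\tID=geneA;Name=A\n"
    "Chr1\ttoy\tgene\t2696\t3000\t.\t-\t.\tID=geneB\n"
)
sites = [("Chr1", 2673), ("Chr1", 2695), ("Chr1", 2696)]
annotated = annotate(sites, load_genes(toy_gff))
```

The linear scan per site is fine for a toy case but would be slow genome-wide; an interval index (or an overlap query in GenomicRanges) would presumably be the scalable equivalent. One caveat: the sequence names in the annotation must match those in the VCF, and NCBI gff3 files typically use accession-style sequence IDs rather than names like Chr1, so a renaming step may be needed first.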
Thanks for the great suggestions. I will look further into Annovar and SnpEff.