I know this question has sort of been asked before, but I need to know the most efficient way to get rs numbers based on position (hg19).
I've considered looping locally through two files, the .txt file (with the positions) and a .vcf file with all known variants from the Kaviar Genomic Variant Database, but that would take forever.
Would installing a partial UCSC genome MySQL database locally be a better idea?
Any suggestions would be great. Please be as detailed as possible :).
PS: This .txt file is an output from METAL, and unfortunately I need all 6.4M SNPs for my project at this point.
I have the following questions:
Well, I only received the METAL output TXT file, so I don't actually know how it works, but it has the chromosomal position and p-value, which are what matter to me. Usually I would filter by p-value, but for this particular project I can't. And no, nothing has been annotated.
Thanks for the reply. From the link you provided, it seems rs numbers are given in the first column of the file METAANALYSIS1.TBL (section 5.5). Could you please post the first few lines of the output text here?
In this case, instead of the rs numbers I was given chr:pos as the SnpID. I have the positions, but what I'm missing are the rs numbers. Sorry for the confusion, and thanks for your help!
First 10 lines:
The following is example code (on Linux):
awk '{print $16,$17,$17, $2,$3}' chr5.txt > chr5.bed
tail -n +2 chr5.bed > chr5_1.bed
Then intersect the chr5_1.bed file with chr5.dbsnp141.hg19.vcf. (Note that no output file is supplied; you can redirect the output to a file to save the results.) Command:
bedtools intersect -a chr5_1.bed -b chr5.dbsnp141.vcf -wb
Output is given below:
Original records I started with:
Yes, thank you so much for your help! This was exactly what I was looking for.
dbSNP files are huge, and intersecting two big files takes considerable time. If you have time and are familiar with R, try data.table. Its authors claim it is the fastest for intersections/overlaps. I would suggest the recently implemented foverlaps function in the data.table package.
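If only exact SNP positions matter, one possible shortcut is a hash lookup on chrom:pos instead of an interval intersection: load the positions into memory once, then stream the VCF a single time. Below is a minimal sketch with awk. The file names (positions.tsv, toy.vcf, annotated.tsv) and the three-column input layout are made up for illustration, and since the VCF may or may not carry a "chr" prefix, the prefix is stripped on both sides. The toy data is only there so the snippet runs by itself.

```shell
# Toy inputs (hypothetical names and values); replace with the real files.
# positions.tsv: chrom, 1-based position, p-value
printf 'chr5\t100\t0.01\nchr5\t200\t0.5\n' > positions.tsv
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\n5\t100\trs111\tA\tG\n5\t150\trs222\tC\tT\n' > toy.vcf

# First file (NR == FNR): store the p-value under the key chrom:pos.
# Second file: skip VCF header lines, then print chrom, pos, rsID and
# the stored p-value for every VCF record whose chrom:pos was stored.
awk 'BEGIN { FS = OFS = "\t" }
     NR == FNR { c = $1; sub(/^chr/, "", c); want[c ":" $2] = $3; next }
     /^#/      { next }
     { c = $1; sub(/^chr/, "", c); key = c ":" $2
       if (key in want) print "chr" c, $2, $3, want[key] }' positions.tsv toy.vcf > annotated.tsv

cat annotated.tsv
```

Neither input has to be sorted for this. The lookup table for 6.4M positions has to fit in memory, and this sketch has not been benchmarked against bedtools or foverlaps.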
I am sorry for reviving this inactive thread. I would like to convert CHR/POS to rsIDs in a GWAS summary statistics file, and I found the required tools and commands here. I tried the following command:
intersectBed -sorted -a sumstats.bed -b 00-All.vcf -wb
Unfortunately, intersecting the GRCh37 (hg19) reference file with the summary statistics file resulted in several duplicate positions (mainly at insertions and deletions, as well as at positions that differ by a single base pair). Using different parameters such as -wa or -wa -wb also did not help. Please let me know how to fix this.
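Duplicates like these likely arise because dbSNP can list several records at the same position, and because an indel record covers more than one base in the VCF, so it also overlaps neighbouring SNP positions. One possible way to clean this up is to post-filter the intersect output. The sketch below assumes a 3-column BED followed by the VCF columns in the -wb output (so VCF POS, ID, REF and ALT are columns 5 to 8); adjust the column numbers if your BED carries more fields. The file names and toy rows are made up for illustration.

```shell
# Toy -wb output (hypothetical): BED chrom/start/end, then VCF chrom/POS/ID/REF/ALT.
printf '5\t99\t100\t5\t100\trs111\tA\tG\n'   >  intersect_out.tsv
printf '5\t99\t100\t5\t100\trs999\tAT\tA\n'  >> intersect_out.tsv   # deletion at the same position
printf '5\t199\t200\t5\t198\trs333\tCTG\tC\n' >> intersect_out.tsv  # deletion spanning the SNP
printf '5\t199\t200\t5\t200\trs444\tC\tG,T\n' >> intersect_out.tsv

# Keep a row only if:
#   the VCF POS equals the BED end (same 1-based position),
#   REF is a single base and every ALT allele is a single base,
#   and no earlier row was already kept for this chrom:pos.
awk 'BEGIN { FS = OFS = "\t" }
     $5 == $3 && length($7) == 1 && $8 ~ /^[ACGT](,[ACGT])*$/ && !seen[$1 ":" $3]++' \
    intersect_out.tsv > snv_only.tsv

cat snv_only.tsv
```

Keeping the first surviving record per position is an arbitrary choice. If the summary statistics file carries the alleles, matching REF/ALT against them would be a more reliable way to pick the right rsID.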