Question

Get Rs Number Based On Position (6 million SNPs)

3

Entering edit mode

9.2 years ago

joshf ▴ 50

I know this question has sort of been asked before....but I need to know which method would be the most efficient way to get the Rs numbers based on position (hg19)

I've considered looping through two files, the .txt file (with the positions) and a .vcf file with all known variants from Kaviar Genomic Variant Database, locally...but that would take forever...

would installing a partial UCSC genome MySQL database locally be a better idea?

Any suggestion would be great...be as detailed as possible please :).

PS: This .txt file is an output from METAL, and unfortunately I need all 6.4M SNPs for my project at this point

SNP • 8.7k views

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by joshf ▴ 50

1

Entering edit mode

I have following questions:

Do you have variants (~6 m SNVs) annotated with dbSNP?
What are the outputs supported by METAL and Would you please share a link to METAL?

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by cpad0112 21k

0

Entering edit mode

Well, I only received the METAL output TXT file so I don't actually know how it works, but it has the chromosomal position and p-value which are important to me. Usually I would have to filter by p-value, but for this particular project I can't....and no, nothing's been annotated

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by joshf ▴ 50

0

Entering edit mode

Thanks for the reply. From the link you have provided, it seems rsnumbers are provided in first column of file METAANALYSIS1.TBL (section 5.5). Could you please post first few lines of the output text here?

I guess you are looking for rs positions given rs id as rs ids are already present in output (as mentioned in the link provided above).
Your question seems to be other way round. Given a position, you are looking for rsid.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by cpad0112 21k

0

Entering edit mode

In this case instead of the rs numbers I was given chr:pos as the SnpID. I have the positions, but what I'm missing are the rs numbers. Sorry for the confusion, and thanks for your help!

First 10 lines:

SNPID Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P Direction HetISq HetChiSq HetDf HetPVal CHR BP
5:29439275 t c 0.3751 0.0198 0.3168 0.4092 0.0100 0.0213 0.6397 +---+--+-+-+- 0.0 8.190 12 0.7701 5 29439275
5:85928892 t c 0.0643 0.0036 0.0626 0.0782 0.0916 0.0548 0.09478 +????+??-?-?? 45.1 5.462 3 0.1409 5 85928892
11:79226604 t c 0.1580 0.0117 0.1103 0.1875 -0.0301 0.0270 0.2644 -+---+-+----? 0.0 7.652 11 0.7441 11 79226604
10:45271908 c g 0.1391 0.0082 0.1056 0.1652 -0.0076 0.0296 0.7965 -+-?-+++----? 0.0 7.870 10 0.6415 10 45271908
15:80233956 t c 0.2009 0.0172 0.1191 0.2224 -0.0030 0.0256 0.9057 ++++----++--? 0.0 6.415 11 0.8443 15 80233956
10:128341232 t c 0.4460 0.0196 0.4298 0.5305 -0.0120 0.0205 0.5564 +------++++-- 15.9 14.269 12 0.2839 10 128341232
8:13397768 a g 0.8479 0.0100 0.8204 0.8662 0.0050 0.0284 0.8615 +-+---++++++? 11.4 12.413 11 0.3334 8 13397768
7:156821965 a g 0.0357 0.0039 0.0310 0.0390 -0.0336 0.0915 0.7138 -?????????-?? 0.0 0.040 1 0.8417 7 156821965
16:82370655 a g 0.1119 0.0062 0.1036 0.1267 -0.0205 0.0351 0.5589 ----?++-+?+-? 0.0 7.649 9 0.5698 16 82370655

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by joshf ▴ 50

1

Entering edit mode

Your data is not sorted by chromosome and coordinates.
Here is the way I see a simple, but round about work:
- Data has around 17 columns. Extract chromosome number, coordinate, bases (ref and alt), in that order into a separate text file.
- Now you will have chromosome number, base position, two bases (ref and alt) in new file.
- Duplicate base position and paste it next to existing base position column. Now the file will have five columns: Chromosome, Base position, Base position, Ref base and alt base
- Download genome build specific dbSNP file.
- Install bedtools.
- Use intersect command. This would not only fetch the dbSNP IDs for your variants, it would fetch other important information as well from dbSNP.

Following is the example code (on linux):

I copy/pasted first three lines (variants on chr5, including header) from furnished example (above) and saved it as chr5.txt
Using awk I extracted chromosome, position, ref and alt alleles. In the process I duplicated position column twice. (this is to make a bed file). Command: awk '{print $16,$17,$17, $2,$3}' chr5.txt > chr5.bed
Made sure that each column is separated by a tab as the original columns are not tab separated
Deleted the header line. Command: tail -n +2 chr5.bed > chr5_1.bed
Used bedtools to intersect chr5_1.bed file with chr5.dbsnp141.hg19.vcf. (Note that output file is not supplied. User can provide an output file to save the results). Command: bedtools intersect -a chr5_1.bed -b chr5.dbsnp141.vcf -wb

Output is given below:

5    29439275    29439275    t    c    5    29439275    rs667647    T    C    .    .    RS=667647;RSPOS=29439275;RV;dbSNPBuildID=83;SSR=0;SAO=0;VP=0x050100000005170517000100;WGT=1;VC=SNV;SLO;ASP;VLD;G5A;G5;HD;GNO;KGPhase1;KGPROD;OTHERKG;PH3;CAF=[0.5037,0.4963];COMMON=1
5    85928892    85928892    t    c    5    85928892    rs113534962    C    T    .    .    RS=113534962;RSPOS=85928892;dbSNPBuildID=132;SSR=0;SAO=0;VP=0x050000000005140116000100;WGT=1;VC=SNV;ASP;VLD;GNO;KGPhase1;KGPROD;OTHERKG;CAF=[0.8898,0.1102];COMMON=1

Original records I started with:

SNPID Allele1 Allele2 Freq1 FreqSE MinFreq MaxFreq Effect StdErr P Direction HetISq HetChiSq HetDf HetPVal CHR BP
5:29439275 t c 0.3751 0.0198 0.3168 0.4092 0.0100 0.0213 0.6397 +---+--+-+-+- 0.0 8.190 12 0.7701 5 29439275
5:85928892 t c 0.0643 0.0036 0.0626 0.0782 0.0916 0.0548 0.09478 +????+??-?-?? 45.1 5.462 3 0.1409 5 85928892

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by cpad0112 21k

0

Entering edit mode

Yes, thank you so much for your help! This was exactly what I was looking for

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by joshf ▴ 50

1

Entering edit mode

dbSNP files are huge and it would take considerable time for intersecting two big files. If you have time and familiar with R, try data.table. Authors claim to it to be fastest in intersection/overlaps. I would suggest recently implemented foverlaps function in data.table package.

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by cpad0112 21k

0

Entering edit mode

I am sorry for initiating the inactive thread here. I would like to convert the CHR Pos to rsIDs in a GWAS summary statistic file. And, I just found the required tools and commands here. Tried with the following command.

**intersectBed -sorted -a sumstats.bed -b 00-All.vcf -wb**

But, unfortunately, merging with the reference genome GRCh37 (hg19) and summary stat file resulted in several duplicate positions (mainly in insertion and deletion regions as well as a singe base pair difference regions). Using different parameters such as -wa or -wa -wb also did not work. Please let me know how to fix this error.

enter image description here

ADD REPLY • link updated 2.2 years ago by Ram 44k • written 3.3 years ago by Helix ▴ 20

Ram · Answer 1 · 2015-10-03

0

Entering edit mode

9.2 years ago

Pierre Lindenbaum 164k

Download the dbsnp VCF and query with tabix.

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Pierre Lindenbaum 164k

0

Entering edit mode

could you elaborate a bit? how can I use tabix to get the rs numbers from the VCF file?

ADD REPLY • link 9.2 years ago by joshf ▴ 50

Ram · Answer 2 · 2015-10-03

0

Entering edit mode

9.2 years ago

Veera ▴ 90

Download the coordinates BED file from UCSC according to your build.

Use GenomicRanges package from bioconductor in R to overlap the your position file and the downloaded bed file. Subset only the SNPs from overlapped region.

But you need to know to use GenomicRanges package for this. Its very fast and easy to do.

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 9.2 years ago by Veera ▴ 90

Ram · Answer 3 · 2017-04-24

0

Entering edit mode

7.6 years ago

ShirleyDai ▴ 50

Hello,

I just saw your output from METAL has the column "CHR" and "BP".

Can you post you metal script as how to add these two variables to the result file. Because I have searched the whole website but cannot find a solution, so I came here to ask if you can help...

Thanks
Shirley

ADD COMMENT • link updated 2.2 years ago by Ram 44k • written 7.6 years ago by ShirleyDai ▴ 50

score 0 · Answer 4 · 2017-04-27

0

Entering edit mode

7.6 years ago

Bioaln ▴ 360

To avoid the double looping, try and hash parts of the genome for easier access (key region, depends on your RAM, in limit there is a key->acc mapping). This can reduce the computational time significantly.

ADD COMMENT • link 7.6 years ago by Bioaln ▴ 360