Question

Getting the intersection of two dbSNP lists

9.9 years ago

devenvyas ▴ 760

I have two lists of rsIDs: one for an old chip I have data from (Illumina Human CNV370), and one for a new chip that I will be running sooner or later (Affymetrix Axiom Human Origins 1). I want to find out how much they overlap.

I was wondering, how could I go about figuring out how many dbSNP rsids are common between the two? I have tried Excel, but that was a dumb idea as it exceeded the amount of memory the program could use. Any suggestions on how to do this? Thanks!

snp • 2.2k views

updated 2.7 years ago by Ram 44k • written 9.9 years ago by devenvyas ▴ 760

Ram · Accepted Answer · 2015-01-16

9.9 years ago

Alex Reynolds 36k

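The following will count how many SNPs are in both files some_snps.txt and other_snps.txt, assuming each file has one rsID per line:

$ grep -Fxf some_snps.txt other_snps.txt | wc -l

The -f flag reads the search patterns from some_snps.txt, and -F treats each one as a literal string rather than a regular expression. The -x flag is what enforces a full match: a pattern only counts if it matches an entire line, so rs123 does not match rs1234.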

updated 2.7 years ago by Ram 44k • written 9.9 years ago by Alex Reynolds 36k

Thanks! I'll try it out sometime tomorrow (when I have access to a terminal). So to be clear, this will give me the number of rsIDs that are found in both lists (so the rsIDs found in only one list will not be counted), right?

Is there any way I can get an output listing specifically which rsIDs were found in both of the lists?

updated 2.7 years ago by Ram 44k • written 9.9 years ago by devenvyas ▴ 760

You need to read the manual on grep; use 'man grep'. It's quite long, so I'll summarize. grep -f searches the given file other_snps.txt for every entry in some_snps.txt and reports the hits. Piping to wc (word count) with -l (lowercase L) counts the lines reported. These nuances matter because if your search list has "rs123", grep will also find it in the lines "rs1234" and "rs123100", seriously inflating the results. The capital -F flag makes grep treat each entry as a fixed string rather than a regular expression, but on its own it still matches substrings; add -x so that an entry only matches an entire line.
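To make the inflation problem concrete, here is a minimal sketch on toy data. The file contents and the temporary directory are made up for illustration, and grep -c is used to count matching lines, which gives the same number as piping to wc -l.

```shell
# Toy inputs (hypothetical IDs), one rsID per line, in a scratch directory
workdir=$(mktemp -d)
printf 'rs123\nrs456\n' > "$workdir/some_snps.txt"
printf 'rs1234\nrs123100\nrs456\nrs789\n' > "$workdir/other_snps.txt"

# Substring matching: rs123 also hits rs1234 and rs123100
loose=$(grep -cFf "$workdir/some_snps.txt" "$workdir/other_snps.txt")

# Whole-line matching (-x): only rs456 is genuinely shared
exact=$(grep -cFxf "$workdir/some_snps.txt" "$workdir/other_snps.txt")

echo "substring matches: $loose, exact matches: $exact"
# prints: substring matches: 3, exact matches: 1
```

If the two numbers differ on your real files, the difference is the count of false hits from partial matches.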

updated 2.7 years ago by Ram 44k • written 9.9 years ago by karl.stamm 4.1k

If you want the list, take out the | wc -l portion that counts the number of lines coming out of the grep statement. And thanks to Karl for pointing out the missing flag in my answer.
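As a rough sketch of saving that list to a file: the example below uses toy data and hypothetical output names (some_snps.clean.txt, shared_snps.txt). It also assumes one of the lists might have been exported from Excel with Windows line endings, which would otherwise stop whole-line matching from finding anything, so carriage returns are stripped first. The -x flag is added here to require whole-line matches.

```shell
# Toy inputs (hypothetical IDs) in a scratch directory
workdir=$(mktemp -d)
# First list has Windows (CRLF) line endings, as a spreadsheet export might
printf 'rs123\r\nrs456\r\nrs789\r\n' > "$workdir/some_snps.txt"
printf 'rs456\nrs789\nrs999\n' > "$workdir/other_snps.txt"

# Strip carriage returns so they do not defeat whole-line matching
tr -d '\r' < "$workdir/some_snps.txt" > "$workdir/some_snps.clean.txt"

# Write the shared rsIDs to a file instead of counting them
grep -Fxf "$workdir/some_snps.clean.txt" "$workdir/other_snps.txt" \
    > "$workdir/shared_snps.txt"

cat "$workdir/shared_snps.txt"
# prints:
# rs456
# rs789
```

The order of the output follows other_snps.txt, since that is the file grep is scanning.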

updated 2.7 years ago by Ram 44k • written 9.9 years ago by Alex Reynolds 36k

Ram · Accepted Answer · 2015-01-17

9.9 years ago

Floris Brenk ★ 1.0k

Another option is comm (when using a Linux terminal). First sort both lists:

sort listA.txt > sorted_listA.txt
sort listB.txt > sorted_listB.txt

then

comm -12 sorted_listA.txt sorted_listB.txt > common_snps.txt
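A slightly more defensive variant of the same idea, sketched on toy data: comm expects both inputs sorted in the same collation order, so this version pins the locale with LC_ALL=C and uses sort -u to drop duplicate IDs. The only_in_A.txt and only_in_B.txt outputs are additions for illustration, not part of the original answer.

```shell
# Toy inputs (hypothetical IDs); listA contains a duplicate
workdir=$(mktemp -d)
printf 'rs456\nrs123\nrs456\nrs789\n' > "$workdir/listA.txt"
printf 'rs789\nrs999\nrs456\n' > "$workdir/listB.txt"

# Sort both lists with one fixed collation and remove duplicates
LC_ALL=C sort -u "$workdir/listA.txt" > "$workdir/sorted_listA.txt"
LC_ALL=C sort -u "$workdir/listB.txt" > "$workdir/sorted_listB.txt"

# -12 hides columns 1 and 2, leaving only lines present in both files
LC_ALL=C comm -12 "$workdir/sorted_listA.txt" "$workdir/sorted_listB.txt" \
    > "$workdir/common_snps.txt"

# -23 keeps lines unique to A; -13 keeps lines unique to B
LC_ALL=C comm -23 "$workdir/sorted_listA.txt" "$workdir/sorted_listB.txt" \
    > "$workdir/only_in_A.txt"
LC_ALL=C comm -13 "$workdir/sorted_listA.txt" "$workdir/sorted_listB.txt" \
    > "$workdir/only_in_B.txt"

cat "$workdir/common_snps.txt"
# prints:
# rs456
# rs789
```

Running wc -l on common_snps.txt then gives the size of the overlap directly.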

updated 2.7 years ago by Ram 44k • written 9.9 years ago by Floris Brenk ★ 1.0k