I have two lists of rsids: one for an old chip I have data from (Illumina Human CNV370), and one for a new chip that I will be running sooner or later (Affymetrix Axiom Human Origins 1). I want to find out how much they overlap.
I was wondering, how could I go about figuring out how many dbSNP rsids are common between the two? I have tried Excel, but that was a dumb idea as it exceeded the amount of memory the program could use. Any suggestions on how to do this? Thanks!
Thanks! I'll try it out sometime tomorrow (when I have access to a terminal). So to be clear, this will give me the number of rsids that are found in both lists (so the rsids found in only one list will not be counted), right?
Is there any way I can get an output listing specifically which rsids were found in both of the lists?
You need to read the manual on grep; use 'man grep'. It's quite long, so I'll summarize.
'grep -f' will search for all entries in some_snps.txt among the lines of the given file other_snps.txt, and report the hits. Piping the output to 'wc' (word count) with the '-l' flag counts the lines reported.
The flags matter: if your search list has "rs123", a plain pattern match will also find it in the lines "rs1234" and "rs123100", seriously inflating the results. The lowercase 'w' flag restricts hits to whole-word matches, which prevents this. There's another flag you can give grep, capital 'F', which treats each entry in the input list as a fixed string rather than as a regular expression.
If you want the list itself, take out the '| wc -l' portion that counts the number of lines coming out of the grep statement. And thanks to Karl for pointing out the missing flag in my answer.
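To make the substring problem concrete, here is a minimal sketch on two tiny stand-in lists. The rsids are made up, the output file name shared_snps.txt is hypothetical, and the '-x' flag (match the whole line) is my own addition; it assumes each file holds exactly one rsid per line with no trailing whitespace.

```shell
# Tiny stand-in lists (made-up rsids, one per line).
printf 'rs123\nrs456\nrs789\n' > some_snps.txt
printf 'rs1234\nrs456\nrs123100\nrs789\n' > other_snps.txt

# Plain -f: each entry is a pattern that may match anywhere in a line,
# so rs123 also hits rs1234 and rs123100.
naive=$(grep -f some_snps.txt other_snps.txt | wc -l | tr -d ' ')

# -F: entries are literal strings; -x: the whole line must equal the entry.
# Drop the redirect to print the shared rsids to the terminal instead.
grep -Fxf some_snps.txt other_snps.txt > shared_snps.txt
exact=$(wc -l < shared_snps.txt | tr -d ' ')

echo "naive=$naive exact=$exact"
```

On real data, swap in your own two files and skip the printf lines. If the lists have extra columns on each line, '-x' will be too strict and '-w' is probably the better fit.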