How To Analyse SNP Data From Different Sources?
13.7 years ago

I have 3 tables of SNPs and their associated diseases. How can I get some statistics on them: which SNPs appear in all three lists, which are unique to one list, and, if possible, how can I draw an Euler (Venn-style) diagram of the overlaps? Additionally, it would be good to find a tool for making graphs like this:

[image: example graph, not preserved]

snp comparison visualization • 3.3k views

Aside from the graphical representation, this looks to me like a very simple matching script. What exactly is keeping you from coding it? A better description of the input would probably help R experts suggest good visualization scripts for your data.

13.0 years ago
Rlong ▴ 340

To find which SNPs are common to all groups, you could use a tool like bedtools or joinx to intersect the lists, assuming you have them in BED or VCF files. This also assumes that all of your SNP lists use coordinates from the same reference.

To find the SNPs common to all three lists, try:

joinx intersect -a list1.bed -b list2.bed -o common_to_lists_1_2.bed 
joinx intersect -a list3.bed -b common_to_lists_1_2.bed -o intersection.bed

The resulting intersection.bed gives you the SNPs common to list1, list2, and list3. If you also want the total number of SNPs that recur in more than one list, run the two remaining pairwise intersections:

joinx intersect -a list1.bed -b list3.bed -o common_to_lists_1_3.bed
joinx intersect -a list2.bed -b list3.bed -o common_to_lists_2_3.bed

Now take the difference between each pairwise list (1-2, 1-3, and 2-3) and the three-way list, so that the three-way matches are not counted twice:

joinx intersect -a common_to_lists_1_2.bed -b intersection.bed --miss-a lists_1_2.bed
joinx intersect -a common_to_lists_1_3.bed -b intersection.bed --miss-a lists_1_3.bed
joinx intersect -a common_to_lists_2_3.bed -b intersection.bed --miss-a lists_2_3.bed

Your lists_*_*.bed files will then contain the SNPs that are shared by exactly two lists. To find the total number of SNPs found on more than one list:

wc -l lists_?_?.bed intersection.bed
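
If the lists are plain tables of SNP IDs rather than BED files, the same bookkeeping can be sketched with Python sets. This is only a minimal illustration, assuming each SNP is identified by a single ID such as an rsID; the function name `compare_three` and the toy IDs are made up for the example and are not real data. The seven region sizes it returns are also the numbers an Euler or Venn diagram of three lists needs.

```python
def compare_three(a, b, c):
    """Partition SNP IDs from three lists into the seven Venn/Euler regions."""
    a, b, c = set(a), set(b), set(c)
    return {
        "all_three": a & b & c,     # present in every list
        "only_1_2": (a & b) - c,    # shared by exactly lists 1 and 2
        "only_1_3": (a & c) - b,    # shared by exactly lists 1 and 3
        "only_2_3": (b & c) - a,    # shared by exactly lists 2 and 3
        "unique_1": a - b - c,      # found in list 1 only
        "unique_2": b - a - c,      # found in list 2 only
        "unique_3": c - a - b,      # found in list 3 only
    }


# Toy IDs for illustration only (hypothetical, not real results)
list1 = ["rs1", "rs2", "rs3", "rs4"]
list2 = ["rs2", "rs3", "rs5"]
list3 = ["rs3", "rs4", "rs6"]

regions = compare_three(list1, list2, list3)
for name, snps in regions.items():
    print(name, len(snps), sorted(snps))
```

Matching on IDs avoids the need for a shared coordinate system, but it will miss SNPs that were merged or renamed between database releases.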
13.0 years ago

Jorge offers a good point. On the other hand, LD (linkage disequilibrium) could be used to "match" different SNPs that tag each other. So, before you perform the simple match with a script, you should collect LD data on these SNPs, say pairs with r^2 of 0.90 or higher, and then run the match to see whether any SNPs with different IDs actually belong to the same block of tightly (genetically) linked SNPs.
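
One way this LD-aware match could be sketched in Python is shown here. It assumes you already have pairwise LD values as `(snp_a, snp_b, r2)` tuples from some external source; the function names, the tuple layout, and the toy values are all hypothetical. SNPs linked at or above the threshold are merged into clusters with a small union-find, and the lists are then compared by cluster rather than by ID.

```python
def build_ld_lookup(ld_pairs, r2_min=0.9):
    """Return a function mapping each SNP ID to its LD-cluster representative."""
    parent = {}

    def find(snp):
        parent.setdefault(snp, snp)  # unseen SNPs form their own cluster
        while parent[snp] != snp:
            parent[snp] = parent[parent[snp]]  # path halving
            snp = parent[snp]
        return snp

    for snp_a, snp_b, r2 in ld_pairs:
        if r2 >= r2_min:
            root_a, root_b = find(snp_a), find(snp_b)
            if root_a != root_b:
                # Keep the lexicographically smaller ID as the representative
                parent[max(root_a, root_b)] = min(root_a, root_b)
    return find


def to_clusters(snps, find):
    """Replace each SNP ID with the representative of its LD cluster."""
    return {find(s) for s in snps}


# Toy LD values and lists for illustration only (hypothetical)
ld_pairs = [("rs1", "rs9", 0.95), ("rs9", "rs7", 0.92), ("rs2", "rs8", 0.50)]
list1, list2, list3 = ["rs1", "rs2"], ["rs9", "rs3"], ["rs7", "rs2"]

find = build_ld_lookup(ld_pairs)
shared = to_clusters(list1, find) & to_clusters(list2, find) & to_clusters(list3, find)
print(sorted(shared))
```

Note that this merges SNPs transitively (if A tags B and B tags C, all three end up in one cluster), which may be more permissive than a strict pairwise r^2 cutoff.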

Statistics on association with disease could be garnered from the GWAS catalog at www.genome.gov or by mining OMIM.
