6.1 years ago
bioinfo456
I'm interning on an Alzheimer's disease project where I've been asked to build a classification model for the disease. As of now, I have a dataset in which it's proven that rs IDs with a p-value less than 0.01 are sure to affect gene expression for the disease, and rs IDs with a p-value greater than 0.8 are considered healthy. So my question is: where can I find a dataset from which I can extract features like eQTLs, DNA stability and propensity value, and build a classification model using them? Any suggestions will be much appreciated. Thanks.
Have you tried GEO or dbGaP? Also, why are you using only a subset of IGAP? All of the IGAP summary stats are here: IGAP
Thank you. Could you please elaborate on this data? Are these the SNPs that influence Alzheimer's? If so, where can I find their sequences to extract features from them? Please help.
That kind of causal link is difficult to prove. Best you can hope for is a correlation between samples that have that variant and a particular diagnosis/marker for Alzheimer's.
Got you sir. Thanks a ton for your insight.
All SNPs in IGAP identified to date are "susceptibility loci", meaning they are most likely to be associated with a RISK of developing late-onset Alzheimer's. The link in my comment above gives you the data, i.e. the summary statistics for IGAP. IGAP was a big study in which many groups across Europe and the USA collaborated and shared summary stats to perform a meta-analysis. ADGC is one of those groups. The link you provided is only a subset of IGAP. I would suggest reading the IGAP paper (the link is provided by Ram below in his comments) to understand how the analysis was done and what conclusions the authors drew from it.
Try ADNI and AMP-AD.
ADNI requires registration, and I can't seem to find any SNP-related datasets in AMP-AD. Thanks for your time.
Sorry, what? How did you obtain these p-values? And how can a p-value > 0.8 mean anything in any statistical test?
The dataset is at the following link: https://www.niagads.org/igap-summary-statistics-adgc-only
It is the result obtained from a certain experiment, which is why they're able to say so. So yes, is there any dataset you're aware of that could be of help to me, please?
Can you please show me where your resource says that a p-value above a threshold signifies anything? A larger p-value only means one thing in statistics: "a result like this would be quite likely to show up by chance alone", which means "your results are not statistically significant". No inference can be made from such a p-value.
EDIT: The only mention I see is in the IGAP paper:
Is this what you're referring to? If it is, I can't make the connection between an r2 value, which is a measure of correlation, and a p-value threshold.
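To illustrate the point about large p-values, here is a minimal simulation sketch. It assumes a simple two-sided z-test in which the null hypothesis is true for every simulated "SNP" (no real effect anywhere); the function names are made up for this example and nothing here comes from IGAP. Under that assumption the p-values are roughly uniform between 0 and 1, so about one in five tests lands above 0.8 purely by construction.

```python
import math
import random


def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic z."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_null_pvalues(n, seed=0):
    """Draw n test statistics with NO true effect and return their p-values."""
    rng = random.Random(seed)
    return [two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(n)]


pvals = simulate_null_pvalues(100_000)

# Under the null, p-values are (approximately) uniform on [0, 1].
frac_large = sum(p > 0.8 for p in pvals) / len(pvals)
frac_small = sum(p < 0.01 for p in pvals) / len(pvals)

print(f"fraction with p > 0.8 : {frac_large:.3f}")
print(f"fraction with p < 0.01: {frac_small:.3f}")
```

The share above 0.8 comes out near 0.2 and the share below 0.01 near 0.01, which is what a uniform distribution predicts. A p-value above 0.8 is therefore just one of the ordinary outcomes when nothing is going on, not positive evidence that a variant is "healthy".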
I'll get back to you in a couple of days regarding this, because this is what I was told by my mentor. Meanwhile, as I understand it, I have a list of rs IDs that influence Alzheimer's. I need to extract certain features from them and build a classification model to classify whether a given rs ID falls within that class or not. How can I go ahead with this? Please help.
The terms you use and the approach you describe look a lot like machine learning. For classification, you'd need a well-annotated truth set for training. I'm not a machine learning expert; maybe someone else can help you with that.
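As a rough sketch of what assembling such a labelled set from the summary statistics might look like, the snippet below applies the two thresholds from the original question (0.01 and 0.8) to a tab-separated table. The column names `MarkerName` and `Pvalue`, the sample rows and the marker IDs are all assumptions made up for illustration, so check them against the header of the file you actually download. Note that the second label only records "no evidence of association"; as discussed above, it does not establish a true negative class.

```python
import csv
import io

# Made-up rows for illustration only; these are not real IGAP results.
SAMPLE = (
    "MarkerName\tBeta\tSE\tPvalue\n"
    "rs_example_1\t0.21\t0.05\t0.00003\n"
    "rs_example_2\t0.01\t0.04\t0.85\n"
    "rs_example_3\t0.06\t0.05\t0.23\n"
)


def label_snps(handle, low=0.01, high=0.8):
    """Map marker name -> label using the thresholds from the question.

    Markers with low <= p <= high are left out as unlabelled.
    """
    labels = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        p = float(row["Pvalue"])
        if p < low:
            labels[row["MarkerName"]] = "associated"
        elif p > high:
            labels[row["MarkerName"]] = "no_evidence"
    return labels


labels = label_snps(io.StringIO(SAMPLE))
print(labels)
```

For a real file you would pass an open file handle instead of the `io.StringIO` wrapper. The features themselves (eQTL status and so on) would still have to come from a separate annotation source and be joined on the marker name.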
Kindly refer to the paper at the above-mentioned link.
I think I can retrieve the data I need with the R package "rsnps". Could anybody tell me how I could select features for the classification, please? Thanks for your time, all of you :).