I need to use SNP data for a prediction problem. I have around 2 million SNPs, so I need to pre-filter them before I can use them. I found two main tools: Tagger and SNAP. Unfortunately, both tools require selecting a population sample, and my data cover several populations (9 subpopulations in total). Is there any other tool that can select the most informative SNPs regardless of population characteristics?
But I'd like to use the data I already have (i.e. the SNP values for each individual in my dataset).
tagSNPs are used to select a limited number of markers that are still informative. What you are describing is not really finding tagSNPs, but prioritizing the SNP results you already have. So you have already genotyped your samples, and what you are asking is which SNPs to use in your study? If you don't want to deal with population and stratification issues, the only thing you can safely do is filter out monomorphic SNPs, as they won't be informative at all. Everything from then on, including LD patterns and of course allele frequencies, depends heavily on population information, so unless you explain why you want to "pre-filter" your data I can't see a benefit from it. Keep in mind that most tools that analyse high-throughput genotyping data, PLINK for instance, can load all the experiment data at once, and they are the ones that "pre-filter" your data if needed.
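To illustrate the one filter that does not depend on population, here is a minimal Python sketch. It assumes genotypes are coded as minor-allele counts (0, 1, 2) with `None` for missing calls, stored as a samples x SNPs matrix. The function name `polymorphic_snp_indices` and the toy data are made up for this example, and a real 2-million-SNP dataset would need a more memory-efficient representation.

```python
from typing import List, Optional, Sequence

# Assumed coding: minor-allele count 0, 1 or 2; None marks a missing call.
Genotype = Optional[int]


def polymorphic_snp_indices(genotypes: Sequence[Sequence[Genotype]]) -> List[int]:
    """Return the column indices of SNPs whose genotype varies across samples.

    `genotypes` is a samples x SNPs matrix (one row per individual).
    A column with a single observed genotype (which covers monomorphic SNPs)
    carries no information for classification and is dropped.
    """
    if not genotypes:
        return []
    n_snps = len(genotypes[0])
    keep = []
    for j in range(n_snps):
        # Collect the distinct non-missing genotypes seen at SNP j.
        observed = {row[j] for row in genotypes if row[j] is not None}
        if len(observed) > 1:
            keep.append(j)
    return keep


# Hypothetical toy data: 4 individuals x 4 SNPs.
toy = [
    [0, 2, 1, None],
    [0, 2, 0, 0],
    [0, 2, 2, 0],
    [0, 2, 1, 0],
]
kept = polymorphic_snp_indices(toy)
print(kept)
```

Only the third SNP varies in the toy matrix, so it is the only one kept. Anything stricter, such as a minor allele frequency cutoff or LD pruning, already depends on which populations the samples come from.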
I want to use SNPs as attributes for a classification problem (i.e. given an individual with known SNPs plus other information, predict whether they belong to a particular class, for example whether they are ill or not). Unfortunately I can't use all of the SNPs (there are almost 2 million), so I want to select the "most informative" ones. I've never used SNP data before, so sorry for any imprecision!
This sounds like an association study, and PLINK, as mentioned, can help you. What I don't understand is why you can't use all those SNPs: tools for PCA, association studies, and the other typical analyses performed on microarray data are definitely capable of handling large numbers of genotypes. You certainly have an idea of how to proceed, but it isn't clear to me.
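As a rough illustration of what a per-SNP association test does, here is a Python sketch of a basic 1-degree-of-freedom allelic chi-square test used to rank SNPs by case/control association. This is not PLINK's implementation. The function names, the 0/1/2 minor-allele coding, and the toy data are assumptions for the example, and the test ignores population stratification, which matters with 9 subpopulations.

```python
from typing import List, Sequence


def allelic_chi2(genos: Sequence[int], is_case: Sequence[bool]) -> float:
    """Allelic chi-square statistic for one SNP (genotypes are 0/1/2 counts)."""
    a = b = c = d = 0  # a, b: minor/major alleles in cases; c, d: in controls
    for g, case in zip(genos, is_case):
        if case:
            a += g
            b += 2 - g
        else:
            c += g
            d += 2 - g
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # An empty row or column (e.g. a monomorphic SNP): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def rank_snps(genotypes: Sequence[Sequence[int]], is_case: Sequence[bool]) -> List[int]:
    """Return SNP column indices sorted from strongest to weakest association."""
    n_snps = len(genotypes[0])
    scores = [
        allelic_chi2([row[j] for row in genotypes], is_case) for j in range(n_snps)
    ]
    return sorted(range(n_snps), key=lambda j: scores[j], reverse=True)


# Hypothetical toy data: 4 individuals x 4 SNPs, first two individuals are cases.
toy = [
    [2, 1, 0, 2],
    [2, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 0],
]
labels = [True, True, False, False]
ranking = rank_snps(toy, labels)
print(ranking)
```

If a ranking like this is used to pick attributes for a classifier, the selection should be repeated inside each cross-validation fold; otherwise the estimated prediction accuracy will be optimistic.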
Because using 2 million attributes is infeasible for any possible prediction algorithm!
That's why the programs that deal with this kind of data try to reduce the problem themselves. What I don't understand is why you want to do it yourself, unless you are trying to develop a new algorithm. In that case, again, this is not explained in your question.