Predicting Phenotype From SNP Data. Help!
4
5
14.4 years ago
Selflessgene ▴ 50

I'm not a bioinformatician by profession, but this is a field that I'd like to learn on my own time and, who knows, maybe eventually make some useful findings.

I currently have SNP data for a full genome on about 150 individuals, along with a quantitative description of a phenotype.

How would I go about predicting the phenotype for a completely new set of individuals (for whom I have complete SNP data), given the previous information?

I downloaded PLINK and I'm currently formatting my data using Python to agree with it. How can I use this (or another) tool to accomplish my goal? If the phenotype were, say, height, I'd want to know which of my new individuals would end up being tall, which would be short, etc. Ideally I'd want to rank them from tallest to shortest.

Links to explicit directions would be highly appreciated.

EDIT: Height was just an example of a quantitative/continuous phenotype. I'm not looking for height specifically.

snp plink • 8.3k views
3

good luck ! :-)

0

Are you looking at a GWAS experiment? Is 150 the total number of cases, controls, or both? What do you mean by 'new individual'?

0

Yes, this would be GWAS data. The original set of 150 aren't cases or controls as I understand the terms. Cases and controls apply to binary phenotypes, i.e. you either have the disease or you don't.

The phenotype I have is continuous, and could be any real number from, say, 0 to 10.

By "new individuals", I mean a new set of persons for whom we have full SNP data, but for whom we don't have information on the phenotype.

Can I use information gained from the original set to predict the phenotype in the "new individuals"?

Is this clear?

0

Sounds like an exciting project :) ! AFAIK, you should have cases and controls (not only from the perspective of diseases). For example, in the case of height you can have a set of cases (height x) and controls (height y) to derive a p-value for the genotype-phenotype association.

0

Sounds like an exciting project :) ! I think your question is in 2 parts. 1) You have genotype and phenotype information and 2) You need to analyze the association and use the information from the association for a prediction. Please let me know if I got it right or not ?

0

Yes, your interpretation is correct. Any suggestions on how to tackle part 1 or 2 in your comment? This is my first crack at any bioinformatics.

11
14.4 years ago
Neilfws 49k

In terms of software, I believe that the R package GenABEL is popular and used widely.

In terms of other things - one of the best ways to get started is to read about how other people perform this type of task. There are some excellent genetics blogs around; I particularly recommend Getting Genetics Done, which has articles tagged with GWAS. The new blog, Genomes Unzipped, also looks very good and has a recent post on how to interpret GWAS. Finally of course, you should read the literature. A good recent example is this PLoS Genetics paper (open access) out of the company 23andMe, which does a good job in describing the methods used.

GWAS is essentially a statistical problem, so perhaps it is worth outlining some very basic concepts. You have observations (traits or phenotypes) and a set of variables (markers or SNPs). You hypothesize that some of those markers explain the traits. Important things to remember are:

  • There are many more markers than traits (what statisticians call a "p >> n problem")
  • That means that some form of multiple testing correction is required
  • It also means that most markers do not affect the traits but may appear to do so (noise)
  • The markers themselves are unlikely to explain fully the traits
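To make the first three points concrete, here is a minimal sketch of a single-marker scan with a Bonferroni correction. It is not how PLINK or GenABEL work internally: the function names, the simulated data, the causal SNP index and the effect size are all assumptions made up for illustration, and the p-value uses a normal approximation to the t statistic so that only the Python standard library is needed.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:  # monomorphic SNP or constant trait
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value for r (normal approximation to the t statistic)."""
    if abs(r) >= 1.0:
        return 0.0
    t = r * math.sqrt((n - 2) / (1.0 - r * r))
    return math.erfc(abs(t) / math.sqrt(2.0))


def scan(snp_columns, phenotype, alpha=0.05):
    """Test every SNP separately; keep those passing the Bonferroni threshold."""
    threshold = alpha / len(snp_columns)  # one test per marker
    n = len(phenotype)
    hits = []
    for j, snp in enumerate(snp_columns):
        p = approx_p_value(pearson_r(snp, phenotype), n)
        if p < threshold:
            hits.append((j, p))
    return threshold, hits


# Simulated data (an assumption): 150 individuals, 2000 SNPs coded 0/1/2,
# with a single SNP that truly affects the trait.
rng = random.Random(0)
n_individuals, n_snps, causal_snp = 150, 2000, 7
snp_columns = [[rng.choice((0, 1, 2)) for _ in range(n_individuals)]
               for _ in range(n_snps)]
phenotype = [1.5 * snp_columns[causal_snp][i] + rng.gauss(0, 1)
             for i in range(n_individuals)]

threshold, hits = scan(snp_columns, phenotype)
print("Bonferroni threshold:", threshold)
print("SNPs passing the threshold:", [j for j, _ in hits])
```

With 2000 markers and alpha = 0.05, the per-marker threshold drops to 0.05 / 2000, which is why only strong effects survive in a sample of 150.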

On that last point - most traits are affected by environmental factors, and for many traits the environmental contribution is at least as large as the genetic one. Good statistical software should indicate how much of the variation in the trait the variables explain; whatever is "left over" is often assumed to be environmental - but this is often an educated guess.

If you keep these statistical ideas in mind (and are aware that there's a good deal of scepticism around as to the usefulness of GWAS), you won't go far wrong.

0

Thanks for the links to the blogs. Looks like they start to get into some technical detail on implementation which I like!

6
14.4 years ago
Allpowerde ★ 1.3k

This paper might help you. Though the paper itself is not about predicting phenotypes, (from memory) it should have some references that cover this:

Common SNPs explain a large proportion of the heritability for human height. Yang J, Benyamin B, McEvoy BP, Gordon S, Henders AK, Nyholt DR, Madden PA, Heath AC, Martin NG, Montgomery GW, Goddard ME, Visscher PM.

Nat Genet. 2010 Jul;42(7):565-9.

Otherwise Naomi Wray (same group) has looked into this.

Finally, the "Next-Gen GWAS" workshop at this year's ICSB (15th Oct. 2010) will cover this issue by showcasing some new machine learning methods (e.g. multi-instance learning) that should be able to deal with this task effectively.

0

The conference looks like it will address some relevant issues. Can't make it to Australia though!

0

Is 150 a big enough sample though?

0

@Deepak's point is very important. But the required size largely depends on the question he is trying to answer with the association study. If it is a very specific study on a specific population, 150 seems to be a good number. The GWAS catalog reports a wide range of sample sizes: http://www.genome.gov/26525384

0

I would be suspicious of any study with < 1000 individuals. So would this author.

0

Also, I think the GWAS workshop mentioned above is in Edinburgh, Scotland, despite the Australian organising team.

1
14.4 years ago

The answers by Neil & allPowerde provide very detailed information on the GWAS analysis.

For the prediction part, 23andMe and other PGx companies use GWAS data to associate genotype with phenotype. But in such a predictive context, reproducibility, sample size and case-control information are very important. For example, 23andMe uses SNPs from GWAS with at least 750 cases and appropriately chosen controls (please read the "How It Works" section at 23andMe for more details). Since you don't have a case-control model or any information about data reproducibility, I am afraid the predictions may not give significant results.

0

The question specified that he's trying to predict a quantitative phenotype. This means that he's performing a form of regression (predicting a real-valued number) not a classification task. A case-control design is not relevant here.
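As a rough illustration of that regression framing, here is a minimal sketch that fits ridge regression on a training set and ranks new individuals by predicted phenotype. Ridge is only one of several possible choices and is my assumption, not something named in this thread. The function names, the penalty `lam=10.0`, the causal SNP indices and the simulated data are all hypothetical, and in practice the penalty would be tuned, for example by cross-validation. The dual form is used because it only needs an n x n solve, which suits the case of many more SNPs than individuals.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        rest = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def fit_ridge(genotypes, phenotype, lam):
    """Ridge regression in dual form: w = X^T (X X^T + lam I)^-1 y."""
    n, n_snps = len(genotypes), len(genotypes[0])
    snp_means = [sum(row[j] for row in genotypes) / n for j in range(n_snps)]
    y_mean = sum(phenotype) / n
    x = [[row[j] - snp_means[j] for j in range(n_snps)] for row in genotypes]
    y = [v - y_mean for v in phenotype]
    # Kernel matrix K = X X^T, with the penalty added on the diagonal.
    k = [[sum(p * q for p, q in zip(x[i], x[j])) for j in range(n)]
         for i in range(n)]
    for i in range(n):
        k[i][i] += lam
    alpha = solve_linear_system(k, y)
    # Convert back to one weight per SNP: w = X^T alpha.
    weights = [sum(alpha[i] * x[i][j] for i in range(n)) for j in range(n_snps)]
    return snp_means, y_mean, weights


def predict(model, genotypes):
    """Predicted phenotype for each row of 0/1/2 allele counts."""
    snp_means, y_mean, weights = model
    return [y_mean + sum(w * (g - mu) for w, g, mu in zip(weights, row, snp_means))
            for row in genotypes]


# Simulated data (an assumption): 5 causal SNPs out of 150, plus noise.
rng = random.Random(1)
n_train, n_new, n_snps = 100, 30, 150
causal_effects = {3: 1.0, 40: -1.0, 77: 1.0, 101: 1.0, 140: -1.0}


def simulate(n):
    g = [[rng.choice((0, 1, 2)) for _ in range(n_snps)] for _ in range(n)]
    y = [sum(b * row[j] for j, b in causal_effects.items()) + rng.gauss(0, 0.5)
         for row in g]
    return g, y


train_g, train_y = simulate(n_train)
new_g, new_y = simulate(n_new)  # new_y is known only because this is simulated

model = fit_ridge(train_g, train_y, lam=10.0)
predicted = predict(model, new_g)
# Indices of the new individuals, from highest to lowest predicted phenotype.
ranking = sorted(range(n_new), key=lambda i: predicted[i], reverse=True)
print("Top five new individuals:", ranking[:5])
```

On real data the true phenotypes of the new individuals are unknown, so any estimate of accuracy would have to come from held-out individuals within the original 150.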

0

Thanks for the point about regression; you may add this as a separate answer. I never mentioned the notion of classification in my discussion, and I am well aware that his problem is not a mere classification task. I suggested 23andMe as an example so that he can start from there (given that he mentioned he is new to bioinformatics).

0
14.3 years ago
Bi Yong • 0

Predicting Unobserved Phenotypes for Complex Traits from Whole-Genome SNP Data. PLoS Genet, Vol. 4, No. 10.

It should give an idea of how to predict phenotypes for new individuals from SNP data for complex traits. I hope it helps.
