How To Calculate A Sample Size
12.9 years ago
Rnda ▴ 10

This is my first time doing this, so it's a little primitive, but I want to know...

We are conducting an experiment: we will pick a sample of normal individuals, have them swallow a single pill, and then run a urine analysis to determine the concentration of a specific substance. We will then sequence a specific gene for a metabolizing enzyme responsible for the clearance of that pill and correlate the two results. We are not assuming any allele frequency or particular haplotype.

First: how can I calculate a sample size? Is a "power sample size" applicable here, given that I have no previous hypothesis to presume and no previous data to count on?

Second: what is the appropriate software to handle the sequencing results for that purpose?

12.9 years ago

Absolutely you should do a "power analysis for sample size" before an experiment.

In order to do so, however, you do need a hypothesis. It sounds like you do have one -- you are hypothesizing there will be a difference in urine metabolite excreted by individuals that have certain sequence variants in your gene of interest compared to individuals who do not have those variants. The question is -- how much difference are you expecting between the groups, and what is the standard deviation of your test? This will determine what sample size you will need in order to detect the difference between the groups. If the expected difference is small -- but the standard deviation of your test is large, you will need more subjects to show the differences observed are not based upon chance. If the expected difference is large but the standard deviation of your test is small, you will need fewer subjects.
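To make that trade-off concrete, here is a rough sketch in Python using only the standard library. It assumes the simplest possible design, which may not match the real study: two groups of equal size (say, carriers and non-carriers of one variant), a two-sided comparison of mean concentration, and the usual normal approximation. The function name `n_per_group` and the example numbers are made up for illustration; `delta` is the difference you hope to detect and `sd` is the standard deviation of the measurement, in the same units.

```python
import math
from statistics import NormalDist


def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to compare two means.

    Normal approximation for a two-sided test with equal group sizes:
        n = 2 * ((z_{1 - alpha/2} + z_{power}) * sd / delta) ** 2
    """
    if delta <= 0 or sd <= 0:
        raise ValueError("delta and sd must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)


# Hypothetical scenarios: expected difference expressed in units of one SD.
for delta in (0.5, 1.0, 2.0):
    print(f"difference = {delta} SD -> about {n_per_group(delta, sd=1.0)} per group")
```

With these defaults, halving the expected difference roughly quadruples the number of subjects. The normal approximation runs slightly below an exact t-test calculation when groups are small, so treat the result as a lower bound. R's `power.t.test` does the exact version of the same calculation.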

This is where you have to hazard a guess, and find a middle ground where you think you will be likely to detect real differences. Do you have any pilot data to draw from? Similar metabolic effects from similar compounds on other genes?

You can read a little more about this here, and there are other sites too. R has an easy way to calculate sample size, and any introductory R book that discusses power analysis for sample size will show you how.


OK, it sounds like I have to learn R. I've just started.

I have to correlate the presence or absence of more than 30 haplotypes, and their corresponding phenotypes, with my results, i.e. the phenotype actually obtained from the urine, divided into 3 categories. So I have two categorical variables here; is it a chi-square test then, or what? What do you mean by the standard deviation of my test? And how is the test going to help me in the sample size calculation?

12.8 years ago
Neilfws 49k

My advice, before doing anything else is: (1) think very hard about what your final data are going to look like, (2) try to determine which statistical tests are appropriate to your data and (3) if unsure, seek professional statistical advice from colleagues. Judging by your comment above, you would benefit from (3).

For example, calculation of appropriate sample size using statistical power may or may not be appropriate to your situation. Do you have a null hypothesis? Such as: I expect no significant difference in urinary metabolite concentration between groups A and B (where groups A and B are defined by a simple metric). If so, then determining the appropriate sample size for a t-test is useful; otherwise it is not.

It seems to me that you have a multivariate problem, to which simple analyses such as t-tests or chi-squared tests are not applicable. There are 30+ haplotypes (or phenotypes), which you hypothesize will have an effect on metabolite concentration. So your data are going to look something like this:

    v1    v2   v3    v4   ..  v30    C
n1  nv11  nv12 nv13  nv14 ..  nv130  n1C
n2  nv21  nv22 nv23  nv24 ..  nv230  n2C
n3  nv31  nv32 nv33  nv34 ..  nv330  n3C
n4  nv41  nv42 nv43  nv44 ..  nv430  n4C
..

Where v1, v2...v30 are the haplotypes/phenotypes; n1, n2... are the subjects (people); nv is the observation of variable v for subject n and in column C, the concentration measurement for each subject. If you were using R, you would create a data frame or matrix to represent the data as shown above. You would then explore the data in various ways, with the aim of understanding how variables v1..v30 contribute to the outcome, C.
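As a rough illustration of that layout and of a first exploratory pass, here is a small sketch in plain Python (an R data frame would serve the same purpose). Everything in it is hypothetical: the three haplotype columns stand in for the 30+, the 0/1 coding for absence/presence is an assumption, the concentration values are invented toy numbers, and `carrier_summary` is just an illustrative helper name.

```python
from statistics import mean

# Column names for the haplotype indicators (the real study would have 30+).
haplotypes = ["v1", "v2", "v3"]

# One row per subject: 1 = haplotype present, 0 = absent; C = concentration.
# These values are invented purely to show the shape of the table.
subjects = [
    {"id": "n1", "v1": 1, "v2": 0, "v3": 0, "C": 4.0},
    {"id": "n2", "v1": 1, "v2": 1, "v3": 0, "C": 6.0},
    {"id": "n3", "v1": 0, "v2": 1, "v3": 1, "C": 10.0},
    {"id": "n4", "v1": 0, "v2": 0, "v3": 1, "C": 12.0},
]


def carrier_summary(rows, hap):
    """Group sizes and mean concentration for carriers vs non-carriers of one haplotype."""
    carriers = [r["C"] for r in rows if r[hap] == 1]
    others = [r["C"] for r in rows if r[hap] == 0]
    return {
        "n_carriers": len(carriers),
        "n_others": len(others),
        "mean_carriers": mean(carriers) if carriers else None,
        "mean_others": mean(others) if others else None,
    }


# A first look at each variable in turn; this is description, not a test.
for hap in haplotypes:
    print(hap, carrier_summary(subjects, hap))
```

Looking at one column at a time like this is only a starting point. With 30+ variables, running a separate test per column raises multiple-comparison problems, which is part of why a multivariate approach and a statistician's input are worth it.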

If these things mean little to you, again I strongly suggest that you seek advice from your local friendly statistician and spend time thinking about the structure of your data and appropriate methods.


Very nice answer, @neilfws, and +1 for recommending a consult with a friendly neighborhood statistician!


This helps a lot; the layout Neilfws suggested is very convenient. As you said, I will ask for more help with the statistics. At least that will get me out of my room, lol. I have another concern besides the data presentation, because as I'm learning R I think I might be able to do it the way you suggested. The issue is how I can make the best use of the sequencing data that will be available soon: constructing the haplotypes, comparing them, finding novel SNPs, and so on. What program should I start to learn along with R?
