Hi,
I am fairly new to the field and would like to know two things.
First, has anyone worked with PLINK for association studies in bacteria? The case is that I have a gene presence/absence table and want to assess whether any of those genes is significantly associated with a particular phenotype. Is this possible with PLINK? Second, I actually saw someone do it, and I would like to understand the rationale behind the formatting of the .ped and .map files as well as the analysis.
As far as I remember, the affected (case) and unaffected (control) groups are my bacterial phenotypes, but there's more than that. I think there are some columns to add to those files.
If someone has more experience, please let me know.
Not sure if this is the appropriate place to ask this. If not, my apologies.
Thanks for your reply, Kevin. My data is a table with groups of bacteria in the rows and genes or gene families in the columns. When I mentioned phenotypes in the original question, I actually meant "taxa". So my idea is to use PLINK to show that certain genes are uniquely present in certain closely related groups of bacteria (say subspecies or strains), or that they are "associated" with a particular taxon. For example: species 1 has gene X, which is not present in species 2, 3, 4 and 5. I am guessing ploidy is a limitation that could be addressed by formatting the data table so that it resembles a diploid organism. Here is an example of the table I have.
https://drive.google.com/open?id=1Hzj26cT3rHHT5zTvegkTHN6dVmB7naVu
Converting to binary is a must, as far as I remember. After that I'm quite lost.
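For reference, here is a minimal sketch of how a binary presence/absence table could be written out in PLINK's text format. It relies on the documented layout: a .ped line has six leading columns (family ID, individual ID, father, mother, sex, phenotype) followed by two alleles per marker, and a .map line has four columns (chromosome, marker ID, genetic distance, base-pair position). Everything else is an assumption made for illustration: the toy isolates and genes, the "A"/"C" allele labels, the fake chromosome and positions, and the choice to encode each haploid call as a homozygous diploid genotype.

```python
# Hypothetical toy data: rows are isolates, columns are genes (1 = present, 0 = absent).
presence = {
    "isolate1": {"geneX": 1, "geneY": 0},
    "isolate2": {"geneX": 0, "geneY": 1},
}
# Hypothetical grouping: True = taxon of interest ("case"), False = everything else.
is_case = {"isolate1": True, "isolate2": False}
genes = ["geneX", "geneY"]

# Arbitrary allele labels (an assumption); "0" is avoided because it means missing.
PRESENT, ABSENT = "A", "C"


def to_ped_lines(presence, is_case, genes):
    """Build one .ped line per isolate, encoding each gene as a homozygous genotype."""
    lines = []
    for iid, calls in presence.items():
        pheno = "2" if is_case[iid] else "1"  # PLINK: 2 = affected, 1 = unaffected
        # FID IID PAT MAT SEX PHENO; parents and sex are set to 0 (unknown).
        # PLINK 1.x may ignore phenotypes of samples with unknown sex
        # unless --allow-no-sex is given.
        fields = [iid, iid, "0", "0", "0", pheno]
        for gene in genes:
            allele = PRESENT if calls[gene] else ABSENT
            fields += [allele, allele]  # duplicate the allele to mimic a diploid call
        lines.append(" ".join(fields))
    return lines


def to_map_lines(genes):
    """Build one .map line per gene: chromosome, marker ID, cM distance, bp position."""
    # Chromosome "1" and positions 1, 2, ... are placeholders, not real coordinates.
    return [f"1 {gene} 0 {pos}" for pos, gene in enumerate(genes, start=1)]


print("\n".join(to_ped_lines(presence, is_case, genes)))
print("\n".join(to_map_lines(genes)))
```

The marker order in the .map lines has to match the order of the genotype pairs in every .ped line, which is why both functions take the same `genes` list.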
Thanks for your help
I see. I am beginning to think that you should do this entirely outside of PLINK, for example by using some of the tests that I mentioned in my other thread. With those, you can see whether a gene is more frequent in a particular bacterium or taxon. What do you think?
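As a rough sketch of what such a test could look like, here is a two-sided Fisher's exact test on a 2x2 table of gene presence versus taxon membership, written with the standard library only. Fisher's test is my choice for illustration and is not necessarily one of the tests from the other thread; the function name and the toy counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a: isolates of the target taxon with the gene
    b: isolates of the target taxon without the gene
    c: other isolates with the gene
    d: other isolates without the gene
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x gene-positive target isolates.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as unlikely as the observed one.
    # The small tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Toy example: gene present in all 5 isolates of one taxon and in none of 5 others.
print(fisher_exact_two_sided(5, 0, 0, 5))
```

If you run this for every gene in the table, the resulting p-values would need a multiple-testing correction (for example Bonferroni or Benjamini-Hochberg) before calling anything significant.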
Another thing that you could do with your data is to define a gene signature that could be used as a sort of 'identifier' of the taxa that you are aiming to distinguish. For example, you could ultimately say that Gene1+Gene4+Gene7+Gene8 can statistically distinguish Taxa1 from Taxa2 (AUC, 0.95; cross-validated r^2, 0.6). If you want to learn more about that, you can take a look here: C: Resources for gene signature creation
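To make the signature idea a bit more concrete, here is a minimal sketch that scores each isolate by how many signature genes it carries and then computes the AUC for separating two taxa. This is a simplification chosen for illustration: the unweighted count, the function names and the toy isolates are all assumptions, and a real signature would normally come from a fitted and cross-validated model.

```python
def signature_score(calls, signature):
    """Count how many of the signature genes are present (1) in one isolate."""
    return sum(calls[gene] for gene in signature)


def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney formulation).
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical signature and toy presence/absence calls.
signature = ["Gene1", "Gene4", "Gene7", "Gene8"]
taxa1 = [
    {"Gene1": 1, "Gene4": 1, "Gene7": 1, "Gene8": 1},
    {"Gene1": 1, "Gene4": 1, "Gene7": 0, "Gene8": 1},
]
taxa2 = [
    {"Gene1": 0, "Gene4": 1, "Gene7": 0, "Gene8": 0},
    {"Gene1": 0, "Gene4": 0, "Gene7": 0, "Gene8": 0},
]

scores1 = [signature_score(calls, signature) for calls in taxa1]
scores2 = [signature_score(calls, signature) for calls in taxa2]
print(auc(scores1, scores2))
```

An AUC computed on the same isolates that were used to pick the genes will be optimistic, so it should be estimated on held-out data or by cross-validation.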
Not sure if that helps.
I will take a look at that. Maybe you are right and PLINK is not the most straightforward answer to this question. I'll post an update on my progress if necessary. Thanks again.
Okay - please come back when you have updated information.