I have performed an analysis that yields the probability that a diploid DNA sequence is inherited from species A (genotype = AA), inherited from species B (genotype = BB) or one copy from each species is inherited (genotype = AB). The analysis yields three vectors of probabilities ranging from 0 to 1. I'll paste the header below so you can see what the output data look like. I need to convert these probabilities into an assignment of 0, 1, or 2, where 0 = AA, 1 = AB, and 2 = BB.
Can anyone suggest a strategy that uses unix or R tools to convert the genotype probabilities to genotypes that uses all three columns of information? I have to convert 19 files of these into genotypes.
Thanks!!!
Example output:
snp probAncestry(1,1) probAncestry(1,2) probAncestry(2,2)
S1_11928 0.98880 0.01117 0.00003
S1_28339 0.99042 0.00956 0.00002
S1_30258 0.99061 0.00937 0.00002
S1_37984 0.99138 0.00860 0.00002
S1_38081 0.99139 0.00860 0.00002
S1_39977 0.99157 0.00841 0.00002
S1_39988 0.99157 0.00841 0.00002
edit: clarified example data as output
Which program did you use that produces this data? Obviously the probabilities that are close to 1 are the likely genotypes. For all of the ones that you've listed, the genotype relates to the first column (probAncestry(1,1). A threshold can be typically 0.90 or 0.95. Do you literally want a script that can parse this and make a call at each line?
PS - double posted: https://stats.stackexchange.com/questions/349738/convert-genotype-probabilities-0-1-into-genotypes-0-1-2
Is double posting against the rules?
It's not breaking the law or anything, but it's more about me bringing attention to others. If an answer was already provided on StackExchange, then no point posting here
RASPberry was the program I used. The output above is just the relevant columns I need, there are a few more columns in the output omitted.
Yes, I do need a script to take the probabilities and make a call for each locus. This individual is unadmixed, so the probabilies for every locus are high; later on in this file the probs = 1 for nearly every locus. However, admixed individuals will have probAncestry(1,2) > 0.5 at some loci & probAncestry(2,2) > 0.5 at other loci. I need genotypes of them.
My current thinking is to use nested ifelse statements, but I like the simplicity of zx8754's answer!
Where is the example input?
Oh yeh, s/he's showing the example output (?)
Not sure, if it is input or expected output. If it is input, then it is as simple as:
meaning AA
nice and simple. I can work that into apply (I think ;)
Yes, neat way of coding it, zx8754.
This solution works well for some loci, but here's an example where it fails. The output should be a 2, but it's a 1.
This kind of a distribution is unlikely to be observed from my data. Usually, if a locus has high probability of BB, then the AA probability is low (ie. ~ 0). But I assume it could happen.
What about just using 0.9 as cut-off? Literally scan over each data-point and, if >0.9, set to some value.
I would do this using AWK or tr command in BASH.