Hello all. I have some biological data that I am analyzing, and it requires me to work with multinomial distributions with large numbers (n, k > 20). By analogy, the sorts of problems I'm looking to answer are:
Imagine you have an urn with marbles of 50 different colors, each with a different relative frequency. You are only interested in 20 pre-specified colors of marbles. Imagine you draw 30 marbles from the urn. What is the probability that both (1) every marble you draw is one of the pre-specified colors and (2) you draw at least one marble of each pre-specified color?
I am computing this now by enumerating every possible way to draw 30 marbles such that only the 20 pre-specified colors appear and each of them appears at least once (>200,000,000 ways), calculating the probability of each way individually with a multinomial distribution, and adding all these probabilities up. This works but is extremely slow, and I can't use numbers any larger than these without using all the memory on my computer.
Is there any way to compute this more efficiently? In particular, I've wondered if you could calculate P(all marbles you draw are one of the pre-specified colors) with a binomial distribution and then somehow calculate P(you draw at least one marble of each color | all the marbles you draw are one of the pre-specified colors)... but I'm not sure how to calculate the latter probability.
This is, admittedly, a confusing thought experiment... let me know if anything about it could be clarified in a way that would help you understand the problem!
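For what it's worth, here is a rough sketch of one way the sum over all those multinomial terms might be collapsed without enumerating them. It assumes the draws are independent with fixed color probabilities (i.e. the multinomial setting described above), and it adds one color at a time, tracking how many of the draws have been assigned so far. The function name `prob_all_colors_seen` and its arguments are made up for illustration, and this is not necessarily the approach anyone else in the thread has in mind.

```python
from math import comb

def prob_all_colors_seen(probs, n):
    """Probability that n independent draws all land on the listed colors
    and every listed color shows up at least once.

    probs: probabilities of the pre-specified colors only (hypothetical input;
           they may sum to less than 1 because other colors exist).
    n:     number of draws.
    """
    # g[m] = total multinomial probability of m draws that use only the
    # colors processed so far, with each of those colors appearing >= 1 time.
    g = [1.0] + [0.0] * n  # no colors processed yet: only m = 0 is possible
    for p in probs:
        new = [0.0] * (n + 1)
        for m in range(1, n + 1):
            # choose which c of the m draws show the current color
            new[m] = sum(comb(m, c) * p**c * g[m - c] for c in range(1, m + 1))
        g = new
    return g[n]

# Toy check: two colors with probability 0.5 each, two draws.
print(prob_all_colors_seen([0.5, 0.5], 2))
```

The work grows roughly like (number of colors) x (number of draws)^2 rather than with the number of ways to split the draws, so 20 colors and 30 draws should be quick. Comparing it against the brute-force enumeration on a small case would be a sensible sanity check before trusting it on real data.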
That is incredibly helpful and makes sense, thank you! I'll figure out how to program this and make sure that it matches the results that I expect.
Thanks again!