Hi all,
I am trying to work out the correct statistical test in R for the following:
I have 1000000 sequences, each five base pairs long, e.g.:
ATTTG
ATGCG
ATTTT
GCCCT
.
.
.
1000000 items.
There are also 12 categories into which these sequences can be divided. So, say, 40000 out of the 1000000 sequences belong to category 1, and so on.
Now I need a statistical test to determine how significant the presence of each nucleotide is at each position. To elaborate: say position 1 of category 1 is 30% A, 40% G, and so on. These percentages are calculated as:
(number of As at position 1)/1000000, etc. So, how significant is the proportion of A at position 1? Is there an over-representation of As?
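To make the per-position percentages concrete, here is a minimal sketch in base R using the four example sequences above. The variable names (`seqs`, `counts`, `props`) are placeholders for illustration, and the denominator is simply the number of sequences in whichever set is being counted.

```r
# Toy data: the four example sequences from the post (the real input would be the full set).
seqs  <- c("ATTTG", "ATGCG", "ATTTT", "GCCCT")
bases <- c("A", "C", "G", "T")

# One row per sequence, one column per position.
mat <- do.call(rbind, strsplit(seqs, ""))

# Count each base at each position; tabulate() keeps zero counts for absent bases.
counts <- sapply(seq_len(ncol(mat)), function(j) {
  tabulate(match(mat[, j], bases), nbins = length(bases))
})
dimnames(counts) <- list(bases, paste0("pos", seq_len(ncol(mat))))

# Proportions: divide by the number of sequences in the set.
props <- counts / nrow(mat)
print(props)
```

Running the same counting on the subset of rows that belong to one category gives the per-category proportions.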
I thought of two ways. One is to calculate the genome-wide average of A, T, G and C. In the hg38 reference genome, I have, say, 27% A, 21% G, etc. Taking category 1 (all 40000 sequences) and calculating %A at position 1 then clearly indicates an over-representation at that position. But I am skeptical about this method, mainly because it doesn't involve any statistical analysis. Can you suggest an alternative?
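One possible way to attach a p-value to the first approach is an exact binomial test of the observed count against the background frequency. The sketch below uses only the illustrative numbers from this post (30% A among 40000 sequences, 27% A as background); these are not real hg38 values, and the variable names are placeholders.

```r
# Illustrative numbers taken from the post, not measured values.
n_seqs <- 40000   # sequences in category 1
n_A    <- 12000   # 30% A at position 1
bg_A   <- 0.27    # assumed background frequency of A

# One-sided exact binomial test: is A more frequent than the background?
res <- binom.test(n_A, n_seqs, p = bg_A, alternative = "greater")
print(res$p.value)
print(res$estimate)  # observed proportion of A
```

With tens of thousands of sequences even small deviations will come out as significant, so it is probably worth reporting the size of the difference alongside the p-value. Testing every base at every position in every category also means many tests, which `p.adjust()` can correct for.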
Second, shuffle the nucleotides on a per-position basis: shuffle all the nucleotides at position 1, then at position 2, and so on, across all 1000000 sequences. Then pick 40000 sequences (for category 1) at random from this shuffled list and see what percentage of A I get. Similarly for category 2 and so on. Is this correct?
Any suggestions are welcome. Thanks in advance for reading such a long question.
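A rough sketch of the resampling idea for a single position is below. For one position taken on its own, shuffling the column and then sampling is the same as drawing a random subset of that column, so the sketch draws subsets directly. The data are simulated stand-ins, and every name here (`pos1`, `category`, `n_perm`) is a placeholder.

```r
set.seed(1)  # reproducibility of this toy example

# Simulated stand-in data: base at position 1 for every sequence, plus a category label.
n_total  <- 10000
pos1     <- sample(c("A", "C", "G", "T"), n_total, replace = TRUE)
category <- sample(1:12, n_total, replace = TRUE)

in_cat   <- category == 1
observed <- mean(pos1[in_cat] == "A")

# Null distribution: random subsets of the same size as category 1,
# drawn without replacement from the full set.
n_perm    <- 1000
null_frac <- replicate(n_perm, mean(sample(pos1, sum(in_cat)) == "A"))

# One-sided empirical p-value with the usual +1 correction.
p_emp <- (sum(null_frac >= observed) + 1) / (n_perm + 1)
print(p_emp)
```

Note that this compares a category against the full sequence set rather than against the genome background, so it answers a different question from the first approach. Since the subsets are drawn without replacement, the hypergeometric distribution (`phyper()`) should give the corresponding analytical answer without any resampling.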
Could you tell us why you're doing this please? This reads like creating a plot to represent motifs, like the one here:
Also, please provide feedback on your previous threads, including:
OK, please pardon my ignorance, but how do I give feedback on posts? Should I accept an answer?
By doing one (or more) of the following as appropriate.
If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they all work. These buttons can be found on the left edge of answers/comments.
Thanks for letting me know.