Hi, I want to know which repeat elements are statistically enriched in one sequence compared to a background sequence. How should I perform such a statistical calculation?
For the repeat data, I have the RepeatMasker annotation from UCSC in BED format.
For example, what should I do if I want to know the enrichment of the tandem repeat "(CAG)n"?
Thanks.
Assuming that your UCSC repeatmasker BED file looks like this:
#genoName genoStart genoEnd strand repName repClass repFamily
chr1 16777160 16777470 + AluSp SINE Alu
chr1 25165800 25166089 - AluY SINE Alu
chr1 33553606 33554646 + L2b LINE L2
chr1 50330063 50332153 + L1PA10 LINE L1
chr1 58720067 58720973 - L1PA2 LINE L1
chr1 75496180 75498100 + L1MB7 LINE L1
and you are interested in the repeat elements by family (such as Alu, L1, L2), you can view the problem as sampling repeat elements (those within your sequence) from all elements in the genome. The following steps should give you a measure of enrichment along with a p-value.
First, use BEDTools to retrieve all rep elements in your sequence from the UCSC BED file.
Then, for each rep element family you found in your seq, count
how often it appears in your seq = s
how often it appears in the genome = g
Then count
how many rep elements are in your seq in total = S
how many rep elements are in the genome in total = G
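As a rough illustration of the counting step, here is a minimal Python sketch. It assumes the seven-column layout shown above (repFamily in the last column) and reuses the six example rows as a stand-in for the whole genome. The function name `count_families` and the choice of the first three rows as "your seq" are made up for the example; in practice the second set of lines would come from the BEDTools step.

```python
from collections import Counter

def count_families(bed_lines, family_col=6):
    """Count repeat elements per family in whitespace-separated BED-like lines.

    family_col is the 0-based index of the repFamily column
    (an assumption based on the example table above).
    """
    counts = Counter()
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the header line
        counts[line.split()[family_col]] += 1
    return counts

# Toy "genome": the six example rows from the table above.
genome_lines = """\
#genoName genoStart genoEnd strand repName repClass repFamily
chr1 16777160 16777470 + AluSp SINE Alu
chr1 25165800 25166089 - AluY SINE Alu
chr1 33553606 33554646 + L2b LINE L2
chr1 50330063 50332153 + L1PA10 LINE L1
chr1 58720067 58720973 - L1PA2 LINE L1
chr1 75496180 75498100 + L1MB7 LINE L1
""".splitlines()

# Hypothetical "your seq": pretend only the first three elements overlap it.
seq_lines = genome_lines[:4]  # header + first three rows

genome_counts = count_families(genome_lines)  # g for each family
seq_counts = count_families(seq_lines)        # s for each family
G = sum(genome_counts.values())               # all rep elements in the genome
S = sum(seq_counts.values())                  # all rep elements in your seq

for family, s in seq_counts.items():
    print(family, "s =", s, "g =", genome_counts[family], "S =", S, "G =", G)
```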
Then,
f = s/S is the fraction of the element in your seq,
F = g/G is the fraction of the element in the genome, and
f/F is the enrichment.
To get a p-value for the enrichment, do a Fisher's exact test with s, g, S, and G.
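To make the calculation concrete, here is a small Python sketch of the enrichment and a one-sided Fisher's exact test. It rests on assumptions not spelled out above: your seq is treated as part of the genome, so the 2x2 table is (s, S-s) for your seq versus (g-s, (G-S)-(g-s)) for the rest of the genome, and only enrichment (not depletion) is tested. With those assumptions the one-sided Fisher p-value equals the upper tail of a hypergeometric distribution, which is what the code sums directly. The function names are made up for the example.

```python
from math import comb

def enrichment(s, S, g, G):
    """Fold enrichment f/F with f = s/S and F = g/G."""
    return (s * G) / (S * g)

def fisher_enrichment_pvalue(s, S, g, G):
    """One-sided (enrichment) Fisher's exact test p-value.

    Probability of seeing at least s elements of the family when S elements
    are drawn without replacement from G elements, g of which belong to the
    family. Assumes the S elements are a subset of the G elements.
    """
    total = comb(G, S)
    upper = min(g, S)
    tail = sum(comb(g, k) * comb(G - g, S - k) for k in range(s, upper + 1))
    return tail / total

# Toy numbers: 2 of 3 elements in the seq are Alu, 2 of 6 in the genome are Alu.
s, S, g, G = 2, 3, 2, 6
print("enrichment:", enrichment(s, S, g, G))
print("p-value:", fisher_enrichment_pvalue(s, S, g, G))
```

The exact binomial coefficients keep the sketch dependency-free but become slow for genome-scale counts; there, a library routine such as `scipy.stats.fisher_exact` with `alternative="greater"` on the same 2x2 table is the more practical choice.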
Thanks a lot. Is this a generally accepted method of calculation? I agree with your s and g, but I think S and G should be theoretical frequencies rather than counts of all repeat elements: assuming the length of the repeat is x, and the lengths of my sequence and the genome are m and n respectively, S = m/x and G = n/x. What do you think?
For which repeat elements are you looking? Microsatellites or transposable elements?
I just want to learn the statistical method for sequence enrichment analysis, so to keep it simple, what if I want to know the enrichment of the tandem repeat "(CAG)n", for example?