How To Find The Enriched Repeat Elements Between Two Sequences
12.7 years ago
Free Man ▴ 180

Hi, I want to know which repeat elements are statistically enriched in one sequence compared with a background sequence. How should I perform such a statistical calculation?
For the repeat data, I have the RepeatMasker annotations in BED format from UCSC.
For example, what should I do if I want to know the enrichment of the tandem repeat "(CAG)n"?
Thanks.

repeats sequence enrichment • 5.0k views

For which repeat elements are you looking? Microsatellites or transposable elements?


I just want to learn the statistical method for sequence enrichment analysis. To keep it simple, what if I want to test the tandem repeat "(CAG)n", for example?

12.7 years ago
Christof Winter ★ 1.0k

Assuming that your UCSC repeatmasker BED file looks like this:

#genoName    genoStart    genoEnd    strand    repName    repClass    repFamily
chr1    16777160    16777470    +    AluSp    SINE    Alu
chr1    25165800    25166089    -    AluY    SINE    Alu
chr1    33553606    33554646    +    L2b    LINE    L2
chr1    50330063    50332153    +    L1PA10    LINE    L1
chr1    58720067    58720973    -    L1PA2    LINE    L1
chr1    75496180    75498100    +    L1MB7    LINE    L1

and you are interested in the repeat elements by family (such as Alu, L1, L2), you can view the problem as sampling repeat elements (with your sequence) from all elements in the genome. The following steps should give you a measure of enrichment along with a p-value.

First use BEDTools to retrieve all rep elements in your sequence from the UCSC BED file.
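
For small inputs, the retrieval step can also be sketched in plain Python as a stand-in for a BEDTools overlap query. This is only an illustration: the helper names `parse_bed` and `repeats_in_region` and the query region are hypothetical, and the rows are the example rows shown above.

```python
def parse_bed(text):
    """Yield (chrom, start, end, rest) for each non-comment BED line."""
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        yield fields[0], int(fields[1]), int(fields[2]), fields[3:]


def repeats_in_region(repeats, chrom, start, end):
    """Return repeats overlapping the half-open interval [start, end) on chrom."""
    return [r for r in repeats if r[0] == chrom and r[1] < end and r[2] > start]


# Example rows from the UCSC RepeatMasker table above.
RMSK = """\
#genoName genoStart genoEnd strand repName repClass repFamily
chr1 16777160 16777470 + AluSp SINE Alu
chr1 25165800 25166089 - AluY SINE Alu
chr1 33553606 33554646 + L2b LINE L2
chr1 50330063 50332153 + L1PA10 LINE L1
chr1 58720067 58720973 - L1PA2 LINE L1
chr1 75496180 75498100 + L1MB7 LINE L1
"""

genome_repeats = list(parse_bed(RMSK))

# Hypothetical sequence of interest: chr1:16,000,000-34,000,000.
hits = repeats_in_region(genome_repeats, "chr1", 16_000_000, 34_000_000)

# rest = [strand, repName, repClass, repFamily], so index 1 is repName.
print([r[3][1] for r in hits])  # ['AluSp', 'AluY', 'L2b']
```

The linear scan is fine for a handful of regions; for a genome-wide RepeatMasker track, BEDTools is the better choice.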

Then, for each rep element family you found in your seq, count

  • how often it appears in your seq = s

  • how often it appears in the genome = g

Then count

  • how many rep elements are in your seq in total = S

  • how many rep elements are in the genome in total = G

Then,

  • f = s/S is the fraction of the element in your seq

  • F = g/G is the fraction of the element in the genome, and

  • f/F is the enrichment.
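
The counting steps above can be sketched in Python as follows. The function name `enrichment_by_family` and the toy family lists are made up for illustration; the sketch assumes that every element found in your sequence is also present in the genome list, so g is never zero.

```python
from collections import Counter


def enrichment_by_family(seq_families, genome_families):
    """Return {family: (s, g, f/F)} for every family seen in the sequence."""
    seq_counts = Counter(seq_families)
    genome_counts = Counter(genome_families)
    S = len(seq_families)      # all rep elements in the sequence
    G = len(genome_families)   # all rep elements in the genome
    result = {}
    for family, s in seq_counts.items():
        g = genome_counts[family]
        f = s / S              # fraction of the family in the sequence
        F = g / G              # fraction of the family in the genome
        result[family] = (s, g, f / F)
    return result


# Toy input: the repFamily column of the six example rows (genome)
# and of the three rows assumed to fall inside the sequence.
genome_families = ["Alu", "Alu", "L2", "L1", "L1", "L1"]
seq_families = ["Alu", "Alu", "L2"]

table = enrichment_by_family(seq_families, genome_families)
for family, (s, g, ratio) in table.items():
    print(family, s, g, round(ratio, 2))
```

A ratio above 1 means the family makes up a larger share of the elements in your sequence than in the genome.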

To get a p-value for the enrichment, do a Fisher's exact test on the 2x2 table built from s, g, S, and G (this family versus all other elements, in your seq versus in the genome).
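
One way to compute such a p-value without extra libraries is the hypergeometric tail, which equals a one-sided Fisher's exact test. The sketch below is an assumption about how to set the test up, not necessarily the exact variant meant above: it treats the S elements in your sequence as a draw from the G elements in the genome (so g and G include the sequence's own elements), and the function name `fisher_enrichment_pvalue` is made up.

```python
from math import comb


def fisher_enrichment_pvalue(s, g, S, G):
    """One-sided p-value P(X >= s), X ~ Hypergeometric(G total, g hits, S draws).

    Equivalent to a one-sided Fisher's exact test on the table
        [[s,     S - s],
         [g - s, (G - S) - (g - s)]]
    """
    upper = min(g, S)  # X cannot exceed either margin
    total = comb(G, S)
    tail = sum(comb(g, k) * comb(G - g, S - k) for k in range(s, upper + 1))
    return tail / total


# Toy counts: 2 Alu among 3 elements in the sequence,
# 2 Alu among 6 elements in the genome.
p = fisher_enrichment_pvalue(2, 2, 3, 6)
print(p)  # 0.2
```

With a numerical library available, the same table can be passed to an existing Fisher's exact test routine with a "greater" alternative. When many families are tested at once, the p-values should also be corrected for multiple testing.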


Thanks a lot. Is this a generally accepted method of calculation?
I agree with your s and g, but I think S and G should be theoretical frequencies rather than counts of all repeat elements. Assuming the length of the repeat is x, and the lengths of my sequence and of the genome are m and n respectively, then S = m/x and G = n/x.
What do you think?


Yes, you could just as well use sequence lengths instead of simple counts. I am not sure whether there is a generally accepted method.
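
As a small sketch of the length-based variant proposed in the comment above (S = m/x, G = n/x): the repeat length x cancels out of the ratio, so the enrichment reduces to a comparison of repeat densities, (s/m) / (g/n). The function name and all numbers below are hypothetical.

```python
import math


def length_based_enrichment(s, g, m, n, x):
    """Enrichment f/F with S = m/x and G = n/x (lengths in bp)."""
    S = m / x  # theoretical number of repeat-sized slots in the sequence
    G = n / x  # theoretical number of repeat-sized slots in the genome
    return (s / S) / (g / G)


# Hypothetical numbers: 5 hits in a 10 kb sequence,
# 1000 hits in a 3 Gb genome, repeat unit length 3.
ratio = length_based_enrichment(s=5, g=1000, m=10_000, n=3_000_000_000, x=3)
density_ratio = (5 / 10_000) / (1000 / 3_000_000_000)

# x cancels, so both forms agree up to floating-point rounding.
print(math.isclose(ratio, density_ratio))  # True
```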

13 months ago
guliar • 0

Slc1a2 Plpp3 Sfxn5 Pitpnc1 Cst3 Itih3 Phactr1 Tra2a Phkg1 Zfp949 Adrbk2 Polr2a Guf1 A930015D03Rik Slc4a4 Slc25a21 Slc6a11 Fgf14 Abca1 Chuk Zfp36l1 Slc7a11 Gabbr1 Msmo1 Cspg5 Camk2g Sgcd Cdh19 Igf2bp3 Galnt16 Clybl Tprkb Plp1 1700112E06Rik Gm4876 Meis1 Mtss1l 9330159F19Rik Vegfa L3mbtl3 Mgat5 Kcnj10 Arpp21 Dlg2 Robo2 Arhgef10l Nrg1 Ptn Hes5 Pcyt2 Ednrb Adra1a Gabra2 Clu Phyhipl Cables1 Emx2os Caskin1 Ptch1 Nav3 Nnat Lrig1
