Question

Statistics: Tandem Repeat Enrichment Between Two Sets Of Sequences

2

12.1 years ago

PoGibas 5.1k

I have two sets of sequences (>1000 sequences in each set; sequence length varies from 1,000 bp to 100,000 bp), with tandem repeat hits annotated in every sequence. I would like to test the hypothesis that the first set is enriched in tandem repeats.
Example:

 Set_1_Seq_1   NNNACGTACGTNACGTNNN...
 Set_1_Seq_2   ACGTACGTNACGTNNNN...  
 Set_1_Seq_3   NACGTACGTACGTNNN...  
 ...
 Set_2_Seq_1   NNNNACGTACGTNNN...  
 Set_2_Seq_2   NNNNNNNNNNNNNNN...   
 Set_2_Seq_3   NNNACGTACGTNNNN...   

 Tandem repeat unit: ACGT

How can I test whether Set_1 is enriched in the tandem repeat compared to Set_2?

The approaches I have considered:

1. Count how many Set_1 sequences do or do not contain the tandem repeat, and do the same for Set_2. Then use Fisher's exact test.
2. Count how many times the repeat appears in each sequence, then compare these per-sequence counts between sets. (For the example above that would be Set_1: 3, 3, 3; Set_2: 2, 0, 2.) What test could I use for such a comparison?
3. Calculate the percentage of each sequence covered by the tandem repeat, then compare the coverage percentages between sets. What test could I use for such a comparison?

Example of data table:

Seq_name       Length       Contains repeat (0/1)       Times of repeat       Coverage with repeat (%)   
Set_1_Seq1      1000                 1                       20                         8  
Set_1_Seq2      2000                 1                       50                         10  
Set_1_Seq3      18000                1                       1000                       22  
...  
Set_2_Seq1      100000               1                       20                         0.4
Set_2_Seq2      5000                 0                       0                          0 
Set_2_Seq3      10000                0                       0                          0  
...

My question is: how can I test enrichment for a given tandem repeat between two sets of sequences?
- Is it OK to use Fisher's exact test for solution 1?
- What test could I use for solutions 2 and 3?

I really hope someone will help me with this.

PS.:
A similar question was asked about how to find the enriched repeat elements between two sequences, but Fisher's test doesn't take the number of repeats into account.

Edit.
Nice example of repeat enrichment per set of sequences (Relationship of repetitive elements to EZH2 sites from 22948768). In Figure A they calculated odds for different sites, but from the article or supplements I can't understand how they did this.

Relationship of repetitive elements to EZH2 sites from 22948768.

statistics enrichment • 3.9k views

ADD COMMENT • link updated 12.1 years ago by matted 7.8k • written 12.1 years ago by PoGibas 5.1k

0

What paper is that figure from?

ADD REPLY • link 12.1 years ago by Gww ★ 2.7k

0

Spreading of X chromosome inactivation via a hierarchy of defined Polycomb stations (Pinter et al, Genome Res. 2012)

ADD REPLY • link 12.1 years ago by PoGibas 5.1k

0

Did you manage to find out more on this type of analysis? If yes, can you please share it?

ADD REPLY • link 11.6 years ago by roll ▴ 350

0

What do you want to know?

ADD REPLY • link 11.6 years ago by PoGibas 5.1k

score 1 · Answer 1 · 2013-06-10

There are a lot of reasonable ways to attack these problems, but my personal bias would be to assess the significance of all results with permutation tests. The basic idea is that you pick any test statistic you like (you gave solid choices as your #1, #2, and #3), then measure it on the original dataset and on many permuted versions of it. In your case, a permuted version means one where the "Set 1" and "Set 2" labels have been shuffled across the sequences. The statistics from the permuted datasets give you an empirical null distribution, and you judge significance by where the observed value falls in it (for example, the fraction of permutations with a statistic at least as extreme as the observed one). With this, you only have to worry about the samples being exchangeable under the null hypothesis, as opposed to the stronger assumptions I'd think you'd have to make to apply specific parametric tests.
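To make the label-shuffling idea concrete, here is a minimal sketch in Python. It is only one way to do it, under some assumptions of mine: the test statistic is the difference in mean per-sequence coverage (your #3), the test is one-sided (Set 1 higher than Set 2), and the names `permutation_test`, `n_perm`, and `seed` are made up for illustration. The six input values are just the rows from your example table, so the result itself means nothing.

```python
import random


def permutation_test(set1, set2, n_perm=10000, seed=0):
    """One-sided permutation test on the difference in means.

    set1, set2: per-sequence values (e.g. % coverage by the repeat).
    Returns (observed_difference, empirical_p_value).
    """
    rng = random.Random(seed)
    n1 = len(set1)
    pooled = list(set1) + list(set2)

    def mean_diff(values):
        # The first n1 values play the role of "Set 1", the rest "Set 2".
        first, second = values[:n1], values[n1:]
        return sum(first) / len(first) - sum(second) / len(second)

    observed = mean_diff(pooled)

    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # equivalent to shuffling the set labels
        # Small tolerance guards against floating-point rounding on ties.
        if mean_diff(pooled) >= observed - 1e-12:
            hits += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (hits + 1) / (n_perm + 1)
    return observed, p_value


# Coverage (%) values taken from the example table in the question.
set1_coverage = [8, 10, 22]
set2_coverage = [0.4, 0, 0]
observed, p_value = permutation_test(set1_coverage, set2_coverage)
print(f"observed difference = {observed:.2f}, p = {p_value:.4f}")
```

Swapping in a different statistic (fraction of sequences with a repeat, mean repeat count, a median, a length-normalised count) only means replacing `mean_diff`; the shuffling loop stays the same. Note that with only three sequences per set there are just 20 distinct label assignments, so the p-value cannot go much below 0.05; with >1000 sequences per set this is not a concern.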