I have two sets of sequences (>1000 sequences in each set; sequence length varies from 1000 bp to 100000 bp) and tandem repeat hits for every sequence. I would like to test the hypothesis that the first set is enriched in tandem repeats.
Example:
Set_1_Seq_1 NNNACGTACGTNACGTNNN...
Set_1_Seq_2 ACGTACGTNACGTNNNN...
Set_1_Seq_3 NACGTACGTACGTNNN...
...
Set_2_Seq_1 NNNNACGTACGTNNN...
Set_2_Seq_2 NNNNNNNNNNNNNNN...
Set_2_Seq_3 NNNACGTACGTNNNN...
Tandem repeat unit: ACGT
How can I test whether Set_1 is enriched in tandem repeats compared to Set_2?
My ideas so far:
- Count how many Set_1 sequences have/don't have the tandem repeat; count how many Set_2 sequences have/don't have the tandem repeat. Use the Fisher test.
- Count how many times the repeat appears per sequence in each set; compare these per-sequence hit counts between the sets. (For the example above that would be: Set_1: 3,3,3; Set_2: 2,0,2.) What test could I use for such a comparison?
- Calculate the percentage of every sequence covered by the tandem repeat; compare the coverage percentages between the sets. What test could I use for such a comparison?
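To make the first idea concrete, here is a minimal sketch using only the Python standard library. It counts copies of the unit in the truncated example sequences above (in practice the hits would come from a repeat finder), builds the 2x2 has/doesn't-have table, and computes a one-sided Fisher exact p-value from the hypergeometric tail. The helper names `summarize` and `fisher_greater` are made up for illustration; with SciPy installed, `scipy.stats.fisher_exact(table, alternative="greater")` should give the same p-value.

```python
import re
from math import comb

UNIT = "ACGT"  # repeat unit from the example

# Only the visible prefixes of the example sequences (the "..." part is unknown)
set_1 = ["NNNACGTACGTNACGTNNN", "ACGTACGTNACGTNNNN", "NACGTACGTACGTNNN"]
set_2 = ["NNNNACGTACGTNNN", "NNNNNNNNNNNNNNN", "NNNACGTACGTNNNN"]

def summarize(seq, unit=UNIT):
    """Return (has_repeat, n_copies, percent_covered) for one sequence."""
    n = len(re.findall(unit, seq))  # non-overlapping copies of the unit
    return int(n > 0), n, 100.0 * n * len(unit) / len(seq)

def fisher_greater(a, b, c, d):
    """One-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Probability of a top-left cell >= a with all margins fixed
    (upper tail of the hypergeometric distribution).
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    tail = sum(comb(col1, k) * comb(total - col1, row1 - k)
               for k in range(a, upper + 1))
    return tail / comb(total, row1)

stats_1 = [summarize(s) for s in set_1]
stats_2 = [summarize(s) for s in set_2]

with_1 = sum(s[0] for s in stats_1)   # Set_1 sequences with the repeat
with_2 = sum(s[0] for s in stats_2)   # Set_2 sequences with the repeat
table = (with_1, len(set_1) - with_1, with_2, len(set_2) - with_2)

p_value = fisher_greater(*table)
print("copies per sequence:", [s[1] for s in stats_1], [s[1] for s in stats_2])
print("2x2 table (with, without | with, without):", table)
print("one-sided Fisher p-value:", p_value)
```

Note that this test only uses presence/absence, so it ignores both the number of copies and the sequence length.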
Example of data table:
Seq_name     Length (bp)   Contains repeat (0/1)   Times of repeat   Coverage with repeat (%)
Set_1_Seq1   1000          1                       20                8
Set_1_Seq2   2000          1                       50                10
Set_1_Seq3   18000         1                       1000              22
...
Set_2_Seq1   100000        1                       20                0.4
Set_2_Seq2   5000          0                       0                 0
Set_2_Seq3   10000         0                       0                 0
...
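For ideas 2 and 3, one option (an assumption on my part, not an established recipe for this problem) is to normalize the hit counts by sequence length and compare the two sets with a label-shuffling permutation test. The sketch below uses the example rows from the table, with repeats per kb as the statistic; the same function could be fed the coverage percentages instead. `per_kb` and `permutation_p` are hypothetical helper names. A rank-based alternative would be `scipy.stats.mannwhitneyu(x, y, alternative="greater")`.

```python
import random
from statistics import mean

# (length_bp, repeat_copies) rows from the example table; "..." rows omitted
set_1 = [(1000, 20), (2000, 50), (18000, 1000)]
set_2 = [(100000, 20), (5000, 0), (10000, 0)]

def per_kb(rows):
    """Repeat copies per 1000 bp, so long sequences do not dominate."""
    return [1000.0 * copies / length for length, copies in rows]

def permutation_p(x, y, n_perm=10000, seed=0):
    """One-sided p-value for mean(x) > mean(y), obtained by shuffling set labels."""
    rng = random.Random(seed)
    observed = mean(x) - mean(y)
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[:len(x)]) - mean(pooled[len(x):])
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one so p is never exactly 0

density_1 = per_kb(set_1)
density_2 = per_kb(set_2)
p_value = permutation_p(density_1, density_2)

print("repeats per kb, Set_1:", density_1)
print("repeats per kb, Set_2:", density_2)
print("one-sided permutation p-value:", p_value)
```

With only three sequences per set the p-value cannot get very small; with >1000 sequences per set the same code applies, just more slowly.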
My question is: how can I test enrichment for a given tandem repeat between two sets of sequences?
- Is it OK to use the Fisher test for solution 1?
- What test could I use for solutions 2 and 3?
I really hope someone will help me with this.
PS.:
A similar question was asked (how to find the enriched repeat elements between two sequences), but the Fisher test doesn't take the number of repeats into account.
Edit.
A nice example of repeat enrichment per set of sequences: "Relationship of repetitive elements to EZH2 sites" from 22948768. In Figure A they calculated odds for different sites, but I can't work out from the article or supplements how they did this.
What paper is that figure from?
Spreading of X chromosome inactivation via a hierarchy of defined Polycomb stations (Pinter et al., Genome Res. 2012)
Did you manage to find out more on this type of analysis? If yes, can you please share it?
What do you want to know?