I am trying to get a comprehensive list of simple repeats (mono-, di-, tri-, and tetranucleotide) in the human genome (hg19). I downloaded simpleRepeat.txt.gz from UCSC, but it seems to be missing some of the repeats we are interested in. For example, chr1:981861-981868[CCCCCCCC] and chr1:1116223-1116230[GGGGGGGG] are mononucleotide repeats we want to look at, but neither is in the UCSC list. I then tried to generate a list myself using TRF, but with the default parameters it still did not report some of these repeats, e.g., chr1:981861-981868[CCCCCCCC]. Can someone provide some insight here:
- Is there anywhere I can download a truly 'comprehensive' list of simple repeats?
- If the answer to the first question is no, what would be the best way to curate such a list? Is running tools like TRF or RepeatMasker a good idea?
- If TRF is what you would suggest, how should I make it report the mononucleotide repeats that were missing with the default parameters?
Thanks
Have you looked at the UCSC Table Browser? Check the "Repeats" group. It contains several tracks, and you can download the data for each of them.
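If none of those tracks contain runs as short as the 8 bp examples in the question, scanning the sequence directly may be simpler than tuning TRF. As far as I know, TRF's default minimum alignment score is 50 with a match weight of 2, so a perfect run needs to be roughly 25 bp or longer to be reported, which would explain why an 8 bp homopolymer is missing. Below is a minimal sketch of a regex-based scan for perfect repeats with unit sizes 1 to 4. The function name `find_simple_repeats`, the `min_len` threshold of 8 bp, and the `offset` parameter are my own assumptions for illustration, not part of TRF or any UCSC tool, and the sequence in the example is a toy string, not real hg19 sequence.

```python
import re


def find_simple_repeats(seq, max_unit=4, min_len=8, offset=0):
    """Return perfect tandem repeats as (start, end, unit, copies).

    Coordinates are 1-based and inclusive, shifted by `offset`.
    `min_len` is the minimum total repeat length in bp (an assumed
    threshold; adjust it to your own definition of a repeat).
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        # Smallest number of copies that reaches min_len (at least 2).
        min_copies = max(2, -(-min_len // k))
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are just a shorter unit repeated,
            # e.g. "CC" or "ACAC", so each repeat is reported once.
            if any(unit == unit[:j] * (k // j)
                   for j in range(1, k) if k % j == 0):
                continue
            copies = (m.end() - m.start()) // k
            hits.append((offset + m.start() + 1, offset + m.end(), unit, copies))
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence for illustration only (not taken from hg19).
    toy = "ATGT" + "CCCCCCCC" + "TGT" + "ACACACAC" + "GT"
    for start, end, unit, copies in find_simple_repeats(toy):
        print(start, end, unit, copies)
```

To use something like this on hg19 you would read each chromosome from the FASTA file and pass its sequence to the function one chromosome at a time. Note that it only finds perfect repeats, so interrupted or mismatched repeats would still need TRF or RepeatMasker.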