Is there a comprehensive reference for all the STRs (short tandem repeats) present in the human genome? I looked into a few different resources but couldn't find one that lists all the repeats in hg38.
For example, I downloaded the UCSC simple repeats annotated by Tandem Repeats Finder (http://hgdownload.cse.ucsc.edu/goldenpath/hg38/database/simpleRepeat.txt.gz), but the list misses several repeats. I downloaded another reference from GangSTR (https://github.com/gymreklab/GangSTR#references), and this one misses some repeats too. There was only about 30% overlap between the two databases.
On a slightly different note, I noticed that the shortest repeat region length (not repeat unit length) in hg38.simpleRepeat.txt.gz was 25. So any repeat region shorter than 25 bases is not included in it.
Thanks!
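For anyone who wants to reproduce the 25-base observation, here is a minimal sketch. It assumes the UCSC `simpleRepeat` table is tab-separated with the columns `bin`, `chrom`, `chromStart`, `chromEnd` first, and that coordinates are 0-based half-open, so region length is `chromEnd - chromStart`. The function name `shortest_repeat_region` is just an illustrative helper, not part of any tool.

```python
def shortest_repeat_region(lines):
    """Return the smallest (chromEnd - chromStart) seen in simpleRepeat rows.

    Assumes tab-separated rows whose first four columns are
    bin, chrom, chromStart, chromEnd (0-based, half-open).
    Returns None if no usable row is found.
    """
    shortest = None
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip blank or malformed rows
        length = int(fields[3]) - int(fields[2])
        if shortest is None or length < shortest:
            shortest = length
    return shortest


# Usage against the downloaded file (path is an assumption):
#   import gzip
#   with gzip.open("simpleRepeat.txt.gz", "rt") as fh:
#       print(shortest_repeat_region(fh))
```

Tandem Repeats Finder only reports alignments above a minimum score, so a floor on region length in this table is expected rather than a sign of a damaged download.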
Could you give examples of some missed repeats? It isn't clear how you're defining what is missed.
Thank you! For example, these regions are not found in either repeat database:
CHR START STOP SEQ Repeat_Unit Repeat_times
chr1 930278 930287 GGCGGCGGC GGC 3
chr1 939279 939288 CTGCTGCTG CTG 3
chr1 942602 942610 GCGCGCGC GC 4
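The rows above are all very short perfect repeats (8 to 9 bases, 3 to 4 copies), which fall under the 25-base floor noted earlier, so they may simply be below the thresholds those databases use. If repeats this short are needed, one option is to scan the reference sequence directly. Below is a rough sketch using a regular expression with a backreference. The function name `find_short_strs` and the default thresholds are my own assumptions, not settings from TRF or GangSTR, and it only finds perfect (uninterrupted) repeats.

```python
import re


def find_short_strs(seq, min_unit=1, max_unit=6, min_copies=3):
    """Find perfect tandem repeats in a DNA string.

    Returns a list of (start, end, unit, copies) tuples with 0-based,
    half-open coordinates, sorted by start position.
    Thresholds are illustrative defaults, not values from any database.
    """
    seq = seq.upper()  # reference FASTA files are often soft-masked
    hits = []
    for k in range(min_unit, max_unit + 1):
        # A k-base unit followed by at least (min_copies - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_copies - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are themselves repeats of a shorter unit
            # (e.g. "AA" or "GCGC"); those are reported at the smaller k.
            if unit in (unit + unit)[1:-1]:
                continue
            copies = (m.end() - m.start()) // k
            hits.append((m.start(), m.end(), unit, copies))
    return sorted(hits)


print(find_short_strs("TTGGCGGCGGCAA", min_unit=2, max_unit=3))
```

Be aware that with thresholds this low the number of hits genome-wide will be very large, which is likely one reason curated STR references leave them out.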