How to extract all the simple repeats from the hg19 reference genome

7.7 years ago

Jackie ▴ 70

I am trying to get a comprehensive list of simple repeats (mono-, di-, tri-, tetra-nucleotide) in the human genome (hg19). I have downloaded simpleRepeat.txt.gz from UCSC, but it seems to be missing some of the repeats we are interested in. For example, chr1:981861-981868[CCCCCCCC] and chr1:1116223-1116230[GGGGGGGG] are mononucleotide repeats we would like to look at, but they are not in the UCSC list. I therefore tried to generate a list using TRF, but with the default parameters TRF still did not report some of these repeats, e.g., chr1:981861-981868[CCCCCCCC]. Can someone provide some insight here:

1. Is there anywhere I can download a really 'comprehensive' list of simple repeats?
2. If the answer to question 1 is no, what would be the best way to curate such a list? Is running tools like TRF or RepeatMasker a good idea?
3. If you would suggest TRF, how can I make it report the mononucleotide repeats that I was missing with the default parameters?

Thanks

simple repeats reference genome • 4.5k views

updated 7.7 years ago by Alex Reynolds 36k • written 7.7 years ago by Jackie ▴ 70

Have you looked at the UCSC Table Browser? Check the "Repeats" group; it offers several tracks whose data you can download.

7.7 years ago by GenoMax 150k

3

7.7 years ago

Pierre Lindenbaum 166k

The following C program finds mononucleotide repeats (runs of the same base):

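	#include <stdio.h>
	#include <stdlib.h>
	#include <ctype.h>

	#define SEQNAME_MAX BUFSIZ
	/* len_repeat counts the bases that repeat the first base of a run, so a run
	   of 8 identical bases is printed as [7], and len_repeat>5 keeps runs of 7
	   or more bases. Printed start and end are 1-based and inclusive. */
	#define REPORT do { \
	    if(len_repeat>5) printf("%s\t%d\t%d\t%c[%d]\n",name,pos-len_repeat,pos,prev_c,len_repeat); \
	    len_repeat=0; \
	    } while(0)

	int main(int argc,char** argv)
	{
	    int c;
	    int pos=0;          /* number of sequence characters read so far */
	    int prev_c=-1;      /* previous sequence character, upper-cased */
	    int len_repeat=0;
	    char name[SEQNAME_MAX];
	    name[0]=0;
	    for(;;)
	    {
	        switch((c=fgetc(stdin)))
	        {
	        case EOF:
	            REPORT;     /* flush a run that ends at the end of the input */
	            return EXIT_SUCCESS;
	        case '>':
	        {
	            /* new FASTA record: flush the pending run, then read the
	               name up to the first whitespace */
	            int space=0;
	            int name_length=0;
	            REPORT;
	            pos=0;
	            while((c=fgetc(stdin))!=EOF && c!='\n')
	            {
	                if(space) continue;
	                if(isspace(c)) { space=1; continue;}
	                /* keep room for the terminating 0 */
	                if(name_length<SEQNAME_MAX-1) name[name_length++]=c;
	            }
	            name[name_length]=0;
	            prev_c=-1;
	            len_repeat=0;
	            break;
	        }
	        case '\n':case '\r':case ' ':break;
	        case 'a': case 'A':
	        case 't': case 'T':
	        case 'g': case 'G':
	        case 'c': case 'C':
	        {
	            c= toupper(c);
	            if(prev_c==c)
	            {
	                ++len_repeat;
	            }
	            else
	            {
	                REPORT;
	            }
	            prev_c=c;
	            ++pos;
	            break;
	        }
	        default:
	            /* any other character (e.g. N) ends the current run */
	            REPORT;
	            prev_c=c;
	            ++pos;
	            break;
	        }
	    }
	    return EXIT_SUCCESS;
	}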

compile

gcc biostar267241.c

example:

curl -s "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr1.fa.gz" | gunzip -c | ./a.out | grep -E '(981861|1116223)'
chr1    981861  981868  C[7]
chr1    1116223 1116230 G[7]

Thank you so much for posting the C program. It seems to work perfectly, but I have another question: does this program find only mononucleotide repeats, or any repeat with a total length > 5 bp?

7.7 years ago by Jackie ▴ 70

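Only mononucleotide repeats (runs of the same base). Note that len_repeat does not count the first base of a run, so the len_repeat>5 test keeps runs of 7 or more identical bases; the 8-base runs in the example are printed as [7].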

7.7 years ago by Pierre Lindenbaum 166k
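
Since the program above only handles single-base runs, here is a minimal sketch, in C, of how the same kind of scan could be extended to di-, tri- and tetranucleotide units. It is not Pierre's code: the names Repeat, run_length and find_repeats, the in-memory string input and the min_length threshold are all assumptions made for illustration, and only perfect (uninterrupted) repeats are detected.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* One reported repeat: 0-based start, total length in bases, unit size.
   (Hypothetical type, introduced for this sketch only.) */
typedef struct {
    size_t start;
    size_t length;
    int period;
} Repeat;

/* Return 1 if c is A, C, G or T in either case. */
static int is_base(char c)
{
    int u = toupper((unsigned char)c);
    return u == 'A' || u == 'C' || u == 'G' || u == 'T';
}

/* Length in bases of the perfect repeat with the given unit size that starts
   at `start`: every base after the first unit must equal the base `period`
   positions before it. Returns 0 if the first unit does not fit in the
   sequence or contains a non-ACGT character. */
static size_t run_length(const char *seq, size_t n, size_t start, int period)
{
    size_t p = (size_t)period;
    size_t i;
    if (start + p > n) return 0;
    for (i = start; i < start + p; i++)
        if (!is_base(seq[i])) return 0;
    while (i < n &&
           toupper((unsigned char)seq[i]) == toupper((unsigned char)seq[i - p]))
        i++;
    return i - start;
}

/* Greedy left-to-right scan. At each position the unit sizes 1..max_period
   are tried, smallest first; a hit needs at least min_length bases and at
   least two full copies of the unit. Stores up to max_out hits in `out`
   and returns how many were stored. */
size_t find_repeats(const char *seq, int max_period, size_t min_length,
                    Repeat *out, size_t max_out)
{
    size_t n = strlen(seq);
    size_t pos = 0, found = 0;
    while (pos < n && found < max_out) {
        size_t best_len = 0;
        int best_period = 0;
        int period;
        for (period = 1; period <= max_period; period++) {
            size_t len = run_length(seq, n, pos, period);
            if (len >= min_length && len >= 2 * (size_t)period && len > best_len) {
                best_len = len;
                best_period = period;
            }
        }
        if (best_period > 0) {
            out[found].start = pos;
            out[found].length = best_len;
            out[found].period = best_period;
            found++;
            pos += best_len;   /* continue after the reported repeat */
        } else {
            pos++;
        }
    }
    return found;
}
```

For example, calling find_repeats on "ACGTCCCCCCCCTTGAAGAAGAAGAACT" with max_period 4 and min_length 8 stores two hits: the C run (0-based start 4, 8 bases, period 1) and the GAA repeat (start 14, 12 bases, period 3). Because the scan is greedy, a repeat is reported in whatever phase the scan first meets it, and a whole chromosome would have to be loaded into memory first, unlike the streaming program above.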

If an answer was helpful, you should upvote it; if it resolved your question, you should mark it as accepted.

7.7 years ago by WouterDeCoster 47k

2

7.7 years ago

Alex Reynolds 36k

You could download repeats for hg19 from the RepeatMasker folks and convert them to BED with convert2bed, so that you can do set operations:

$ wget -qO- http://www.repeatmasker.org/genomes/hg19/RepeatMasker-rm405-db20140131/hg19.fa.out.gz | gunzip -c - | convert2bed --input=rmsk - > hg19.fa.out.bed

You could do ad-hoc searches with bedops, piping in your region of interest:

$ echo -e 'chr1\t981861\t981868' | bedops -e 1 hg19.fa.out.bed -

Or pass in a file of regions of interest:

$ bedops -e 1 hg19.fa.out.bed roi.bed > answer.bed

Perhaps you could combine this with the output of Pierre's program to build a result set that contains both simple repeats and more complex repeat hits.
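
To combine the two, the output of Pierre's program would first need to be in BED coordinates. Its start and end columns match the 1-based, inclusive positions used in the question, whereas BED is 0-based and half-open, so only the start has to be shifted by one. A minimal sketch of that conversion, in C (the function name to_bed and the buffer sizes are assumptions made for illustration):

```c
#include <stdio.h>

/* Convert one output line of the scanner ("name<TAB>start<TAB>end<TAB>label",
   1-based inclusive) into a BED line (0-based start, half-open end).
   Writes the result to `out` and returns 0, or returns -1 if the line is
   malformed or does not fit in `out`. Hypothetical helper, not part of
   BEDOPS or of the original program. */
int to_bed(const char *line, char *out, size_t out_size)
{
    char name[256];
    char label[64];
    long start, end;
    int written;

    if (sscanf(line, "%255s %ld %ld %63s", name, &start, &end, label) != 4)
        return -1;
    if (start < 1 || end < start)
        return -1;

    /* BED start = 1-based start - 1; BED end = 1-based inclusive end. */
    written = snprintf(out, out_size, "%s\t%ld\t%ld\t%s",
                       name, start - 1, end, label);
    if (written < 0 || (size_t)written >= out_size)
        return -1;
    return 0;
}
```

With this, the line "chr1 981861 981868 C[7]" (tab-separated) becomes "chr1 981860 981868 C[7]", which can then be sorted and passed to bedops together with hg19.fa.out.bed.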

Thank you, Alex, that's a great resource. I have downloaded the RepeatMasker (RM) file, and I think combining it with the list generated using Pierre's code will give a good starting list.

However, I am still trying to understand why even this RM file is missing some simple repeats, e.g., the trinucleotide repeat chr1:6680069-6680085 [GAA]n. For those of you who understand RM well: are there criteria a simple repeat must meet to be included in the final RM list, e.g., a copy number of the unit >= 10, or something like that? I ask because most of the longer repeats are present in the RM file.

7.7 years ago by Jackie ▴ 70

It's unclear to me what parameters were used to generate these files. The best people to ask would probably be the RepeatMasker folks.

7.7 years ago by Alex Reynolds 36k
