I have a nucleotide sequence (FASTA format) with a size limit of 20 kb. I also have my own genome sequence file (a repeat database, also in FASTA format, 200 GB) on my local machine. I want to identify repetitive elements in the genome sequence.
Questions:
1) Which software is best? I have heard about RepeatMasker.
2) If I use RepeatMasker, what format should the repeat library be in? Do I need to convert the FASTA file to some other format?
3) What are low-complexity DNA sequences and interspersed repeats? (Off topic, of course; you don't have to answer it.)
First of all, for what purpose do you want to identify repetitive elements in your genome sequence?
If you are just interested in masking them from the genome, I would use RepeatMasker with a repeat database from RepBase (www.girinst.org; yes, FASTA format is OK) to mask transposable elements by similarity to already described transposons.
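For what it's worth, here is a rough sketch of how a RepeatMasker run with a FASTA repeat library could be scripted. The file names (query.fa, my_repeats.fa), the output directory and the thread count are placeholders I made up; -lib, -pa and -dir are standard RepeatMasker options, but check them against your installed version before relying on this.

```python
import shlex


def repeatmasker_command(query_fasta, library_fasta, out_dir="rm_out", threads=4):
    """Build a RepeatMasker command line that uses a custom FASTA library.

    All file names passed in are placeholders chosen for illustration.
    """
    return [
        "RepeatMasker",
        "-lib", library_fasta,   # custom repeat library in FASTA format
        "-pa", str(threads),     # number of parallel jobs
        "-dir", out_dir,         # directory for the output files
        query_fasta,             # the sequence to be masked
    ]


cmd = repeatmasker_command("query.fa", "my_repeats.fa")
print(shlex.join(cmd))

# To actually run it (requires RepeatMasker to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems if your paths contain spaces.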
To identify tandem repeats (typically minisatellites, repeated motifs of 20-50 nucleotides) you can use TRF (http://tandem.bu.edu/trf/trf.html).
TANTAN also identifies tandem repeats, as well as low-complexity sequences (ATATAT, for example): http://www.cbrc.jp/tantan/
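To make the ATATAT example concrete, here is a toy sketch that finds exact, short tandem repeats with a regular expression. This is not the algorithm used by TRF or TANTAN (both tolerate mismatches and use statistical scoring); the function name and the thresholds are my own choices, purely for illustration.

```python
import re


def find_simple_tandem_repeats(seq, max_unit=6, min_copies=4):
    """Return (start, end, unit, copies) for exact tandem repeats.

    Only perfect repeats of a 1..max_unit nucleotide motif are reported;
    real tools such as TRF also handle imperfect copies.
    """
    # A short motif (group 1) followed by at least min_copies - 1 more copies.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), match.end(), unit, copies))
    return hits


example = "GGC" + "AT" * 5 + "CC"
for start, end, unit, copies in find_simple_tandem_repeats(example):
    print(f"{unit} x {copies} at {start}-{end}")
```

Coordinates are 0-based and end-exclusive, as in Python slicing.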
If what you are interested in is identifying and classifying transposable elements in your genome, there are various tools to identify different types of transposons, but that calls for a longer mail... let me know if that's the case.
Please read the RepeatMasker help pages and see if they answer your question: http://www.repeatmasker.org/webrepeatmaskerhelp.html
Partly. I still don't know the format of the reference repeat databases. Is a huge FASTA file okay?
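One way to sanity-check a large FASTA library before handing it to RepeatMasker is to stream through it line by line, so the file never has to fit in memory. The sketch below counts records and bases and lists headers without a "#class" suffix. As far as I recall, the RepeatMasker documentation recommends IDs of the form >repeatname#class/subclass for custom libraries, but treat that as an assumption and confirm it in the help pages. The function name and the example records are made up.

```python
import io


def summarize_fasta(handle):
    """Stream a FASTA file and return (n_records, n_bases, unclassified_names).

    unclassified_names lists record IDs with no '#' in them, i.e. IDs that do
    not follow the (assumed) name#class/subclass convention.
    """
    n_records = 0
    n_bases = 0
    unclassified = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_records += 1
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if "#" not in name:
                unclassified.append(name)
        else:
            if n_records == 0:
                raise ValueError("sequence data found before the first '>' header")
            n_bases += len(line)
    return n_records, n_bases, unclassified


# Small in-memory example; for a real file use: with open(path) as handle: ...
example = io.StringIO(">AluY#SINE/Alu\nGGCCGGGCGC\nGGTGGCTCAC\n>myRepeat1\nATATATAT\n")
records, bases, unclassified = summarize_fasta(example)
print(records, bases, unclassified)
```

Because the file is read one line at a time, the same function should work on a very large file, although a single pass over 200 GB will still take a while.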