How do I run RepeatMasker?

9.4 years ago

ksi216 ▴ 80

Hello, I'm new to Unix and I installed RepeatMasker, but I'm clueless as to which commands I need to enter to run it. Thanks

repeatmasker • 25k views

ADD COMMENT • link updated 15 months ago by Andrzej Zielezinski 11k • written 9.4 years ago by ksi216 ▴ 80

Why would you want to run RepeatMasker? I'm asking this honestly, because I have no idea why people do this. It generally seems like a bad idea. So, what are you trying to accomplish?

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by Brian Bushnell 20k

Repeat identification and masking is usually the first step in genome annotation. Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotations. Worse still, many transposon open reading frames (ORFs) look like true host genes to gene predictors (e.g. FGENESH, Augustus, GENSCAN, SNAP), causing portions of transposon ORFs to be added as additional exons to gene predictions, completely corrupting the final gene annotations. Good repeat masking is crucial for the accurate annotation of protein-coding genes.

ADD REPLY • link 9.4 years ago by Andrzej Zielezinski 11k

Thanks for the explanation. That's roughly the same as what the JGI annotation group tells me. But I still don't understand it. What do you mean by "repeats can seed millions of spurious BLAST alignments"? In my opinion... that actually means, there are millions of legitimate alignments that you wish to ignore, because that would be convenient for your publication. How do you decide an alignment is something you want to ignore?

As far as I can tell, repeat-masking is something people do so that current inadequate software produces sort-of-reasonable output. At JGI, we transitioned from Illumina fungal assemblies to PacBio assemblies. The PacBio assemblies are vastly more accurate because they can correctly resolve long repeats. Initially, the fungal annotation group hated these new PacBio assemblies, because they contain repeats, and broke their current software. But now, they are adjusting, because they finally understand that the PacBio assemblies are actually the truth (or, at least, closer to the truth) compared to assemblies based on short Illumina reads.

I believe that masking is very useful for conservative contaminant removal, to ensure that there is no possibility of false-positive contaminant identification. But running RepeatMasker is asinine. It sounds like people want to run it to speed up their BLAST searches, or use it to filter out legitimate hits that are inconvenient.

If you are a legitimate researcher, you need to examine all hits. If you publish a paper saying "The top hit was X, therefore X has the greatest effect on Y", great! But, if you state that, based on mapping to repeat-masked genomes, then your results may be valid, and they may not be valid. It depends on things outside of your control.

Personally, I think RepeatMasker is a piece of crap. Normally, when I feel this way, I write a superior alternative. But in this case I feel that RepeatMasker is a detriment to humanity and should be extinguished. It would certainly be nice if genomes contained no repeats. But, they do; repeats are important and need to be dealt with, rather than ignored or masked. There are a lot of people who annotate genomes, and obviously, it would be easier for them if all genomes were repeat-free, and had no transposons, etc. But that's not the real world. In the real world, people have to annotate actual assemblies, that contain actual repeats. It's nice to live in an imaginary world of Illumina assemblies that have a maximum read length of 300bp. But the modern world has PacBio 30kbp sequences, and can correctly assemble organisms containing very long repeats.

So - in my opinion, RepeatMasker is a great tool for people with no bioinformatics knowledge, who want to publish massive amounts of crap, and could not care less about directing future scientists. If you actually care about the real world, please use real unmasked data.

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by Brian Bushnell 20k

You don't seem to understand the comment you responded to at all or the point of repeat-masking for gene annotation. The real problem with this type of viewpoint is that you don't understand the biology in the first place, then you develop yet another undocumented, untested tool that you alone deem as being "superior" while in reality, it is of no actual use to biology. I'm not making a personal statement, so don't be offended. This is my experience in working at numerous institutes. A lot of people say things like this and write tools that are faster, but the rationale behind the approach is complete nonsense. I really worry about this issue because biologists often don't think about the tools they are using.

In most plants, the largest ORFs are from transposons, which may carry their own internal promoters and have numerous coding domains. I've spent my entire academic career studying transposons and I can tell you that it is very difficult to distinguish host genes from transposons, especially in non-model systems. If you write a tool that is superior for this purpose, I'll be the first to use it. I'll add that I don't use RM for repeat identification, but the masking approach is sound.

edit: Think about it this way: RM is probably the most ubiquitous tool in bioinformatics behind BLAST. Is that because nobody in biology has any idea what they are doing, or could it be that you don't understand? I'm not trying to be argumentative, and I agree with you that RM could be better, but the approach is obviously robust and supported by decades of experimentation.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.4 years ago by SES 8.6k

You don't seem to understand the comment you responded to at all or the point of repeat-masking for gene annotation. The real problem with this type of viewpoint is that you don't understand the biology in the first place,

That is absolutely correct, which is why I clearly stated that I don't understand the point of repeat-masking. And I agree, I don't understand the biology behind it, either. Is there a good reason for repeat-masking? Maybe! But I have yet to hear it, and it certainly has not been described in this thread.

I have talked to people who like to do masking prior to annotation, and they were unable to provide an informative description of what they want to mask, or why they want to mask it. Why is that? Basically, they use some broken software that gives incorrect results when it's run on a good assembly.

If you want correct answers, the solution is not to use random pieces of ancient software that mask a huge portion of your reference... but, rather, to map to everything, and see what your sequences map to best.

It is not prudent to unquestioningly use some protocol or software just because lots of other people use it.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.4 years ago by Brian Bushnell 20k

Consider the first genome papers that came out around 15 years ago, which were Arabidopsis and human. Both predicted about 100k genes in each species. This was a gross overestimate due to the lack of computational tools to identify transposons and to mask them prior to gene prediction. There are many papers on this subject. Yes, long reads will help resolve repeats, but they will not solve the problem, nor will they help with the current assemblies we have to study now. I'm committed to working on this problem but it is very challenging.

The major complication in annotation is that many TEs insert into genes, and in fact, all human genes have Alu insertions. It is very difficult to identify these events, and because TEs make up the majority of DNA on the planet, that makes masking a very important task. Thus, it is not surprising that RM is an important tool. As I said before, I'm not a huge advocate of this specific software, but it is maintained by a team of developers, has great documentation and it works on every OS. To say it is ancient and is 'crap' is off-base and makes you sound kind of ridiculous. Repeat identification, and masking, is far from a solved problem and being disparaging about current approaches doesn't really help. I'd be happy to discuss approaches for better tools, or explain the limitations of the current tools, because that is what I work on.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.4 years ago by SES 8.6k

all human genes have Alu insertions

Oh, come on. I've worked with human genetics, and that's not true. Unless you mean "there exists a human somewhere with this mutation in a specific gene", which is irrelevant to the human genome.

Even if it were true, gene annotation software should simply deal with repeated elements, rather than requiring them to be masked.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.4 years ago by Brian Bushnell 20k

Please do a basic web search of the topics mentioned above.

ADD REPLY • link 9.4 years ago by SES 8.6k

No matter how pervasive repeats may be, that does not excuse masking them prior to annotation. If an annotation program cannot handle repeated sequences, then the program needs to be improved.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.4 years ago by Brian Bushnell 20k

Are you referring to a tool that could predict TEs and genes? That would be great in theory, but the complexity of this task is enormous.

It is a bit puzzling why you are clinging to this idea about repeat masking. You asked a legitimate question about masking, and you got very clear answers. But it sounds like you have convinced yourself of this opinion on the subject and you have chosen not to believe anyone. My suggestion would be to keep an open mind. There are decades of research to support these approaches, which is what I was trying to express in my last comment (though it was a bit terse and could have been stated better). If you can find any evidence to support your view, that would be justification in my mind for discussing alternative approaches. I'd be happy to discuss this further if there is some tangible reason you can provide for not masking, but otherwise this would not be productive. Cheers.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.4 years ago by SES 8.6k

I thought I would provide some links for the sake of discussion and try to explain the issue better, since this is what I study. Here is a good paper on the subject: Consistent over-estimation of gene number in complex plant genomes. The section on repeat masking in the Maker documentation also describes some of the reasons mentioned above about the need to mask genomes prior to gene annotation. In addition to being transcribed in many species, TEs have many hallmark features of host genes, they may contain gene fragments, and they insert into genes and other TEs. This creates a complex landscape in the genome, which is far from being random but it presents enormous computational and biological challenges. The main issue with gene annotation is not "repeat" sequences in a mathematical sense. The issue is with biological features that appear unique and contain protein and transcriptome support, ORFs, promoters, etc. The result is that gene number is going to be over-estimated by a long shot if these factors are not taken into account.

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.4 years ago by SES 8.6k

thanks I got it to work

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by ksi216 ▴ 80

9.4 years ago

Andrzej Zielezinski 11k

Running RepeatMasker is pretty straightforward:

RepeatMasker --species arabidopsis yoursequence.fasta

To see a full list of options run: RepeatMasker -h

	NAME
	RepeatMasker - Mask repetitive DNA

	SYNOPSIS
	RepeatMasker [-options] <seqfiles(s) in fasta format>

	DESCRIPTION
	The options are:

	-h(elp)
	Detailed help

	Default settings are for masking all type of repeats in a primate
	sequence.

	-e(ngine) [crossmatch|wublast|abblast|ncbi|hmmer|decypher]
	Use an alternate search engine to the default.

	-pa(rallel) [number]
	The number of processors to use in parallel (only works for batch
	files or sequences over 50 kb)

	-s Slow search; 0-5% more sensitive, 2-3 times slower than default

	-q Quick search; 5-10% less sensitive, 2-5 times faster than default

	-qq Rush job; about 10% less sensitive, 4->10 times faster than default
	(quick searches are fine under most circumstances)

	Repeat options

	-nolow /-low
	Does not mask low_complexity DNA or simple repeats

	-noint /-int
	Only masks low complex/simple repeats (no interspersed repeats)

	-norna
	Does not mask small RNA (pseudo) genes

	-alu
	Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)

	-div [number]
	Masks only those repeats < x percent diverged from consensus seq

	-lib [filename]
	Allows use of a custom library (e.g. from another species)

	-cutoff [number]
	Sets cutoff score for masking repeats when using -lib (default 225)

	-species <query species>
	Specify the species or clade of the input sequence. The species name
	must be a valid NCBI Taxonomy Database species name and be contained
	in the RepeatMasker repeat database. Some examples are:

	-species human
	-species mouse
	-species rattus
	-species "ciona savignyi"
	-species arabidopsis

	Other commonly used species:

	mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,
	danio, "ciona intestinalis" drosophila, anopheles, elegans,
	diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize

	Contamination options

	-is_only
	Only clips E coli insertion elements out of fasta and .qual files

	-is_clip
	Clips IS elements before analysis (default: IS only reported)

	-no_is
	Skips bacterial insertion element check

	Running options

	-gc [number]
	Use matrices calculated for 'number' percentage background GC level

	-gccalc
	RepeatMasker calculates the GC content even for batch files/small
	seqs

	-frag [number]
	Maximum sequence length masked without fragmenting (default 60000,
	300000 for DeCypher)

	-nocut
	Skips the steps in which repeats are excised

	-noisy
	Prints search engine progress report to screen (defaults to .stderr
	file)

	-nopost
	Do not postprocess the results of the run ( i.e. call ProcessRepeats
	). NOTE: This options should only be used when ProcessRepeats will
	be run manually on the results.

	output options

	-dir [directory name]
	Writes output to this directory (default is query file directory,
	"-dir ." will write to current directory).

	-a(lignments)
	Writes alignments in .align output file

	-inv
	Alignments are presented in the orientation of the repeat (with
	option -a)

	-lcambig
	Outputs ambiguous DNA transposon fragments using a lower case name.
	All other repeats are listed in upper case. Ambiguous fragments
	match multiple repeat elements and can only be called based on
	flanking repeat information.

	-small
	Returns complete .masked sequence in lower case

	-xsmall
	Returns repetitive regions in lowercase (rest capitals) rather than
	masked

	-x Returns repetitive regions masked with Xs rather than Ns

	-poly
	Reports simple repeats that may be polymorphic (in file.poly)

	-source
	Includes for each annotation the HSP "evidence". Currently this
	option is only available with the "-html" output format listed
	below.

	-html
	Creates an additional output file in xhtml format.

	-ace
	Creates an additional output file in ACeDB format

	-gff
	Creates an additional Gene Feature Finding format output

	-u Creates an additional annotation file not processed by
	ProcessRepeats

	-xm Creates an additional output file in cross_match format (for
	parsing)

	-fixed
	Creates an (old style) annotation file with fixed width columns

	-no_id
	Leaves out final column with unique ID for each element (was
	default)

	-e(xcln)
	Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in
	the query

	SEE ALSO
	Crossmatch, ProcessRepeats

	COPYRIGHT
	Copyright 2007-2012 Arian Smit, Institute for Systems Biology

	AUTHORS
	Arian Smit <asmit@systemsbiology.org>

	Robert Hubley <rhubley@systemsbiology.org>
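
As a rough sketch of how a few of the options listed above can be combined, the script below wraps a typical masking run. The flags (-species, -pa, -xsmall, -gff, -dir) are taken from the help text; the file name genome.fasta, the output directory rm_out, the processor count, and the helper names run and mask_genome are placeholders made up for illustration. By default it only prints the commands (a dry run), so nothing is executed until DRY_RUN is set to 0.

```shell
#!/bin/sh
# Sketch only: genome.fasta, rm_out and the processor count are placeholders.
# DRY_RUN=1 (the default here) prints each command instead of executing it.

run() {
    # Echo the command line; execute it only when DRY_RUN is set to 0.
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

mask_genome() {
    # $1 = FASTA file, $2 = species or clade name, $3 = output directory,
    # $4 = number of processors (optional, defaults to 1)
    run mkdir -p "$3"
    # -xsmall: repeats in lowercase instead of Ns; -gff: extra GFF output
    run RepeatMasker -species "$2" -pa "${4:-1}" -xsmall -gff -dir "$3" "$1"
}

mask_genome genome.fasta arabidopsis rm_out 4
```

Saved as, say, mask.sh (a hypothetical name), this could be run for real with `DRY_RUN=0 sh mask.sh` once RepeatMasker is on the PATH. Soft masking with -xsmall keeps the original bases in lowercase, which many downstream tools can still read; drop it if Ns are preferred.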

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.4 years ago by Andrzej Zielezinski 11k

it says RepeatMasker : Command not found

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by ksi216 ▴ 80

It seems the RepeatMasker directory is not on your PATH. Try running it with its full path: /usr/local/RepeatMasker/RepeatMasker -h.

ADD REPLY • link 9.4 years ago by Andrzej Zielezinski 11k

now it says no such file or directory. did I install it wrong?

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by ksi216 ▴ 80

http://postimg.org/image/us5q6444t/

ADD REPLY • link 9.4 years ago by ksi216 ▴ 80

Okay, stay in ~/BI7533/RepeatMasker and run it like this: ./RepeatMasker -h
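
To avoid typing ./ or the full path every time, the install directory can be added to PATH. The sketch below assumes a Bourne-style shell (sh, bash) and uses the directory from this thread as its default; RM_DIR and the helper add_to_path are names made up for illustration. The "Command not found" wording suggests the shell may actually be csh or tcsh, in which case something like `setenv PATH ${PATH}:$HOME/BI7533/RepeatMasker` would be the likely equivalent.

```shell
#!/bin/sh
# Directory that contains the RepeatMasker script. The default is the
# location mentioned in this thread; adjust it for other installations.
RM_DIR="${RM_DIR:-$HOME/BI7533/RepeatMasker}"

add_to_path() {
    # Append directory $1 to PATH for the current session, only once.
    case ":$PATH:" in
        *":$1:"*) ;;                        # already present, do nothing
        *) PATH="$PATH:$1"; export PATH ;;
    esac
}

add_to_path "$RM_DIR"

# command -v prints the resolved location if the shell can find the program.
if command -v RepeatMasker >/dev/null 2>&1; then
    echo "RepeatMasker found at: $(command -v RepeatMasker)"
else
    echo "RepeatMasker not found; check that $RM_DIR/RepeatMasker exists and is executable" >&2
fi
```

The change only lasts for the current session. Putting the equivalent `export PATH=...` line into a startup file such as ~/.bashrc (for bash) would make it permanent.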

ADD REPLY • link 9.4 years ago by Andrzej Zielezinski 11k

Hello and thanks for the nice software (I assume you are the developer of it?).

Can I run RepeatMasker on an already masked FASTA sequence file? It was, presumably, masked with RepeatMasker by the ENSEMBL people. I only want to do this because I want to obtain a gff/gtf file of these masked sequences, which would normally have been produced by the ENSEMBL people when they ran RepeatMasker on the toplevel genome assembly, but sadly they don't provide it on their FTP servers.

Or do I have to re-do it from the very beginning: from the vanilla toplevel genome assembly??

ADD REPLY • link 16 months ago by e.r.zakiev ▴ 260

Hi! I am not the developer of RepeatMasker. Unfortunately, ENSEMBL does not provide information on the masked sequences. To get it you would need to re-run RepeatMasker on the top-level assembly genome.
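
If only the coordinates of the masked regions are needed, and not the repeat names or classes, one possible shortcut is to read them straight from a soft-masked FASTA file, where repeats are in lowercase. This is a minimal sketch under that assumption, not a replacement for re-running RepeatMasker: it reports every lowercase run as a BED interval and cannot tell which repeat family it belongs to. The function name masked_to_bed is made up for illustration.

```shell
#!/bin/sh
# Sketch: list lowercase (soft-masked) runs in a FASTA file as BED intervals
# (sequence name, 0-based start, end-exclusive). Assumes soft masking.

masked_to_bed() {
    # $1 = soft-masked FASTA file. LC_ALL=C keeps [a-z] strictly lowercase.
    LC_ALL=C awk '
        function emit() {
            # Close the current lowercase run, if one is open.
            if (start >= 0) { print name "\t" start "\t" pos; start = -1 }
        }
        BEGIN { start = -1 }
        /^>/ {
            emit()                      # a run may end at the sequence end
            name = substr($1, 2)        # header up to the first whitespace
            pos = 0
            next
        }
        {
            n = length($0)
            for (i = 1; i <= n; i++) {
                if (substr($0, i, 1) ~ /[a-z]/) {
                    if (start < 0) start = pos
                } else {
                    emit()
                }
                pos++
            }
        }
        END { emit() }
    ' "$1"
}

# Run on a file only when one is given on the command line.
if [ "$#" -ge 1 ]; then
    masked_to_bed "$1"
fi
```

Walking through a genome one character at a time in awk is slow for large assemblies, so this is best treated as a quick check. The BED intervals can be converted to GFF afterwards if that format is required.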

ADD REPLY • link 15 months ago by Andrzej Zielezinski 11k