I tried to repeat-mask a plant genome for an analysis using the following command:
./RepeatMasker -species <species name> -s -a -poly -dir <output path> <path to>/genomefile.fna
When I annotate the masked and the unmasked genome, I get the same number of genes, so I suspect that repeat masking is not being carried out properly.
Repeat masking will not remove genes, to my knowledge. For that, you'll need to reannotate the masked file (or did you do this?).
Generally, repeat masking annotates repetitive DNA regions and either soft-masks them (replaces them with lower-case letters by convention, e.g. atgc instead of ATGC) or hard-masks them (replaces ATGC with NNNN).
Try running GMAP with a transcript set on your original and hard-masked outputs if you want to do a comparison. I wouldn't expect the number of genes/transcripts to change much, since the intergenic DNA is more likely to be repeat-masked.
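To make the soft vs. hard distinction concrete, here is a minimal sketch on a toy FASTA record (the file names and the sequence are made up for illustration). It converts a soft-masked file into a hard-masked one with sed, which is one common way to do it, not something RepeatMasker itself requires:

```shell
# Toy soft-masked FASTA: lowercase bases mark an annotated repeat (made-up data).
tmpdir=$(mktemp -d)
printf '>chr1 toy record\nACGTacgtacgtACGT\nGGCCttaaGGCC\n' > "$tmpdir/soft.fa"

# Hard-mask: replace every lowercase base with N, leaving header lines untouched.
sed '/^>/!s/[acgtn]/N/g' "$tmpdir/soft.fa" > "$tmpdir/hard.fa"

cat "$tmpdir/hard.fa"
```

The reverse direction is not possible: once bases are replaced with N, the original sequence is gone, which is why soft-masked output is often the more flexible starting point.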
Thank you for the reply, and sorry for not framing my question properly.
I was trying to reduce the number of genes in the genome by masking the repeat regions. I used the complete Dfam library, not the small stock one. To my surprise, the annotation files before and after masking show the same number of genes. I wonder whether I'm missing a step or whether there is a mistake in the way I'm doing it.
Which tool are you using for gene prediction? It may be that it does not take the masking into account, or that you need a different kind of masking, for instance soft vs. hard masking (as @colindaven has said).
Thank you for the quick response. I'm using Augustus for gene prediction.
Hmm, that one should be able to take masking into account when doing prediction. What kind of masking did you do in your RepeatMasker run? I assume hard masking, as that is the default for RepeatMasker.
Did you provide the .masked version of your FASTA file as input for the gene-prediction run (after masking)?
Yes, I went with the default, and I used the masked version as input for gene prediction.
Is there any information in the runtime output of Augustus indicating that it has taken the repeat information into account?
Which database did you use to mask the genome? What does the RepeatMasker output say about how much sequence it could identify as repeats?
I used the complete Dfam database. I'm not aware of the runtime information; I will check if you can tell me which file it would be in.
Where did you add the Dfam DB in your command line?
There should be a report file or similar created by RepeatMasker in the folder where you ran the command (or what is printed to the screen when you run it?).
The path to the Dfam database was given while configuring RepeatMasker.
The files created had the extensions .align, .masked, .polyout, .genomic.fasta.tbl and genomic.fasta.out. The .tbl file looked like a report, but all of the repetitive elements listed showed 0%, except one unclassified: 264bp.
Well, there you sort of have your answer: RepeatMasker was apparently not able to mask anything.
I usually create a custom, species-specific library to do the screening with.
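A quick way to cross-check what the .tbl report says is to count masked bases directly in the masked FASTA. The sketch below uses a made-up toy file in place of the real .masked output; it counts N as hard-masked and lowercase acgt as soft-masked, which is an assumption about the masking convention used:

```shell
# Toy masked FASTA standing in for the real *.masked file (made-up data).
masked_fa=$(mktemp)
printf '>chr1\nACGTNNNNNNNNACGT\nGGCCttaaGGCC\n' > "$masked_fa"

# Count total, hard-masked (N) and soft-masked (lowercase) bases, skipping headers.
summary=$(LC_ALL=C awk '!/^>/ {
    total += length($0)
    line = $0; hard += gsub(/[Nn]/, "", line)
    line = $0; soft += gsub(/[acgt]/, "", line)
} END {
    pct = (total ? 100 * (hard + soft) / total : 0)
    printf "total=%d hard=%d soft=%d masked_pct=%.2f", total, hard, soft, pct
}' "$masked_fa")

echo "$summary"
```

Note that N bases already present in the assembly (gaps) are counted as hard-masked too, so comparing the figure for the unmasked and the masked file is more informative than the absolute number.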
It's probably not a real solution, but I created a quick and dirty alternative to RepeatMasker for hard masking reference genomes here:
https://github.com/colindaven/blacklister
You could try that with the Dfam database ... at least for testing.
I'm thinking of making a custom TE library using RepeatModeler and using that to run RepeatMasker.
Hope that works.
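For reference, a rough sketch of what that workflow could look like, wrapped in a shell function so that nothing runs until it is called. The function name, database name and output directory are hypothetical, and it assumes BuildDatabase, RepeatModeler and RepeatMasker are on the PATH; check the options against your installed versions before relying on it:

```shell
# Sketch only: de novo repeat library followed by masking with that library.
# build_lib_and_mask is a hypothetical helper name, not part of any tool.
build_lib_and_mask() {
    genome=$1      # e.g. genomefile.fna
    db_name=$2     # hypothetical database name, e.g. my_plant
    out_dir=$3     # output directory for RepeatMasker

    # 1. Build the sequence database that RepeatModeler works on.
    BuildDatabase -name "$db_name" "$genome" || return 1

    # 2. Discover repeat families de novo (can take a long time on a plant genome).
    RepeatModeler -database "$db_name" || return 1

    # 3. Mask with the custom library instead of -species.
    #    -xsmall soft-masks (lowercase) instead of the default N hard-masking.
    RepeatMasker -lib "${db_name}-families.fa" -xsmall -dir "$out_dir" "$genome"
}

# Example call (not executed here):
# build_lib_and_mask genomefile.fna my_plant masked_out
```

Whether to keep -xsmall depends on the gene predictor: drop it to get hard-masked output, or keep it if the predictor is run in a mode that understands lowercase soft-masking.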
What I was able to understand is that the current version of Dfam (3.3) doesn't have a large library of plant TEs. RepBase has a much larger selection of plant TE families, but it requires a subscription (for those who can get it, it will be an easy solution).