Greetings. I tried to use RepeatMasker to identify repeat sequences. Judging by the log file, the run completed, but the overall result looks odd to me:
==================================================
file name: asm.contigs.filtered.fasta
sequences: 1499
total length: 638024656 bp (638024656 bp excl N/X-runs)
GC level: 27.03 %
bases masked: 36845866 bp ( 5.77 %)
==================================================
                      number of        length   percentage
                      elements*      occupied  of sequence
--------------------------------------------------
Retroelements                 0          0 bp       0.00 %
  SINEs:                      0          0 bp       0.00 %
  Penelope                    0          0 bp       0.00 %
  LINEs:                      0          0 bp       0.00 %
    CRE/SLACS                 0          0 bp       0.00 %
    L2/CR1/Rex                0          0 bp       0.00 %
    R1/LOA/Jockey             0          0 bp       0.00 %
    R2/R4/NeSL                0          0 bp       0.00 %
    RTE/Bov-B                 0          0 bp       0.00 %
    L1/CIN4                   0          0 bp       0.00 %
  LTR elements:               0          0 bp       0.00 %
    BEL/Pao                   0          0 bp       0.00 %
    Ty1/Copia                 0          0 bp       0.00 %
    Gypsy/DIRS1               0          0 bp       0.00 %
      Retroviral              0          0 bp       0.00 %
DNA transposons               0          0 bp       0.00 %
  hobo-Activator              0          0 bp       0.00 %
  Tc1-IS630-Pogo              0          0 bp       0.00 %
  En-Spm                      0          0 bp       0.00 %
  MuDR-IS905                  0          0 bp       0.00 %
  PiggyBac                    0          0 bp       0.00 %
  Tourist/Harbinger           0          0 bp       0.00 %
  Other (Mirage,              0          0 bp       0.00 %
  P-element, Transib)
Rolling-circles               0          0 bp       0.00 %
Unclassified:                 0          0 bp       0.00 %
Total interspersed repeats:              0 bp       0.00 %
Small RNA:                 1043    2107641 bp       0.33 %
Satellites:                   0          0 bp       0.00 %
Simple repeats:          540298   28134835 bp       4.41 %
Low complexity:          130176    6603390 bp       1.03 %
==================================================
* most repeats fragmented by insertions or deletions
have been counted as one element
The query species was assumed to be phormia
RepeatMasker Combined Database: Dfam_3.1
run with rmblastn version 2.10.0+
Here is my script:
#!/bin/bash
#SBATCH --qos pq_mdegenna
#SBATCH --account iacc_mdegenna
#SBATCH --partition IB_16C_96G
#SBATCH -n 16
#SBATCH -N 1
#SBATCH --output=log
module load RepeatMasker-4.1.0
RepeatMasker -qq -pa 30 -species Phormia /scratch/mdegenna/slin023/star/asm.contigs.filtered.fasta
Did I do anything wrong, and what does this result suggest? A eukaryotic genome should contain tons of repeat sequences. Any feedback and suggestions are welcome.
It looks like you're using the repeats from the Dfam database. Is your genome from a species that is present in Dfam (or closely related to one)? If not, the evolutionary distance between your species and the species in Dfam may be making it hard to identify your TEs accurately.
When I run RepeatMasker on a new genome, I typically first build a library of repeats from the genome assembly and use that for repeat masking instead of relying on a prebuilt database.
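The build-then-mask approach described above could look roughly like the sketch below. It assumes RepeatModeler (with its BuildDatabase helper) as the de novo library builder, which the reply does not name, so treat that choice as an assumption. The database name asm_db and the PARALLEL value are hypothetical placeholders, and option names can differ between versions, so check the help output of your installed tools. By default the script only prints the commands (dry run); set DRY_RUN=0 to execute them.

```shell
#!/bin/bash
# Sketch only: build a de novo repeat library from the assembly, then mask with it.
# Assumes BuildDatabase, RepeatModeler and RepeatMasker are on PATH when DRY_RUN=0.
set -euo pipefail

GENOME="${GENOME:-asm.contigs.filtered.fasta}"  # assembly from the original post
DB_NAME="${DB_NAME:-asm_db}"                    # hypothetical database name
PARALLEL="${PARALLEL:-4}"                       # hypothetical; keep within your SLURM allocation
DRY_RUN="${DRY_RUN:-1}"                         # 1 = print commands, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_and_mask() {
    # 1. Index the assembly for RepeatModeler.
    run BuildDatabase -name "$DB_NAME" "$GENOME"
    # 2. Discover repeat families de novo; writes <DB_NAME>-families.fa.
    run RepeatModeler -database "$DB_NAME"
    # 3. Mask with the custom library instead of -species.
    run RepeatMasker -pa "$PARALLEL" -lib "${DB_NAME}-families.fa" "$GENOME"
}

build_and_mask
```

Passing -lib replaces the -species lookup, so masking no longer depends on whether the organism has entries in the prebuilt database.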