Hello!
I have a repeat-masked file generated with RepeatMasker, with the repeats in lowercase (soft-masked). I am planning to use Blast2GO for gene prediction and annotation. Can I use the file with the repeats in lowercase, or do I need to convert the repeats from lowercase to Ns?
I would personally advise using the sequence with lowercase (soft) masking, as this keeps some information that hard masking throws away. On the other hand, I'm not sure that all gene prediction software understands soft masking.
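If a downstream tool turns out to need hard masking, the conversion from lowercase to Ns is simple to do yourself. Below is a minimal Python sketch, assuming a plain FASTA file in which lowercase letters mark repeats; the function name `hard_mask` is made up for illustration and is not part of RepeatMasker or Blast2GO.

```python
import io


def hard_mask(fasta_in, fasta_out):
    """Copy a FASTA stream, replacing lowercase (soft-masked) bases with N.

    Hypothetical helper for illustration only.
    """
    for line in fasta_in:
        if line.startswith(">"):
            # Header lines are copied unchanged, even if they contain lowercase.
            fasta_out.write(line)
        else:
            # Uppercase bases and the trailing newline pass through untouched.
            fasta_out.write("".join("N" if c.islower() else c for c in line))


# Small in-memory demonstration; with real data, pass open file handles instead.
demo_in = io.StringIO(">seq1 test\nACGTacgtACGT\nnnAC\n")
demo_out = io.StringIO()
hard_mask(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

Keeping the soft-masked file as the original and generating the hard-masked copy only when needed means no information is lost.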
Moreover, Blast2GO is not a gene prediction tool; it is only used to assign GO labels to already predicted/annotated genes. You will first need to run a true gene prediction tool such as Augustus, EuGene, GeneMark, ...
You can also try https://www.girinst.org/censor/ for repeat masking of your sequence with available templates from the species of interest.
If you have repeats, you need to find out whether they are terminal inverted repeats (TIRs) with target site duplications (TSDs), or palindromes. A TIR with a TSD is probably a signal of a DNA transposon (autonomous or non-autonomous); if the element is small, around 50-500 bp, it would be a MITE (Miniature Inverted-repeat Transposable Element). First try to annotate which kinds of repeats you have found in your sequence; you can use tools like einverted from EMBOSS. Then go for the gene model after finding the correct protein frame with ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/) and Splign (https://www.ncbi.nlm.nih.gov/sutils/splign/splign.cgi?textpage=online&level=form) etc. You can also use NCBI BLASTN to find orthologous genes among available gene models; use discontiguous megablast for more dissimilar sequences.
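To make the TIR idea concrete: an element has terminal inverted repeats when its start matches the reverse complement of its end. The Python sketch below only checks for a perfect TIR on a single candidate element; the function names are hypothetical, it does not look for TSDs in the flanking sequence, and it is not a substitute for einverted, which also handles mismatches and gaps.

```python
# Translation table for complementing DNA bases (both cases).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def tir_length(element):
    """Length of the perfect terminal inverted repeat of `element`.

    Hypothetical helper: the start of the element is compared with the
    reverse complement of its end, base by base, until the first mismatch.
    The result is capped at half the element length.
    """
    seq = element.upper()
    rc = reverse_complement(seq)
    n = 0
    while n < len(seq) // 2 and seq[n] == rc[n]:
        n += 1
    return n


# Toy element: an 8 bp TIR flanking a 10 bp internal region.
left = "GGCCAGTC"
element = left + "AAAAATTTAC" + reverse_complement(left)
print(tir_length(element))     # 8
print(tir_length("AAAACCCC"))  # 0, no inverted repeat at the ends
```

Note that a short palindrome such as GAATTC also gives a non-zero result here, so element length and the presence of a TSD still have to be checked separately.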
Blast2GO PRO has a built-in gene prediction function that uses Augustus.
Thank you so much for the guidance.