Hello, I would like some guidance on assembling and annotating a eukaryotic genome using only short reads.
Quick background: I'm fairly new to bioinformatics and no one in my lab is a computational person, so I'm really just trying to figure this out as I go. My project will involve assembling the genomes of a couple of non-model plants. I don't have the plants in my possession yet, so I'm practicing with available data from SRA. Conveniently, a team in China has done short-read sequencing on one of the plants I intend to study. Unfortunately, there is no long-read data available, but I will do long-read sequencing when I actually have my plants.
I assembled the short-read data using ABySS, and the summary stats from Quast and ABySS are decent, I think? N50: 16 kb, L50: 10k, N75: 7 kb, L75: 25k, total length: 627 Mb (close to the expected genome size).
Now that I have this assembly, I would like to do masking and annotation. For masking I want to use RepeatModeler and RepeatMasker, but when I tried to run RepeatModeler, the job finished in minutes, which seems wrong? I did get a warning that the N50 was 7 kb, and it didn't like that. I'm not sure why that N50 differs from the one Quast and ABySS report, though. RepeatModeler also finished after only doing round 1. Should I go about constructing a repeat library and masking in a different way? I know masking an assembly this fragmented won't be ideal, but I'm not sure if there's something else I should do.
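On the N50 mismatch: one possible explanation (an assumption, nothing in this thread confirms it) is that the tools apply different minimum contig length cutoffs before computing the statistic. QUAST, for example, ignores contigs below its `--min-contig` threshold, which I believe is 500 bp by default. Dropping short contigs raises N50, so a tool that counts every contig would report a lower value. Here is a minimal Python sketch for checking this yourself; the function names `fasta_lengths` and `nx_stats` are made up for illustration.

```python
def fasta_lengths(lines):
    """Return the length of each sequence in an iterable of FASTA lines."""
    lengths = []
    current = None  # stays None until the first header line is seen
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def nx_stats(lengths, fraction=0.5, min_length=0):
    """Return (Nx, Lx), ignoring contigs shorter than min_length.

    fraction=0.5 gives N50/L50, fraction=0.75 gives N75/L75.
    """
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return 0, 0
    target = sum(kept) * fraction
    running = 0
    for count, n in enumerate(kept, start=1):
        running += n
        if running >= target:
            return n, count
    return kept[-1], len(kept)


# Toy example: two long contigs plus ten very short ones.
toy = [10, 5] + [2] * 10
print(nx_stats(toy))                # all contigs counted
print(nx_stats(toy, min_length=5))  # short contigs excluded first
```

To try it on a real assembly, open the FASTA file and pass the handle to `fasta_lengths`, then call `nx_stats` with a few different `min_length` values (0, 500, 1000) and see which one matches each tool's report.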
For annotation I'm going to use BRAKER2. There is a high-quality assembly of a related plant (last common ancestor ~80 Mya), so I think that should help a lot.
Any suggestions on how I should move forward are very welcome and appreciated, thanks!
This is so so helpful, thank you very much!
No worries! I just realised you might not need to run TEsorter. All it does is change the labels/names for the TEs EDTA found. It's useful if you want to analyse what TEs you have, but not useful if you just want to repeatmask your genome.
Hey there, do you know how long EDTA normally takes to run? I started my run on Wednesday, but I've been stuck here since:
Thanks! It looks like I'm using about 40 CPUs currently.
You may be running into this issue: https://github.com/oushujun/EDTA/issues/225
I have, in the past, used a hacky trick: I merged all contigs into fewer than 10 fake pseudomolecules with Ns between the contigs and ran EDTA on that. Then I ran RepeatMasker using the TE library from the hacky run, but against the original, 'correct' contigs. The de novo TE finding step doesn't care about positions. Some of the EDTA tools make a file for each contig, which can result in millions of files and slow everything down. With the hacky trick there are not many files.
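A rough sketch of that merging trick in Python, in case it helps. The number of groups, the spacer length (100 Ns here), the round-robin assignment, and the `pseudo_N` names are all assumptions for illustration, not what was actually used in the run described above.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def build_pseudomolecules(records, n_groups=8, gap=100):
    """Join contigs into at most n_groups sequences separated by runs of N.

    Contigs are assigned round-robin, so their order and position in the
    output carry no biological meaning.
    """
    groups = [[] for _ in range(n_groups)]
    for i, (_, seq) in enumerate(records):
        groups[i % n_groups].append(seq)
    spacer = "N" * gap
    return [
        (f"pseudo_{k + 1}", spacer.join(seqs))
        for k, seqs in enumerate(groups)
        if seqs
    ]


def format_fasta(records, width=60):
    """Yield FASTA-formatted lines, wrapping sequences at `width` columns."""
    for name, seq in records:
        yield f">{name}"
        for start in range(0, len(seq), width):
            yield seq[start:start + width]


# Hypothetical usage (file names are placeholders):
# with open("contigs.fa") as src, open("pseudo.fa", "w") as dst:
#     merged = build_pseudomolecules(parse_fasta(src), n_groups=8)
#     dst.write("\n".join(format_fasta(merged)) + "\n")
```

This holds the whole assembly in memory, which should be fine for a ~600 Mb genome on most servers. The merged file is only for building the TE library; masking is still done on the original contigs.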
You could also try removing the smaller contigs (shorter than 1 kbp, perhaps?).
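If you don't have a tool handy for that, here is a minimal Python sketch of a length filter. The 1000 bp cutoff is just the value floated above, not a tested recommendation, and the function names are made up for illustration.

```python
def parse_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_contigs(records, min_length=1000):
    """Split records into those kept and a summary of what was dropped.

    Returns (kept, dropped_count, dropped_bases), where kept is a list of
    (name, sequence) pairs with len(sequence) >= min_length.
    """
    kept = []
    dropped_count = 0
    dropped_bases = 0
    for name, seq in records:
        if len(seq) >= min_length:
            kept.append((name, seq))
        else:
            dropped_count += 1
            dropped_bases += len(seq)
    return kept, dropped_count, dropped_bases


# Hypothetical usage (file names are placeholders):
# with open("contigs.fa") as src:
#     kept, n_dropped, bp_dropped = filter_contigs(parse_fasta(src), 1000)
# with open("contigs.min1kb.fa", "w") as dst:
#     for name, seq in kept:
#         dst.write(f">{name}\n{seq}\n")
# print(f"dropped {n_dropped} contigs ({bp_dropped} bp)")
```

It's worth looking at how many bases get dropped before committing to a cutoff, since in a fragmented assembly the short contigs can add up to a meaningful share of the genome.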
Thanks for the suggestion! I can see that the .mod.harvest directory is still having files written to it, so I think the issue is probably coming from how fragmented the assembly is. I'll remove the smaller contigs first and rerun. Then I'll try concatenating things if that doesn't help. Thank you!