Question

Eukaryotic assembly and annotation with only short-reads, no reference available.

3.1 years ago

grevanksi • 0

Hello, I would like some guidance on assembling and annotating a eukaryotic genome using only short reads.

Quick background: I'm fairly new to bioinformatics and no one in my lab is a computational person, so I'm really just figuring this out as I go. My project will involve assembling the genomes of a couple of non-model plants. I don't have the plants in my possession yet, so I'm practicing with data available from SRA. Conveniently, a team in China has done short-read sequencing on one of the plants I intend to study. Unfortunately, there isn't any long-read data available, but I will do long-read sequencing once I actually have my plants.

I assembled the short-read data using ABySS, and the summary stats from Quast and ABySS are decent, I think? N50: 16 kb; L50: 10k; N75: 7 kb; L75: 25k; total length: 627 Mb (close to the expected genome size).

Now that I have this assembly, I would like to do masking and annotation. For masking I want to use RepeatModeler and RepeatMasker, but when I tried to run RepeatModeler, the job finished in minutes, which seems wrong. I did get a warning that the N50 was 7 kb, which it didn't like; I'm not sure why that N50 differs from the one reported by Quast and ABySS. RepeatModeler also finished after only round 1. Should I construct the repeat library and do the masking in a different way? I know masking an assembly this fragmented won't be ideal, but I'm not sure whether there's something else I should do.
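One possible reason for the N50 mismatch (an assumption, not a diagnosis of this particular run) is that the tools compute N50 over different sets of sequences, for example after dropping contigs below a minimum length. The minimal Python sketch below uses made-up contig lengths to show how the cutoff alone can shift N50; the function name `n50` and the `min_len` parameter are illustrative and not taken from any of the tools mentioned.

```python
def n50(lengths, min_len=0):
    """Return (N50, L50) over the contigs that are at least min_len bases long."""
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    if not kept:
        return 0, 0
    half = sum(kept) / 2
    running = 0
    # Walk from the longest contig down until half of the total length is covered.
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= half:
            return length, count


# Made-up contig lengths in bp, for illustration only.
lengths = [20000, 16000, 7000, 3000] + [400] * 100

print(n50(lengths))               # every contig counted
print(n50(lengths, min_len=500))  # contigs under 500 bp dropped first
```

With these toy numbers the many 400 bp contigs pull N50 down from 16000 to 7000 when they are included, so checking which minimum-length filter each tool applies is a reasonable first step.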

For annotation I'm going to use BRAKER2. There is a high-quality assembly of a plant that shares a last common ancestor (LCA) with mine ~80 Mya, so I think that should help a lot.

Any suggestions on how I should move forward are very welcome and appreciated, thanks!

masking genome genomics assembly short-reads annotation • 2.3k views

ADD COMMENT • link 3.1 years ago by grevanksi • 0

score 3 · Answer 1 · 2022-04-05

3.1 years ago

Philipp Bayer 8.8k

Your assembly stats look OK to me.

I've had problems with RepeatModeler when the genome was highly fragmented, which may be your case. For repeat modeling, I've had good experiences with EDTA: https://github.com/oushujun/EDTA

I run my assembly through EDTA following the divide-and-conquer approach, so it's slightly faster (https://github.com/oushujun/EDTA#divide-and-conquer). Then I assign known TE classes to the TEs using TEsorter, and then I use RepeatMasker to get a softmasked assembly for BRAKER (-xsmall in RepeatMasker).

I run BRAKER with --softmasking to tell it that I softmasked the repeats. It's described here: https://github.com/Gaius-Augustus/BRAKER/issues/348
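The masking and annotation hand-off described above can be sketched as two command lines. The Python below only builds and prints them; it does not run anything. The file names (`assembly.fa`, `te_library.fa`, `relative_proteins.fa`) and the thread count are placeholders, and the flags (`-xsmall`, `-lib`, `-pa` for RepeatMasker; `--genome`, `--prot_seq`, `--softmasking` for braker.pl) should be checked against the versions you have installed.

```python
import shlex


def repeatmasker_cmd(genome, te_library, threads=8):
    """Build a RepeatMasker call that softmasks `genome` using a custom TE library."""
    return [
        "RepeatMasker",
        "-xsmall",            # lowercase (soft) masking instead of replacing with N
        "-lib", te_library,   # custom library, e.g. the one produced by EDTA
        "-pa", str(threads),  # number of parallel jobs
        genome,
    ]


def braker_cmd(masked_genome, proteins):
    """Build a braker.pl call that uses protein evidence on a softmasked genome."""
    return [
        "braker.pl",
        f"--genome={masked_genome}",
        f"--prot_seq={proteins}",
        "--softmasking",      # tells BRAKER the lowercase bases are masked repeats
    ]


# Placeholder file names, for illustration only.
print(shlex.join(repeatmasker_cmd("assembly.fa", "te_library.fa")))
print(shlex.join(braker_cmd("assembly.fa.masked", "relative_proteins.fa")))
```

RepeatMasker normally writes the masked sequence next to the input as `<input>.masked`, which is why the second command points at `assembly.fa.masked`.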

With one or two plant genomes I had too many gene models after BRAKER finished. In those cases I used the AUGUSTUS and GeneMark models trained by BRAKER in a separate MAKER run, together with other external evidence I had. In your case you might be able to get away with using just the relative's proteins as external evidence in BRAKER.

Good luck, you can do this!!

ADD COMMENT • link 3.1 years ago by Philipp Bayer 8.8k

This is so so helpful, thank you very much!

ADD REPLY • link 3.1 years ago by grevanksi • 0

No worries! I just realised you might not need to run TEsorter: all it does is change the labels/names of the TEs that EDTA found. It's useful if you want to analyse which TEs you have, but not if you just want to repeat-mask your genome.

ADD REPLY • link 3.1 years ago by Philipp Bayer 8.8k

Hey there, do you know how long EDTA normally takes to run? I started my run on Wednesday, but I've been stuck here since:

Wed Apr  6 13:56:31 CDT 2022    Identify LTR retrotransposon candidates from scratch.

Thanks! It looks like I'm using about 40 CPUs currently.

ADD REPLY • link 3.1 years ago by grevanksi • 0

You may be running into this issue: https://github.com/oushujun/EDTA/issues/225

In the past I've used a hacky trick: I merged all contigs into a set of fewer than 10 fake pseudomolecules with Ns between the contigs, ran EDTA on that, and then ran RepeatMasker with the TE database from the hacky run but on the original, 'correct' contigs. The de novo TE-finding step doesn't care about positions. Some of the EDTA tools write a file for each contig, which can result in millions of files and slow everything down. With the hacky trick there aren't many files.
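The pseudomolecule trick can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the script used for the runs described above: the function names, the round-robin grouping, the default of 8 groups, and the 1000-N spacer are all arbitrary choices made for the example.

```python
def read_fasta(path):
    """Yield (name, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Keep only the first word of the header as the name.
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def merge_contigs(records, n_groups=8, spacer="N" * 1000):
    """Spread contigs round-robin over n_groups pseudomolecules joined by spacer."""
    groups = [[] for _ in range(n_groups)]
    for i, (_, seq) in enumerate(records):
        groups[i % n_groups].append(seq)
    return [(f"pseudo_{i + 1}", spacer.join(seqs))
            for i, seqs in enumerate(groups) if seqs]


def write_fasta(records, path, width=60):
    """Write (name, sequence) pairs as FASTA, wrapping sequence lines at width."""
    with open(path, "w") as out:
        for name, seq in records:
            out.write(f">{name}\n")
            for start in range(0, len(seq), width):
                out.write(seq[start:start + width] + "\n")


# Tiny in-memory demo with made-up sequences and a short spacer.
demo = [("c1", "ACGT"), ("c2", "GGCC"), ("c3", "TTAA")]
for name, seq in merge_contigs(demo, n_groups=2, spacer="NN"):
    print(name, seq)
```

For a real assembly, something like `write_fasta(merge_contigs(read_fasta("assembly.fa")), "pseudo.fa")` would produce the merged file for the de novo step (both file names are placeholders). The resulting TE library is then used to mask the original contigs, since coordinates in the merged file are meaningless.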

You could also try removing smaller contigs (< 1 Kbp length?).
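Before dropping short contigs, it can help to see how much sequence a given cutoff would discard. The sketch below is a hypothetical helper, not part of EDTA or any other tool named in this thread; it works on `(name, sequence)` pairs from whatever FASTA reader you prefer, and the 1000 bp default simply mirrors the 1 Kbp suggestion above.

```python
def filter_by_length(records, min_len=1000):
    """Keep only (name, sequence) records with at least min_len bases."""
    return [(name, seq) for name, seq in records if len(seq) >= min_len]


def retention_summary(records, min_len=1000):
    """Report how many contigs and what fraction of bases survive the cutoff."""
    records = list(records)
    kept = filter_by_length(records, min_len)
    total_bases = sum(len(seq) for _, seq in records)
    kept_bases = sum(len(seq) for _, seq in kept)
    return {
        "contigs_kept": len(kept),
        "contigs_dropped": len(records) - len(kept),
        "fraction_bases_kept": kept_bases / total_bases if total_bases else 0.0,
    }


# Made-up contigs, for illustration only.
contigs = [("a", "A" * 1500), ("b", "C" * 999), ("c", "G" * 1000)]
print([name for name, _ in filter_by_length(contigs)])
print(retention_summary(contigs))
```

If a cutoff removes a large share of contigs but only a small share of bases, the reduction in per-contig files should be large while little sequence is lost to the repeat search.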

ADD REPLY • link 3.1 years ago by Philipp Bayer 8.8k

Thanks for the suggestion! I can see that the .mod.harvest directory is still having files written to it, so I think the issue is probably coming from how fragmented the assembly is. I'll remove the smaller contigs first and rerun. Then I'll try concatenating things if that doesn't help. Thank you!

ADD REPLY • link 3.1 years ago by grevanksi • 0

score 1 · Answer 2 · 2022-04-12

3.1 years ago

BioinformaticBird ▴ 110

Hi, you may want to consider MOSGA for the annotation. It includes BRAKER2 and additionally provides functional annotation and submission-ready output. You can also use its genome browser to take a look at your genome annotation.

MOSGA Source-Code

ADD COMMENT • link 3.1 years ago by BioinformaticBird ▴ 110