De novo whole-genome assembly and annotation
7.9 years ago

I have been trying to assemble a nematode genome de novo for the past 6 months. I have read libraries with three insert sizes: 375 bp, 500 bp and 5 kbp. The average read length for all libraries is 150 bp. So far I have used Velvet, SOAPdenovo, ALLPATHS-LG and MaSuRCA with a k-mer length of 75 (chosen after testing multiple k-mer sizes with Velvet). All assemblers except Velvet gave me a very low N50 and low total genome coverage, so I took the Velvet assembly forward to improve it and fill the gaps. I ran SSPACE to scaffold the Velvet contigs, then ran FinisherSC with the 375 bp, 500 bp and 5 kbp insert library reads (three separate runs) to fill gaps. The statistics below are from the gnx tool.

Results using 500bp insert library reads:

Total number of sequences: 17587
Total length of sequences: 120498292 bp
Shortest sequence length: 200 bp
Longest sequence length: 301717 bp
Total number of Ns in sequences: 858844
N50: 30511 (1051 sequences) (60270141 bp combined)

Results using 375bp insert library reads:

Total number of sequences: 17776
Total length of sequences: 121255487 bp
Shortest sequence length: 200 bp
Longest sequence length: 301717 bp
Total number of Ns in sequences: 861716
N50: 30683 (1044 sequences) (60637225 bp combined)

The run with the 5 kbp insert library is still in progress. I am in a hurry to move on to annotating this genome and improving it further. My genome size estimate is around 150 Mbp. I am pretty new to this kind of work, so I am writing to ask whether my assembly is good enough to freeze and take forward to annotation and other analyses, or whether there is a way to improve it further before annotating. I would also like to know which de novo gene prediction and annotation methods are in use these days.
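For readers who want to reproduce numbers like the ones above, here is a minimal Python sketch of how N50 is computed from a list of sequence lengths, plus NG50, which uses the estimated genome size instead of the assembly total. The function name `n50_stats` and the toy lengths are illustrative assumptions, not part of the gnx tool; only the totals plugged in at the end come from the 500bp run reported above.

```python
def n50_stats(lengths, genome_size=None):
    """Return (length, count) at the point where the cumulative sum of
    sequences, sorted longest first, reaches half of the target size.

    With genome_size=None the target is half the assembly total (N50);
    otherwise it is half the given genome size (NG50).
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = total / 2
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None, None  # the assembly never reaches half of the target size


# Toy example (made-up lengths, total 300 bp).
toy = [100, 80, 60, 40, 20]
print(n50_stats(toy))                   # N50 of the toy set
print(n50_stats(toy, genome_size=400))  # NG50 against a hypothetical 400 bp genome

# Figures reported above for the 500bp insert library run.
total_bp = 120_498_292
n_bases = 858_844
estimated_genome = 150_000_000
print(f"Assembly covers {total_bp / estimated_genome:.1%} of the estimated genome size")
print(f"Ns make up {n_bases / total_bp:.2%} of the assembly")
```

Because the assembly total (about 120 Mbp) is smaller than the estimated genome size (about 150 Mbp), NG50 would come out lower than the reported N50, which is worth keeping in mind when comparing against other assemblies.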


I'm not sure what the bigger objective of assembling this genome is, but I would expect that adding some long reads (Nanopore or PacBio) would definitely improve the assembly and reduce the number of contigs.


For now we cannot get long reads or send the sample for sequencing again. We have to proceed with these data and produce a draft. Our bigger goal is to annotate the genome and look for specific genes responsible for disease progression in cattle; the same worm does not cause any disease when it infects other mammals.


As @WouterDeCoster said, using long reads is a good solution, but since you said that is not possible in your case: did you try GapFiller? It may give you a better result, and it offers the possibility to manually control the gap closure process.


Yes, I have used the GapClosure tool for that. Only after running GapClosure did I go ahead and run FinisherSC. The results are above.


I would suggest assembling a list of highly conserved genes from closely related organisms, searching them against the genome with BLAST (tblastn if the queries are protein sequences) and seeing how many of them you can find in your draft genome. Nanopore is a good suggestion as well; you should be able to get enough reads for less than $5k.
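One way to turn that search into a number is to count how many of the conserved genes have at least one convincing hit. The sketch below assumes the search was saved in BLAST tabular format (`-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore). The function name, the thresholds and the three example rows are made up for illustration and are not results from this assembly.

```python
import csv
import io


def conserved_genes_found(blast_tab, all_queries, max_evalue=1e-10, min_pident=30.0):
    """Return (set of query genes with a passing hit, fraction of all queries found).

    blast_tab is an open text handle on BLAST tabular output (outfmt 6).
    The e-value and identity cut-offs are illustrative assumptions.
    """
    queries = set(all_queries)
    found = set()
    for row in csv.reader(blast_tab, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        qseqid, pident, evalue = row[0], float(row[2]), float(row[10])
        if qseqid in queries and evalue <= max_evalue and pident >= min_pident:
            found.add(qseqid)
    return found, len(found) / len(queries)


# Invented example rows: geneA has a strong hit, geneB only a weak one,
# geneC has no hit at all.
example = io.StringIO(
    "geneA\tscaffold_12\t85.0\t200\t30\t0\t1\t200\t5000\t5600\t1e-80\t250\n"
    "geneB\tscaffold_7\t25.0\t50\t37\t1\t10\t60\t100\t250\t0.5\t30\n"
    "geneA\tscaffold_3\t40.0\t120\t72\t2\t5\t125\t900\t1260\t1e-12\t80\n"
)
found, fraction = conserved_genes_found(example, ["geneA", "geneB", "geneC"])
print(sorted(found), f"{fraction:.0%} of conserved genes found")
```

A low fraction here points at missing or fragmented regions rather than at a specific assembler problem, so it is best read alongside N50 and total assembly length.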

7.9 years ago
Rohit ★ 1.5k

You can keep trying to improve the results, but once you reach satisfactory completeness you need to start analyzing for genes. With the data you have, an N50 of 30 kb is pretty impressive. Differences between assembly tools arise in how they handle multiple insert sizes, merging of overlapping libraries, error correction and long reads. For our data, we froze the assembly based on N50 and BUSCO completeness (CEGMA too). BRAKER and MAKER are definitely good starting points for annotation, and transcriptome data, if you have it, is always an added advantage.
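As a rough illustration of freezing on N50 plus BUSCO completeness, the sketch below parses the one-line score string that BUSCO prints in its short summary (assumed here to look like `C:..%[S:..%,D:..%],F:..%,M:..%,n:..`) and checks it against cut-offs. The cut-offs and the example BUSCO line are invented placeholders, not recommendations and not results for this genome; only the N50 value is taken from the question.

```python
import re

# Assumed layout of the BUSCO one-line summary; adjust if your version differs.
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco_summary(line):
    """Extract complete/single/duplicated/fragmented/missing percentages and n."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("line does not look like a BUSCO summary")
    scores = {key: float(value) for key, value in match.groupdict().items()}
    scores["n"] = int(scores["n"])
    return scores


def ready_to_freeze(n50, busco_complete, min_n50=20_000, min_complete=85.0):
    """Hypothetical freeze rule: both metrics must clear their cut-offs."""
    return n50 >= min_n50 and busco_complete >= min_complete


# Invented BUSCO line for demonstration; the N50 is the 500bp-run value above.
scores = parse_busco_summary("C:88.5%[S:86.0%,D:2.5%],F:6.0%,M:5.5%,n:3131")
print(scores)
print("freeze:", ready_to_freeze(n50=30_511, busco_complete=scores["C"]))
```

Sensible cut-offs depend on the organism and on what the annotation will be used for, so treat the defaults as something to replace.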


Thank you, Rohit, for a very informative and useful reply. I have one more query: do you know about singletons and unitigs, and how can I check for them and isolate them from this huge number of contigs? I want to isolate the contigs that are long and specific to this species' genome. One way I can think of is to compare my genome with a closely related species such as C. elegans and then, based on gene finders, concentrate only on those specific scaffolds. Or is there another feasible way to find them? I would also like to know about misassemblies and how to tackle repeats in the genome. Thank you.


This post might help. Contigs are usually one step above unitigs (paired-end information is used later), so you do not usually have to go back to them. Since singletons are reads with no overlaps, due to low coverage or sequencing bias, you do not have to worry about them much unless you find similarity to a gene in a closely related species.

Your approach should work fine: go for the longer contigs, check whether they contain ab initio predicted genes, and attach annotations from known genes.
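Picking out the longer contigs can be done with a few lines of plain Python. This is a minimal sketch, assuming the scaffolds are in a standard FASTA file; the function names, the 10-base cut-off and the in-memory example records are illustrative only, and a real run would pass an open file handle and a much larger threshold.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA text handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            fields = line[1:].split()
            name, chunks = (fields[0] if fields else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def long_contigs(handle, min_length):
    """Yield only the records whose sequence is at least min_length bases."""
    for name, seq in read_fasta(handle):
        if len(seq) >= min_length:
            yield name, seq


# Tiny made-up example; with a real assembly use open("scaffolds.fasta").
example = io.StringIO(
    ">contig_1 toy record\nACGTACGTACGT\nACGT\n"
    ">contig_2\nACGT\n"
    ">contig_3\nNNNNACGTACGTAC\n"
)
kept = list(long_contigs(example, min_length=10))
for name, seq in kept:
    print(name, len(seq))
```

The kept records can then be written back out and passed to a gene predictor, so that annotation effort goes to the scaffolds most likely to carry complete genes.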

3.5 years ago
sagnik ▴ 50

Hello,

We have developed a gene annotator called FINDER, which can annotate eukaryotic genomes using short-read RNA-Seq data and protein sequences. It is completely automated and requires no manual intervention. FINDER also runs BRAKER to incorporate predicted genes into the repertoire. The paper is available from FINDER and the software from GitHub.

Thank you.
