Question

Scope for improvement in a whole-genome assembly

6.9 years ago

mks002 ▴ 220

Hello all, I tried assembling a plant genome (size 2.8 Gb) using MaSuRCA. My concern is that MaSuRCA predicted a ~2.9 Gb assembly, but the finished assembly came out at ~3.5 Gb.

The assembly N50 is ~100 kb. Can anyone suggest whether we can still improve the assembly and get the desired assembly size?

Any suggestions would be appreciated.

sequence Assembly • 1.7k views

ADD COMMENT • link updated 6.9 years ago by lakhujanivijay 5.9k • written 6.9 years ago by mks002 ▴ 220

Hi mks002

Can you please add more details, such as what data you have (i.e. coverage, data size, platforms; hybrid?)?

Of course, there is scope for improvement. What is the minimum scaffold size? How many gaps are there? What plant is that? Are there too many repeats in the genome?

Update: I have updated the title of your post.

ADD REPLY • link 6.9 years ago by lakhujanivijay 5.9k

Thanks Vijay Lakhujani. We are working with ~130X data: Illumina WGS data (110X), 3 mate-pair libraries (5X), and PacBio data (13X). That is around 1000 million WGS reads and 40 Gb of PacBio reads.
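As a quick sanity check on those figures, coverage is simply total sequenced bases divided by genome size. The sketch below is only illustrative: it assumes "40 Gb" means 40 gigabases (not file size) and uses the ~2.9 Gb lab estimate mentioned in this thread; the function name `coverage` is made up for the example.

```python
def coverage(total_bases: float, genome_size: float) -> float:
    """Return sequencing depth as total bases divided by genome size."""
    return total_bases / genome_size


# Assumption: ~2.9 Gb genome size, as estimated in the lab.
GENOME_SIZE = 2.9e9

# Assumption: "40 Gb of PacBio reads" means 40 gigabases.
pacbio_cov = coverage(40e9, GENOME_SIZE)

# Sum of the per-library depths quoted above (Illumina + mate pair + PacBio).
total_cov = 110 + 5 + 13

print(f"PacBio depth: {pacbio_cov:.1f}X")  # roughly the quoted 13X
print(f"Total depth:  {total_cov}X")       # close to the quoted ~130X
```

Under those assumptions the PacBio depth and the overall total are consistent with the numbers quoted.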

Assembly stats:

No. of contigs :    1,26,020
Maximum Contig Length : 15,04,108
Minimum Contig Length :       85
Assembly Length :   3,59,11,15,180
Total Number of Non-ATGC Characters :   1,78,68,589
Percentage of Non-ATGC Characters :    0.498
Contigs >= 100 bp : 1,26,018
Contigs >= 200 bp : 1,26,017
Contigs >= 500 bp :   89,996
Contigs >= 1 Kbp :    68,235
Contigs >= 10 Kbp :   49,912
Contigs >= 1 Mbp :         6
N50 value : 1,09,409
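For readers unfamiliar with these metrics, here is a minimal sketch of how N50 and the non-ATGC percentage are usually computed. It is not the script that produced the table above; the function names and the toy contig lengths are hypothetical. The two totals are taken from the table, rewritten without the comma grouping.

```python
def n50(lengths):
    """Return the N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def percent_non_atgc(non_atgc: int, assembly_length: int) -> float:
    """Return the share of non-ATGC characters as a percentage."""
    return 100.0 * non_atgc / assembly_length


# Toy example (hypothetical lengths): total 300, half 150,
# and 100 + 80 = 180 >= 150, so the N50 is 80.
print(n50([100, 80, 60, 40, 20]))

# Values from the table above: 1,78,68,589 and 3,59,11,15,180.
print(round(percent_non_atgc(17_868_589, 3_591_115_180), 3))  # 0.498
```

The second call reproduces the 0.498% reported in the table.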

Repeat masking on this assembly identified ~13% repeats. I think the true repeat content will be on the higher side. My concern is the genome size, which should be around ~2.9 Gb or 3.1 Gb, and the N50, which I expected to be ~500 kb.

I am looking for a solution, as I am short on time.

ADD REPLY • link updated 6.9 years ago by GenoMax 151k • written 6.9 years ago by mks002 ▴ 220

Those commas are in strange positions in that table; is that a formatting issue?

How (or why) do you conclude you should get an N50 of 500 kb?

Do you have any idea about the level of heterozygosity of that genome, or the ploidy of the species?

ADD REPLY • link 6.9 years ago by lieven.sterck 15k

Hello lieven.sterck. Thank you for your reply. The commas are just there as digit separators for easier counting. Yes, I was hoping for a 500 kb N50, as I have performed another, non-hybrid assembly that has around 12% Ns.
It's a diploid species.

ADD REPLY • link 6.9 years ago by mks002 ▴ 220

Hello Vijay Lakhujani, can you detail what improvement approaches there are?

ADD REPLY • link 6.9 years ago by mks002 ▴ 220

You could try to get an estimate of the genome size by analyzing k-mer frequency plots (e.g. using the GenomeScope website). How certain are you of that given size?
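To illustrate the idea behind a k-mer based estimate: genome size is roughly the total number of k-mers divided by the depth of the main peak in the k-mer frequency histogram. The sketch below is a simplified peak-based calculation, not the model GenomeScope fits. The function name, the `min_depth` cutoff for dropping likely error k-mers, and the toy histogram are all assumptions for illustration.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Rough genome-size estimate from a k-mer histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (the cutoff is an assumption; in practice it is read off the plot).
    """
    solid = {d: c for d, c in histogram.items() if d >= min_depth}
    if not solid:
        raise ValueError("no k-mers at or above min_depth")
    # Depth of the main peak: the depth with the most distinct k-mers.
    peak_depth = max(solid, key=solid.get)
    # Total k-mer occurrences among the retained depths.
    total_kmers = sum(d * c for d, c in solid.items())
    return total_kmers / peak_depth


# Toy histogram (hypothetical numbers): an error tail at low depth
# and a main peak at depth 10.
toy = {1: 1000, 2: 200, 3: 50, 8: 10, 9: 30, 10: 50, 11: 30, 12: 10, 20: 5}
print(estimate_genome_size(toy))  # 1400 k-mers / peak depth 10
```

One caveat: in a heterozygous diploid the tallest peak can be the heterozygous one at half the homozygous depth, which would roughly double this naive estimate, so the plot itself should always be inspected.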

ADD REPLY • link 6.9 years ago by lieven.sterck 15k

For genome size, the in-lab estimate came to around 2.9 Gb. The MaSuRCA assembler also predicted around 2.9 Gb at the initial assembly step. My non-hybrid approach also resulted in a 2.9 Gb genome, but with 12% gaps.

ADD REPLY • link 6.9 years ago by mks002 ▴ 220