Question

SOAPdenovo2 genome assembly is highly fragmented - learning genome assembly with public data

5 weeks ago

JieQY • 0

I used SOAPdenovo2 to assemble a plant genome from paired-end FASTQ files. The resulting assembly is highly fragmented, with a very short N50 (127 bp):

Stat    Length (bp)    Cumulative number of sequences
N10     1722           3350
N20     841            10658
N30     403            25957
N40     208            57527
N50     127            117394
N60     127            185877
N70     127            254360
N80     127            322843
N90     121            392445
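For anyone reading these statistics for the first time: Nx is the sequence length at which the longest sequences, added up in descending order, first reach x% of the total assembly size. A minimal sketch of that definition follows; the function name `nx` and the toy lengths are made up for illustration and are not taken from the SRR9257061 run.

```python
def nx(lengths, x=50):
    """Return the Nx length: the length L such that sequences of
    length >= L together cover at least x% of the total assembly size."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)   # longest first
    threshold = sum(ordered) * x / 100        # bases needed to reach x%
    running = 0
    for length in ordered:
        running += length
        if running >= threshold:
            return length


# Toy contig lengths (hypothetical, for illustration only).
toy_lengths = [1000, 800, 500, 300, 200, 100, 100]
n50 = nx(toy_lengths, 50)
n90 = nx(toy_lengths, 90)
print(f"N50 = {n50} bp, N90 = {n90} bp")
```

An N50 of 127 bp on 150 bp reads therefore means that half of the assembled bases sit in sequences shorter than a single read.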

Changing the k-mer size to 41 gave similarly fragmented results.

Below is the configuration file when running soapdenovo2:

max_rd_len=150

[LIB]
avg_ins=350
reverse_seq=0
asm_flags=3
rd_len_cutoff=150
rank=1
pair_num_cutoff=3
map_len=32
q1=../../../../ref/raw/fastq/SRR9257061_1.fastq.gz
q2=../../../../ref/raw/fastq/SRR9257061_2.fastq.gz

Below is the script to run soapdenovo2:

#conda activate soapdenovo
SOAPdenovo-63mer pregraph -s config.txt  -o SRR9257061  -K 63 -R  -p 6  2>./pregraph.log

SOAPdenovo-63mer contig -g SRR9257061 -p 20 -R 2>./contig.log

SOAPdenovo-63mer map -s config.txt -g SRR9257061  -p 6 2>./map.log

SOAPdenovo-63mer scaff -g SRR9257061 -p 6 -F 2>./scaff.log

According to a k-mer analysis, the estimated genome size is 21,741,143 bp, with 22.7% unique sequence and 6.47% heterozygosity.
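As background on where such an estimate comes from: a common back-of-the-envelope approach divides the total number of k-mer occurrences by the depth of the main peak in the k-mer histogram. The sketch below assumes that approach; the function name, the `min_depth` error cutoff, and the toy histogram are all hypothetical, and this is not necessarily what the tool used for the figures above does.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate genome size from a k-mer depth histogram.

    histogram: dict mapping depth -> number of distinct k-mers seen at
    that depth. K-mers below min_depth are treated as sequencing errors
    (the cutoff is an assumption and normally needs tuning per dataset).
    Returns (estimated_size, peak_depth).
    """
    kept = {d: c for d, c in histogram.items() if d >= min_depth}
    if not kept:
        raise ValueError("no k-mers left after error filtering")
    peak_depth = max(kept, key=kept.get)               # depth of the main peak
    total_kmers = sum(d * c for d, c in kept.items())  # total k-mer occurrences
    return total_kmers // peak_depth, peak_depth


# Toy histogram (made up): an error peak at depth 1-2, a main peak at 20,
# and a small repeat peak at 40.
toy_hist = {1: 500000, 2: 50000, 18: 1000, 19: 3000,
            20: 5000, 21: 3000, 22: 1000, 40: 500}
size, peak = estimate_genome_size(toy_hist)
print(f"peak depth {peak}x, estimated size {size} bp")
```

If the wrong peak is picked (for example a repeat peak instead of the main one), the size estimate shifts by a large factor, so it is worth looking at the histogram itself.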

I have very little experience with assembling genomes, so I would be grateful for any advice you could give me!

Gene soapdenovo2 assembly • 552 views

ADD COMMENT • link updated 5 weeks ago by lieven.sterck 15k • written 5 weeks ago by JieQY • 0

welcome to the wonderful world of plant genome assembly :)

Could you provide a bit more detail? E.g., what is the expected genome size, and how much input data are you using? Typically you will need at least 50-60x coverage to get a decent assembly (and "decent" will likely still fall short of what you hope to get; see also the comments of colindaven below).

ADD REPLY • link 5 weeks ago by lieven.sterck 15k

Looks like OP is using publicly available data so likely has no control over the quantity or quality of the data. Perhaps this is a learning exercise.

ADD REPLY • link 5 weeks ago by GenoMax 150k

Thank you very much for your comment! Yes, I am in the stage of learning genome assembly, so I use public data. According to k-mer analysis, the estimated genome size is 21,741,143 bp. Data from Arabidopsis thaliana:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR925/000/SRR9257060/SRR9257060_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR925/000/SRR9257060/SRR9257060_2.fastq.gz

Total Bases: 4.9 Gbp
Total Sequences: 33,115,474
Sequence Length: 150 bp
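As a rough sanity check on these numbers, sequencing depth can be approximated as total bases divided by genome size. The sketch below uses the 4.9 Gbp and the k-mer estimate quoted in this thread; the ~135 Mb figure for Arabidopsis thaliana is an assumption (a commonly cited approximate size, not something stated here), and the function name is made up.

```python
def coverage(total_bases, genome_size):
    """Approximate sequencing depth: total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome size must be positive")
    return total_bases / genome_size


TOTAL_BASES = 4.9e9               # "Total Bases: 4.9 Gbp" from this thread
KMER_GENOME_SIZE = 21_741_143     # k-mer based estimate from this thread
ASSUMED_ATH_GENOME_SIZE = 135e6   # ASSUMPTION: approximate A. thaliana genome size

depth_kmer = coverage(TOTAL_BASES, KMER_GENOME_SIZE)
depth_assumed = coverage(TOTAL_BASES, ASSUMED_ATH_GENOME_SIZE)
print(f"depth vs k-mer estimate: {depth_kmer:.0f}x")
print(f"depth vs assumed ~135 Mb genome: {depth_assumed:.0f}x")
```

Under that assumption the two depths differ by roughly a factor of six, which suggests the 21.7 Mb estimate deserves a second look before it is used to judge whether the data are sufficient.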

I know that short-read next-generation sequencing (NGS) data makes it hard to assemble plant genomes that are highly heterozygous and repeat-rich. My goal is to analyze transposons in plants. For plant genomes, would it be better to assemble from third-generation sequencing data alone, or to use a hybrid assembly that combines NGS and third-generation data?

ADD REPLY • link 5 weeks ago by JieQY • 0

Use current long-read data from ONT or PacBio and, as you are doing, a small-ish genome like Arabidopsis. I would use Illumina data for polishing rather than for the actual contig assembly. Check out the many recent genome papers for a sense of best practices to follow.

ADD REPLY • link 5 weeks ago by colindaven 7.4k

OK, the amount of data is on the low end for ATH (Arabidopsis thaliana), but it should still give you a better result than what you are getting ...

Also, the k-mer analysis result is a bit funky; which tool did you use for it? Or it might be that the data file you selected is quite crappy data (which would also explain the assembly result). Perhaps try a different dataset?

(Ah, and Arabidopsis is neither that repetitive nor very heterozygous; it's actually a selfing plant, so about as near homozygous as it gets ;-) , but I understand why you bring that up :) )

Pff, right: if your plan is to investigate TEs, you will need a good assembly (as those are exactly the difficult-to-assemble regions), and you will benefit much more from colindaven's advice elsewhere in this post (in a nutshell: drop the short reads and go primarily for long reads).

ADD REPLY • link 5 weeks ago by lieven.sterck 15k

score 1 · Answer 1 · 2025-03-12

Basically, you need PacBio HiFi or Oxford Nanopore long reads to assemble genomes to a good level of contiguity (a high N50, at the Mb level). With short reads alone you are lost: you can expect contig N50s below 100 kbp. Plants are highly repetitive, so only some parts of the gene space will be assembled from short reads alone.