I used SOAPdenovo2 to assemble a plant genome; the input was FASTQ files. However, the assembly came out heavily fragmented, with a very short N50:
        Length (bp)   Count
N10     1722          3350
N20     841           10658
N30     403           25957
N40     208           57527
N50     127           117394
N60     127           185877
N70     127           254360
N80     127           322843
N90     121           392445
Even when I changed the k-mer size to 41, I got similarly fragmented results.
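These N-statistics can also be recomputed directly from the scaffold FASTA, for example with seqkit (assuming it is installed; SOAPdenovo2 writes the scaffolds to <prefix>.scafSeq):

# recompute N50 and related statistics from the scaffold FASTA
seqkit stats -a SRR9257061.scafSeq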
Below is the configuration file used when running SOAPdenovo2:
#maximal read length
max_rd_len=150
[LIB]
#average insert size
avg_ins=350
#whether the sequences need to be reversed (0 = forward-reverse paired-end)
reverse_seq=0
#in which part(s) the reads are used (3 = both contig and scaffold assembly)
asm_flags=3
#use only the first 150 bp of each read
rd_len_cutoff=150
#in which order the reads are used while scaffolding
rank=1
#cutoff of pair number for a reliable connection
pair_num_cutoff=3
#minimum aligned length to contigs for a reliable read location
map_len=32
#a pair of FASTQ files; read 1 file followed by read 2 file
q1=../../../../ref/raw/fastq/SRR9257061_1.fastq.gz
q2=../../../../ref/raw/fastq/SRR9257061_2.fastq.gz
Below is the script used to run SOAPdenovo2:
#conda activate soapdenovo
# Step 1: build the de Bruijn graph with k-mer size 63
SOAPdenovo-63mer pregraph -s config.txt -o SRR9257061 -K 63 -R -p 6 2>./pregraph.log
# Step 2: assemble contigs from the graph (-R resolves repeats using reads)
SOAPdenovo-63mer contig -g SRR9257061 -p 20 -R 2>./contig.log
# Step 3: map the reads back onto the contigs
SOAPdenovo-63mer map -s config.txt -g SRR9257061 -p 6 2>./map.log
# Step 4: build scaffolds from paired-end links (-F fills gaps in scaffolds)
SOAPdenovo-63mer scaff -g SRR9257061 -p 6 -F 2>./scaff.log
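For what it's worth, the same four steps can also be run in one go with SOAPdenovo2's all subcommand; a minimal sketch with the same parameters (the log file names here are just placeholders):

# single-command equivalent of the four steps above
SOAPdenovo-63mer all -s config.txt -o SRR9257061 -K 63 -R -F -p 6 1>ass.log 2>ass.err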
According to a k-mer analysis, the estimated genome size is 21,741,143 bp (unique: 22.7%, heterozygosity: 6.47%).
I have very little experience with assembling genomes, so I would be grateful for any advice you could give me!
welcome to the wonderful world of plant genome assembly :)
Could you provide a bit more detail in your question? E.g. what is the expected genome size? What input data are you using, and how much of it? ... Typically you will need at least 50-60x coverage to get a somewhat decent assembly (and "decent" likely does not even go in the direction of what you hope to get; see also the comments of colindaven below).
Looks like OP is using publicly available data so likely has no control over the quantity or quality of the data. Perhaps this is a learning exercise.
Thank you very much for your comment! Yes, I am still at the stage of learning genome assembly, so I am using public data. According to a k-mer analysis, the estimated genome size is 21,741,143 bp. The data are from Arabidopsis thaliana:
Total bases: 4.9 Gbp
Total sequences: 33,115,474
Sequence length: 150 bp
I know that next-generation sequencing data present a big challenge when assembling plant genomes with high heterozygosity and highly repetitive sequences, because of the short read length. I want to analyze transposons in plants, so I would like to know: for plant genomes, is it better to use pure third-generation sequencing for the assembly, or a hybrid assembly of next-generation and third-generation sequencing?
Use current long-read data from ONT or PacBio and, like you are doing, a small-ish genome like Arabidopsis. I would use Illumina data for polishing rather than for the actual contig assembly. Check out the many, many recent genome papers for a sense of best practices to follow.
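To make that concrete, a minimal long-read-first sketch could look like the following (Flye is just one assembler option; the ONT read file name and the ~135 Mbp genome size are assumptions, not something from this thread):

# assemble long reads with Flye; ont_reads.fastq.gz is a placeholder
flye --nano-raw ont_reads.fastq.gz --genome-size 135m --out-dir flye_asm --threads 16
# the Illumina pairs (SRR9257061_*.fastq.gz) would then only be mapped to
# flye_asm/assembly.fasta for a short-read polishing round (e.g. with Pilon)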
OK, the data is on the low end for ATH, but it should still get you a better result than what you are getting ...
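Rough back-of-the-envelope, assuming the commonly cited ~135 Mbp for the A. thaliana genome rather than the k-mer estimate:

# approximate coverage = total bases / genome size
echo "4900000000 / 135000000" | bc -l   # ~36x against a ~135 Mbp genome
echo "4900000000 / 21741143" | bc -l    # ~225x against the k-mer-based estimate above

So against the actual genome size you are sitting at roughly 36x, below the 50-60x mentioned above.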
Also, the k-mer analysis result is a bit funky; which tool did you use for it? Or it might be that the data file you selected is just quite crappy data ... (which would also explain the assembly result). Perhaps try a different dataset?
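If you want to redo the k-mer spectrum yourself, a fairly standard sketch is Jellyfish plus GenomeScope (assuming Jellyfish is installed; k=21 and the hash size are just typical starting values):

# count canonical 21-mers; gzipped input needs process substitution
jellyfish count -C -m 21 -s 1G -t 8 -o reads.jf \
    <(zcat SRR9257061_1.fastq.gz) <(zcat SRR9257061_2.fastq.gz)
# build the k-mer histogram, then feed reads.histo to GenomeScope for
# genome size / heterozygosity estimates
jellyfish histo -t 8 reads.jf > reads.histo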
(Ah, and Arabidopsis is not that repetitive, nor is it very heterozygous; it's actually a selfing plant, so about as near homozygous as it gets ;-) , but I understand why you bring that up :) )
Pff, right: if your plan is to investigate TEs you will need a good assembly (as those are exactly the regions that are difficult to assemble), and you will benefit much more from colindaven's advice elsewhere in this post (in a nutshell: drop the short reads and go primarily for long reads).