Hello everyone,
I am trying to assemble unaligned reads into de novo contigs (these unaligned reads were extracted after mapping paired-end reads to a reference genome). I tried two de novo assemblers: MaSuRCA and SPAdes.
1. The assembly stats generated by QUAST for MaSuRCA:
# contigs (>= 0 bp) 6180
# contigs (>= 1000 bp) 1284
# contigs (>= 5000 bp) 8
# contigs (>= 10000 bp) 0
# contigs (>= 25000 bp) 0
# contigs (>= 50000 bp) 0
Total length (>= 0 bp) 4715175
Total length (>= 1000 bp) 2119701
Total length (>= 5000 bp) 47692
Total length (>= 10000 bp) 0
Total length (>= 25000 bp) 0
Total length (>= 50000 bp) 0
# contigs 3546
Largest contig 6948
Total length 3703670
GC (%) 39.04
N50 1110
N90 600
auN 1449.9
L50 1030
L90 2868
# N's per 100 kbp 0.00
2. The assembly stats generated by QUAST for SPAdes:
# contigs (>= 0 bp) 52187
# contigs (>= 1000 bp) 2881
# contigs (>= 5000 bp) 47
# contigs (>= 10000 bp) 1
# contigs (>= 25000 bp) 0
# contigs (>= 50000 bp) 0
Total length (>= 0 bp) 20642697
Total length (>= 1000 bp) 5141662
Total length (>= 5000 bp) 287508
Total length (>= 10000 bp) 12949
Total length (>= 25000 bp) 0
Total length (>= 50000 bp) 0
# contigs 8583
Largest contig 12949
Total length 9033035
GC (%) 37.17
N50 1133
N90 578
auN 1612.0
L50 2293
L90 6900
# N's per 100 kbp 0.00
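For anyone comparing the two tables above, it may help to see how N50, L50 and auN are derived from a list of contig lengths. This is a minimal Python sketch, not QUAST's own code. The function name `assembly_stats` and the `min_len` parameter are made up for illustration. The 500 bp default is meant to mirror QUAST's default minimum contig length, which would explain why the plain "# contigs" row is smaller than "# contigs (>= 0 bp)".

```python
def assembly_stats(lengths, min_len=500):
    """Summarise contig lengths (hypothetical helper, not part of QUAST)."""
    # Keep contigs at or above the cutoff, longest first.
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    total = sum(kept)
    stats = {
        "contigs": len(kept),
        "total": total,
        "largest": kept[0] if kept else 0,
    }
    # Nx: length of the contig at which the running sum first reaches x% of
    # the total length. Lx: how many contigs were needed to get there.
    for pct in (50, 90):
        running = 0
        for count, n in enumerate(kept, start=1):
            running += n
            if running * 100 >= total * pct:
                stats[f"N{pct}"] = n
                stats[f"L{pct}"] = count
                break
    # auN: length-weighted mean contig length, sum(len^2) / sum(len).
    stats["auN"] = sum(n * n for n in kept) / total if total else 0.0
    return stats
```

Because N50 depends on the total length of each assembly, two assemblies with very different totals (here about 3.7 Mbp versus 9.0 Mbp) can have nearly identical N50 values and still differ greatly in content.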
I want to choose the better of these two assemblies.
To check for misassemblies, I mapped the raw paired-end reads back to the assembled contigs, expecting every position to be covered by reads.
According to the mapping stats, very few reads fail to map back to the contigs, and 98% of reads are properly paired (the results are the same for both assemblers).
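The "all positions are covered" expectation can be checked per contig rather than from the overall mapping rate. The sketch below is one possible way to do it, assuming the per-base depths were exported with `samtools depth -a`, which writes tab-separated contig, position and depth columns and includes zero-depth positions. The function name `coverage_breadth` and the `min_depth` parameter are hypothetical.

```python
from collections import defaultdict


def coverage_breadth(depth_lines, min_depth=1):
    """Return {contig: fraction of positions with depth >= min_depth}.

    depth_lines: iterable of 'contig<TAB>position<TAB>depth' strings,
    assumed to come from `samtools depth -a` so uncovered bases are listed.
    """
    covered = defaultdict(int)
    total = defaultdict(int)
    for line in depth_lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        contig, _pos, depth = line.split("\t")
        total[contig] += 1
        if int(depth) >= min_depth:
            covered[contig] += 1
    return {contig: covered[contig] / total[contig] for contig in total}
```

Contigs whose breadth is well below 1.0, or that are only covered at very low depth, would be candidates for closer inspection. A high properly-paired rate on its own does not rule out local misassemblies.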
Then I screened the assembled contigs for contaminant sequences and removed them (using the NCBI Foreign Contamination Screen, FCS-GX). The summarised sequence stats are in this figure:
Then I checked for repetitive sequences in my clean.fasta with RepeatModeler (clean.fasta is the primary assembly output after removing the contaminant sequences).
Summary of the repeat sequences detected:
I am wondering which assembly is more correct/complete. The assemblies generated by MaSuRCA and SPAdes are significantly different. Are there any suggestions for determining which assembler is really the better one in this case, or any further analysis I can do to confirm that?
Thank you.
The reference genome that you aligned to originally - of what quality is it? Is it a reasonably "finished" genome? If so, then the unaligned reads you ended up with may not actually represent much useful additional information.
Have you tried running BUSCO on the assemblies above to see if there is any real information in there that was missing from the original reference (which you should check with BUSCO as well)?
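One way to act on the BUSCO suggestion is to list the BUSCO genes that are missing from the reference but found in the contigs assembled from unmapped reads. The sketch below assumes both BUSCO runs used the same lineage dataset and that each produced a `full_table.tsv` (tab-separated, comment lines starting with `#`, BUSCO ID in the first column and status in the second). The function names are hypothetical.

```python
def busco_status(table_lines):
    """Map BUSCO ID -> status from the lines of a BUSCO full_table.tsv."""
    status = {}
    for line in table_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header comments and blank lines
        fields = line.rstrip("\n").split("\t")
        # Duplicated BUSCOs appear on several rows with the same status,
        # so overwriting the entry is harmless.
        status[fields[0]] = fields[1]
    return status


def recovered_buscos(reference_lines, contig_lines):
    """BUSCO IDs missing in the reference but complete in the new contigs."""
    ref = busco_status(reference_lines)
    new = busco_status(contig_lines)
    found = {"Complete", "Duplicated"}
    return sorted(
        busco_id
        for busco_id, state in new.items()
        if state in found and ref.get(busco_id) == "Missing"
    )
```

An empty or very short list would suggest that the unmapped-read contigs add little gene content beyond the reference.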
If you are sure that your "unaligned" reads actually are from the correct genome, perhaps what you should try is to assemble the entire dataset and then compare the resulting assembly to the "reference" genome.
Thank you, Sir, for your suggestion.
The reference genome that I used is a chromosome-level genome assembly (link).
I used NCBI FCS-GX to screen my assembled sequences for contamination (specifying TaxID 3705, "Brassica"). So I believe that the contigs assembled from the unmapped reads are from Brassica.
I have not tried running BUSCO yet. If I do run it, which sequence should I use: the primary assembly (the original assembler output), or the original reference genome concatenated with the clean contigs (the contigs left after removing contaminant sequences)?
I am a newbie in this field and would highly appreciate any suggestions. Thank you.
The NCBI genome link you posted is from 2014 but the assembly appears to be at the chromosome level. Based on the BUSCO analysis included, it seems to be 90% complete for single-copy genes.
Did the reference you originally used contain the 32876 unplaced scaffolds that are present in the NCBI assembly? If you did not include those pieces, then much of what you are assembling from "unmapped" reads may already be in those scaffolds. If you did include those scaffolds, and thus your "unmapped" reads represent real/new sequence, how are you going to place those in the full assembly? That is the reason I was suggesting that you try a complete assembly of your entire dataset followed by comparison to the published NCBI assembly.
Why did you think that you had "contamination" in your sequences? If you did not, then you could be throwing good sequences away. If you did have "real" contamination, then you may have a bigger problem on your hands.