Hi everyone,
I have paired-end whole-genome sequencing reads from Brassica. I mapped the short reads to the reference genome and extracted the unmapped reads. Then I assembled these unmapped reads de novo into contigs with two assemblers, MaSuRCA and SPAdes. Here are the assembly stats generated by QUAST:
- Assembly stats for the sequence assembled by MaSuRCA:
Assembly primary.genome.scf
# contigs (>= 0 bp) 6180
# contigs (>= 1000 bp) 1284
# contigs (>= 5000 bp) 8
# contigs (>= 10000 bp) 0
# contigs (>= 25000 bp) 0
# contigs (>= 50000 bp) 0
Total length (>= 0 bp) 4715175
Total length (>= 1000 bp) 2119701
Total length (>= 5000 bp) 47692
Total length (>= 10000 bp) 0
Total length (>= 25000 bp) 0
Total length (>= 50000 bp) 0
# contigs 3546
Largest contig 6948
Total length 3703670
GC (%) 39.04
N50 1110
N90 600
auN 1449.9
L50 1030
L90 2868
# N's per 100 kbp 0.00
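For readers comparing these numbers, here is a minimal sketch of how I understand N50, L50 and auN to be defined. The function name `contiguity_stats` and the toy lengths are made up for illustration and are not taken from my assemblies. As far as I know, QUAST computes these metrics only on contigs above its minimum length (500 bp by default), which would explain why "# contigs" is smaller than "# contigs (>= 0 bp)".

```python
def contiguity_stats(lengths):
    """Return (N50, L50, auN) for a list of contig lengths in bp."""
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        raise ValueError("need at least one contig with non-zero length")

    # Walk from the longest contig down until half of the total length is covered.
    running = 0
    n50 = l50 = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running * 2 >= total:
            n50, l50 = length, count
            break

    # auN: length-weighted mean contig length (sum of squares over total length).
    aun = sum(length * length for length in ordered) / total
    return n50, l50, aun


# Hypothetical toy lengths, only to show the call.
toy_lengths = [10, 8, 6, 4, 2]
n50, l50, aun = contiguity_stats(toy_lengths)
print(n50, l50, round(aun, 2))
```

With real data the lengths would come from the assembly FASTA, for example by summing sequence line lengths per record.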
- Assembly stats for the sequence assembled by SPAdes:
Assembly contigs
# contigs (>= 0 bp) 52187
# contigs (>= 1000 bp) 2881
# contigs (>= 5000 bp) 47
# contigs (>= 10000 bp) 1
# contigs (>= 25000 bp) 0
# contigs (>= 50000 bp) 0
Total length (>= 0 bp) 20642697
Total length (>= 1000 bp) 5141662
Total length (>= 5000 bp) 287508
Total length (>= 10000 bp) 12949
Total length (>= 25000 bp) 0
Total length (>= 50000 bp) 0
# contigs 8583
Largest contig 12949
Total length 9033035
GC (%) 37.17
N50 1133
N90 578
auN 1612.0
L50 2293
L90 6900
# N's per 100 kbp 0.00
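Since the two assemblies differ a lot in total size, one comparison I could make is the share of assembled bases that sits in contigs of at least 1000 bp. The sketch below only divides numbers copied from the two QUAST reports above. The dictionary layout and the function name `fraction_in_long_contigs` are my own, and I am not sure this is the right criterion for choosing an assembler.

```python
# Values copied from the QUAST reports above (total length in bp).
quast = {
    "MaSuRCA": {"total_ge_0": 4715175, "total_ge_1000": 2119701},
    "SPAdes": {"total_ge_0": 20642697, "total_ge_1000": 5141662},
}


def fraction_in_long_contigs(report):
    """Fraction of all assembled bases found in contigs >= 1000 bp."""
    return report["total_ge_1000"] / report["total_ge_0"]


for name, report in quast.items():
    share = 100 * fraction_in_long_contigs(report)
    print(f"{name}: {share:.1f}% of assembled bases are in contigs >= 1000 bp")
```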
Then I screened the assembled sequences for contamination and removed it using the NCBI Foreign Contamination Screen (FCS-GX). For each assembly I got 2 output files: clean.fasta (the assembled sequence after removing contamination) and contamination.fasta (the list of contaminant sequences). Here are the sequence stats of the raw assembled sequence, the cleaned assembled sequence, and the contaminant sequences:
Based on the above stats, I saw that the SPAdes assembly has more contaminant contigs than the MaSuRCA assembly.
I also ran RepeatModeler on each clean.fasta to detect the repetitive sequences in it. It reported 28 repetitive sequences for the MaSuRCA clean.fasta and 122 for the SPAdes clean.fasta.
The RepeatModeler masked stats for MaSuRCA clean.fasta sequence:
Sample Stats:
Sample Size 3964084 bp
Num Contigs Represented = 4900
Non ambiguous bp:
Initial: 3964084 bp
After Masking: 3870473 bp
Masked: 2.36 %
-- Input Database Coverage: 3964084 bp out of 3964159 bp ( 100.00 % )
The RepeatModeler masked stats for SPAdes clean.fasta sequence:
Sample Stats:
Sample Size 10000678 bp
Num Contigs Represented = 21926
Non ambiguous bp:
Initial: 10000678 bp
After Masking: 9681079 bp
Masked: 3.20 %
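As a sanity check on the two "Masked" percentages, they can be recomputed from the "Initial" and "After Masking" lines. This is a small sketch assuming the percentage is simply the masked bases divided by the initial non-ambiguous bases. The function name `masked_percent` is my own.

```python
def masked_percent(initial_bp, after_masking_bp):
    """Percentage of non-ambiguous bases that were masked."""
    return 100 * (initial_bp - after_masking_bp) / initial_bp


# Values copied from the RepeatModeler sample stats above.
masurca = masked_percent(3964084, 3870473)
spades = masked_percent(10000678, 9681079)

print(f"MaSuRCA: {masurca:.2f} % masked")
print(f"SPAdes:  {spades:.2f} % masked")
```

Both values agree with the reported 2.36 % and 3.20 %, so the masked fraction of the SPAdes sample is slightly higher than that of the MaSuRCA sample.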
Based on this information, do you have any suggestions on which assembler is better in this case (I mean, which assembled sequence is better)? Thank you.