Hi,
I am currently working on bacterial genome sequencing. This is the original post (C: Kmer selection for bacterial WGS denovo assembly using SPAdes or SOAP-denovo). I ran two different analyses: one with the original trimmed reads and the other with downsampled reads (using BBNorm).
Do I need to use
Downsampled dataset: ~35 million reads
Genome assembly using SPAdes assembler
SPAdes Command:
python3 $home/bin/spades.py -o spades_out -1 Sample.R1.fastq.gz -2 Sample.R2.fastq.gz --careful
Genome Evaluation using QUAST:
python3 $home/quast.py scaffolds.fasta --glimmer --use-all-alignments --rna-finding --output-dir quast_output
Original trimmed dataset: ~ 430 million reads
Genome assembly using SPAdes assembler
SPAdes Command:
python3 $home/bin/spades.py -o spades_out -1 Sample.R1.fastq.gz -2 Sample.R2.fastq.gz --careful
Genome Evaluation using QUAST:
python3 $home/quast.py scaffolds.fasta --glimmer --use-all-alignments --rna-finding --output-dir quast_output
Which final SPAdes output file (contigs.fasta or scaffolds.fasta) should be used for downstream analyses?
Should I use the K55 final_contigs/final_scaffolds, or the contigs/scaffolds fasta files in the main output directory?
Using scaffolds.fasta for both the downsampled and the original trimmed reads, I evaluated the assemblies with QUAST. How should I interpret the results from these tables?
Downsampled results refer to normalized data since OP used bbnorm.sh as reflected by most numbers. I don't know why N's are higher in normalized data.
bioinforesearchquestions : Were these reads completely cleaned of artifacts before being normalized?
I wonder if a further reduction in data would help. If you have the time to do it you may want to try.
Probably because at the scaffolding step, some contigs could be merged due to the paired-end reads, SPAdes then fills the gaps with a small amount of Ns. The full-data assembly, however, is so fragmented that SPAdes wasn't able to link contigs into scaffolds, my guess is due to reads mapping to multiple erroneous contigs.
Hi genomax,
As requested please find the quality reports for the raw and trimmed reads used for the assembly step.
Even for the BBNorm step (bbnorm.sh in=reads.fq out=normalized.fq target=1000 min=30), I used the trimmed reads to get the normalized downsampled dataset.
Raw_FASTQC
Trimmed FASTQC
Hi @Genomax/h.mon,
I have now reassembled scaffolds.fa (from the SPAdes output) against a related reference genome using the AlignGraph tool. I read that this step will improve the assembly.
In general, which parameters in the QUAST report should be considered when judging whether an assembly is good enough to proceed to downstream analyses?
Hi @Genomax/h.mon,
I have now reassembled scaffolds.fa (from the SPAdes output) against a related reference genome using the AlignGraph tool. I then used QUAST to evaluate the remaining_contigs.fasta from AlignGraph.
How should I interpret this part of the result?
bioinforesearchquestions : I have not used the tools you are referring to above so can't directly assist.
In biostar slack chat with @h.mon we agreed that you are likely not going to get a single closed genome with the data you have. If that is your ultimate goal then you may want to look at alternate sequencing technologies to supplement your Illumina data.
With regard to the metrics for the downsampled and the entire dataset, which contigs can be used for downstream analyses?
I ask because the N50 for the downsampled dataset is 212,867, whereas for the entire dataset it is 3,574.
Is a higher or a lower N50 better?
https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50
N50 can be described as a weighted median statistic such that 50% of the entire assembly is contained in contigs or scaffolds equal to or larger than this value.
A higher N50 is the better result.
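To make the definition above concrete, here is a minimal sketch of how N50 can be computed from a list of contig or scaffold lengths. The function name n50 and the toy lengths are made up for illustration; they are not taken from the QUAST reports in this thread, and QUAST's own implementation may differ in details.

```python
def n50(lengths):
    """Return the N50 of a list of contig/scaffold lengths.

    N50 is the length L such that contigs of length >= L
    together contain at least half of the total assembly length.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig to the shortest, accumulating length.
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop at the first contig that brings us to half of the assembly.
        if running * 2 >= total:
            return length


# Hypothetical toy assembly (total length 300): 80 + 70 = 150 reaches half.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_lengths))  # prints 70
```

In practice the lengths would be read from the sequences in scaffolds.fasta. A fragmented assembly made of many short contigs reaches the halfway point only with short contigs, which is why its N50 is low.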