Dear all, I just finished my very first virus assembly with SPAdes. Below are the QUAST results and the command lines I used.
My questions are:
1. What is a normal N50? My N50 is 742; is that very low? 2. How should I choose the best k-mer length?
Thanks a lot.
Result
contigs 478
contigs (>= 0 bp) 11005
contigs (>= 1000 bp) 78
contigs (>= 5000 bp) 1
contigs (>= 10000 bp) 1
contigs (>= 25000 bp) 0
contigs (>= 50000 bp) 0
Largest contig 14128
Total length 390242
Total length (>= 0 bp) 3584698
Total length (>= 1000 bp) 138381
Total length (>= 5000 bp) 14128
Total length (>= 10000 bp) 14128
Total length (>= 25000 bp) 0
Total length (>= 50000 bp) 0
N50 742
N75 580
L50 145
L75 296
GC (%) 50.03
Mismatches
N's 0
N's per 100 kbp 0
Command line
$bbduk in=$r1 in2=$r2 out=trimmed.fq ktrim=r k=23 mink=11 hdist=1 ref=$bbduk_ref tbo tpe
$bbnorm in=trimmed.fq out=normalized.fq target=100 min=5
$spades -k 21,41,71,101,127 -o spades_out --12 trimmed.fq --careful
$quast spades_out/contigs.fasta -o quast_out_contigs -t 16 -l 4038-Roc
Hello archie.w.lee,
Please use the formatting bar (especially the code option) to present your post better. I've done it for you this time. Thank you!
Thank you very much!
k-mer values depend on the average read length in the data. Have you tried running SPAdes without specifying the k-mers and letting it choose the best values based on the read length? What is the expected genome size?
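To illustrate why the k-mer list depends on read length, here is a minimal Python sketch that checks a candidate list before it is passed to SPAdes. It assumes the documented SPAdes constraints (odd values, below 128, in ascending order) and adds the observation that a k-mer cannot be extracted from a read shorter than k. The function name `check_kmers` and the read lengths used are hypothetical; the thread does not state the actual read length.

```python
def check_kmers(kmers, read_length, max_k=127):
    """Return a list of human-readable problems with a k-mer list.

    read_length is an assumed typical read length in bp; it is not
    taken from the thread.
    """
    problems = []
    # The list should be strictly ascending with no duplicates.
    if list(kmers) != sorted(set(kmers)):
        problems.append("k values should be strictly ascending")
    for k in kmers:
        if k % 2 == 0:
            problems.append(f"k={k} is even")
        if k > max_k:
            problems.append(f"k={k} exceeds {max_k}")
        # A read shorter than or equal to k contributes no usable k-mers.
        if k >= read_length:
            problems.append(
                f"k={k} is not shorter than the reads ({read_length} bp)"
            )
    return problems


# The k-mer list from the original command, checked against two
# hypothetical read lengths.
print(check_kmers([21, 41, 71, 101, 127], read_length=150))
print(check_kmers([21, 41, 71, 101, 127], read_length=100))
```

With 150 bp reads the list passes; with 100 bp reads, k=101 and k=127 are flagged. Leaving out `-k` lets SPAdes pick the values itself.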
Thanks. I am trying the following command and will update the results when it finishes. The genome is about 9 kb.
archie.w.lee: Actually, you should try using tadpole.sh from BBMap. It is supposed to work very well with viral genome assemblies.
Thanks, I will try and update the results.
Dear all, I BLASTed my assembled contigs and all of them hit Homo sapiens mitochondrion. Is that host genome contamination? Thanks.
Seems likely, doesn't it?
Please either use the ADD COMMENT button to ask this under the relevant answer / comment, or open a new question altogether. The Add your answer space should be reserved for answers to the top-level (original) question.
Is the host human? Then yes, it is host DNA (or RNA?) you are seeing. How did you perform DNA extraction? Did you enrich for virus particles somehow? How many contigs did you obtain from the assembly, and are really all of them from mitochondrial DNA?
Wow, how did you know the assembled contigs are all from human mitochondria? There are 478 contigs in total. The largest one is 14128 bp, but the virus genome is only about 9 kb.
I BLASTed about 2k raw reads and 80-100 randomly picked assembled contigs; they all belong to human.
BLAST all contigs. In my experience, just a few, or even only one, will be the viral genome; the rest is host contamination.
If you have a reference viral genome, you can use bbduk.sh to filter viral reads, or even bbsplit.sh with human and viral genomes to separate reads pertaining to each genome. You could then assemble using just the viral reads.
Thank you so much. I will do that. I just checked the bbduk manual; should I use the following command line? Could you please give me some pointers? The virus genome is about 9 kb.
Don't use literal=; use the virus genome as the reference, with ref=virus.fa.
Thank you so much for your reply and the information.
SPAdes runs several different k-mer lengths that it determines automatically from your sequence data. Let it do the hard work.
As for N50 length, that'll depend on your sequencing platform.
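For context on what the N50 figure means, here is a minimal Python sketch of the usual definition: sort the contigs from longest to shortest and report the length of the contig at which the running total first reaches half of the total assembly length. L50 is the number of contigs needed to get there. The function name `nx_stats` and the toy lengths are illustrative only, not the contigs from this assembly. Note also that QUAST by default only counts contigs of at least 500 bp for these statistics, which would explain why the report lists 478 contigs alongside 11005 contigs (>= 0 bp).

```python
def nx_stats(lengths, fraction=0.5):
    """Return (Nx, Lx) for a list of contig lengths.

    fraction=0.5 gives N50/L50, fraction=0.75 gives N75/L75.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    threshold = sum(ordered) * fraction
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        # The first contig that pushes the running total past the
        # threshold defines Nx; its rank is Lx.
        if running >= threshold:
            return length, count


# Toy example: total length 290, so the N50 threshold is 145.
toy_lengths = [80, 70, 50, 40, 30, 20]
print(nx_stats(toy_lengths))        # N50, L50
print(nx_stats(toy_lengths, 0.75))  # N75, L75
```

With host contamination dominating the assembly, the N50 mostly describes fragmented human contigs, so it says little about how well a roughly 9 kb viral genome was assembled.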