6.0 years ago
ahmad mousavi
Hi
I sequenced a bacterial genome on an Illumina HiSeq (paired-end, 150 bp), and my library contains 600k reads. After assembling with SPAdes (-k 21,33,55,77,99,111,127) the result is very poor: I got ~3400 contigs. I have no reference genome for this bacterium yet; we only know its family. The GC content is ~70%.
What would you suggest for reducing the number of contigs? Are there any options better than SPAdes for bacterial genome assembly?
Thanks
Not a bioinformatics solution, but your assembly could greatly improve by adding some long read sequencing data from Oxford Nanopore or PacBio, of which the former (MinION) can be reasonably cheap to obtain.
A GC content that high probably also means it's repetitive. It's likely to be a sequencing nightmare. Your only options are to sequence deeper and to use other technologies, as Wouter said.
We all agree with Wouter :), but wouldn't a high GC indicate less repetitive content? TEs (transposons?) are usually rather AT-rich, so they would lower the overall GC, no?
I was thinking more of consecutive repeats (e.g. GCGCGCGCGCG), which would fail to be picked up properly by the sequencer, rather than IS elements etc.
Nevertheless, there are other issues with high GC - the increased strand separation energy might be an issue for library preps and the actual sequencing reaction.
ah, ok, yep agreed in that case.
and totally on the problems (regardless of the 'cause') when extracting/lib-prep/sequencing in high GC situations
Would it be possible to provide some more info on your project? E.g. the estimated genome size (what is the expected coverage?) and whether it is some 'weird/exotic' bacterium?
Sorry, I have no precise idea. We estimate the genome size at ~7 Mb, but that is just an estimate. We aimed for 100x coverage.
So that will give you roughly 25x (assuming the 600k are read pairs: 600k x 2 x 150 bp = 180 Mb over a ~7 Mb genome), on the low side but doable I think.
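The coverage estimate above can be sketched as a small calculation. This is only an illustration: the function name and the `paired` flag are made up for this example, and it assumes every read is the full 150 bp and that the ~7 Mb genome size estimate holds.

```python
def estimated_coverage(n_reads, read_length_bp, genome_size_bp, paired=True):
    """Rough sequencing depth: total sequenced bases divided by genome size.

    If paired is True, n_reads is interpreted as the number of read PAIRS,
    so each entry contributes two reads of read_length_bp.
    """
    reads_per_entry = 2 if paired else 1
    total_bases = n_reads * reads_per_entry * read_length_bp
    return total_bases / genome_size_bp


# Numbers from this thread: 600k reads, 150 bp, ~7 Mb estimated genome size.
as_pairs = estimated_coverage(600_000, 150, 7_000_000, paired=True)
as_single_reads = estimated_coverage(600_000, 150, 7_000_000, paired=False)
print(f"600k read pairs:   ~{as_pairs:.1f}x")
print(f"600k single reads: ~{as_single_reads:.1f}x")
```

If the 600k figure counts individual reads rather than pairs, the depth halves to roughly 13x, which is well below the 100x that was aimed for either way.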
It seems you have used several k-mer sizes. Is the contig number the same across all the k-mer sizes? ahmad mousavi
SPAdes lets you define several k-mers and it automatically selects one based on the data structure, so I have a constant number of contigs.
Did you have a look at the fastg files and the number of contigs for each k-mer? You can also check how good your assembly is with Bandage https://github.com/rrwick/Bandage. ahmad mousavi
No, I don't understand the relationship with the fastq file.
With smaller kmer I got more contigs.
Not fastq, it is fastg (I updated the post). SPAdes outputs contigs for each k-mer. With a higher k-mer size, the contig number goes down, but the relevance of such an assembly is in question. For that reason, you may need to use software like Bandage or QUAST/Icarus to identify the relevant assembly.
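For a quick first look before running QUAST, the per-k-mer comparison can be sketched in plain Python. This is a minimal sketch, not a replacement for QUAST: the helper names are invented for this example, and it only reports contig count, total length and N50, none of which says whether the contigs are correct.

```python
def contig_lengths(fasta_lines):
    """Return the sequence length of every record in FASTA-formatted lines."""
    lengths = []
    current = None  # length of the record being read, None before the first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which the running total (longest first)
    reaches half of the total assembly length. Returns 0 for no contigs."""
    if not lengths:
        return 0
    half = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


def summarize(fasta_lines):
    """Contig count, total assembled bases and N50 for one assembly."""
    lengths = contig_lengths(fasta_lines)
    return {"contigs": len(lengths), "total_bp": sum(lengths), "n50": n50(lengths)}


# Tiny made-up example; real input would be the lines of a contigs FASTA file.
example = [">c1", "ACGTACGT", ">c2", "ACGTAC", ">c3", "ACGT", ">c4", "AC"]
print(summarize(example))
```

To compare assemblies, pass an open file handle for each contigs FASTA (for example `with open(path) as fh: summarize(fh)`), where `path` is whichever per-k-mer or final contigs file your SPAdes run produced. The exact file names inside the per-k-mer folders may differ between SPAdes versions, so check your own output directory.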
SPAdes automatically chooses optimal k-mers. The contigs.fasta that you get as output, the one that is not inside one of the K*** folders, should be the 'optimal' assembly (if I remember correctly). Optimal doesn't necessarily mean fewest contigs though.