Plasmid Genome Assembly
8.9 years ago

Which tool is best for assembling a plasmid from Illumina TruSeq data?

I have used Velvet and SPAdes, but the results are not good.

Can anyone please suggest an assembler, or parameters for SPAdes and Velvet, that will give a good plasmid assembly?

plasmid • 5.7k views
8.9 years ago
piet ★ 1.9k

"I am getting around 700 contig and 6 mb genome"

6 Mb is a typical size for a whole bacterial genome. A bacterial genome consists of a chromosome (usually only one) and one or several plasmids. If you prepared the DNA from a single colony, you should get fewer than 100 contigs. 700 contigs indicate that either your DNA was not homogeneous (e.g. contaminated with a second strain of bacteria) or the coverage of the chromosome is very low.

How did you prepare the DNA? Did you include any step to separate plasmid DNA from chromosomal DNA? Such separations are never 100% selective! My guess is that enough chromosomal DNA remained and was sequenced at low coverage, which is why the chromosomal DNA is dispersed over hundreds of contigs.

The FASTA file emitted by SPAdes reports the coverage of every contig in its header line (the cov_ field):

>NODE_1_length_711720_cov_34.8955_ID_4768
>NODE_24_length_3121_cov_199.103_ID_4814

Please sort your contigs by coverage, then inspect the contigs with the highest coverage. They will presumably contain plasmid sequences (or the highly redundant rRNA genes).
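As a rough illustration of the sorting step, here is a minimal Python sketch. It assumes the header layout shown in the two example headers above (NODE_<n>_length_<len>_cov_<cov>, optionally followed by _ID_<id>); the function names are made up for this example and are not part of SPAdes.

```python
import re

# Header layout assumed from the examples in this thread:
# >NODE_<n>_length_<len>_cov_<cov>[_ID_<id>]
HEADER_RE = re.compile(r"^>?NODE_(\d+)_length_(\d+)_cov_([0-9.]+)")


def parse_header(header):
    """Return (name, length, coverage) for one SPAdes-style FASTA header."""
    text = header.strip()
    match = HEADER_RE.match(text)
    if match is None:
        raise ValueError(f"not a SPAdes-style header: {header!r}")
    return text.lstrip(">"), int(match.group(2)), float(match.group(3))


def contigs_by_coverage(fasta_lines):
    """Pick out the header lines and sort them by coverage, highest first."""
    records = [parse_header(line) for line in fasta_lines if line.startswith(">")]
    return sorted(records, key=lambda rec: rec[2], reverse=True)


if __name__ == "__main__":
    example = [
        ">NODE_1_length_711720_cov_34.8955_ID_4768",
        "ACGT",
        ">NODE_24_length_3121_cov_199.103_ID_4814",
        "ACGT",
    ]
    for name, length, cov in contigs_by_coverage(example):
        print(f"{cov:10.2f}  {length:8d}  {name}")
```

To run it on a real assembly, pass an open file handle for the contigs FASTA to contigs_by_coverage() instead of the example list.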

8.9 years ago
Adrian Pelin ★ 2.6k

The assemblers you tried should be able to do the job, provided that you have tried a reasonable number of assemblies with varying parameters. Whether you are assembling a plasmid or not makes little difference. Please provide more information about your dataset: for instance, the sequencing depth of the plasmid, the read length, whether the reads are paired, and whether anything else is being sequenced. It would also be good to show the command lines you have already tried with Velvet and SPAdes, and to tell us why the results are not good.

I suspect the problem is the data, not the assembler. I am dealing with plasmid assembly myself and I see problems with variable coverage. This probably has something to do with the biology of plasmid replication, since the variability is consistent across two different sequencing methods.


Many thanks for the reply.

Sequencing depth of plasmid => 550X

Length of reads => 150

Paired or not paired => paired

Is there anything else that is being sequenced => no

Command lines:

1. Velvet =>

VelvetOptimiser.pl -t 50  --p Sample12 --d sample12 --a -o "-min_contig_lgth 200 -scaffolding yes"  -f '-fastq -shortPaired  R1_001_150.fastq R2_001_150.fastq'

AND

velveth 1_Output_velveth 69,73,2 -fastq -shortPaired -separate R1_001.fastq_filtered R2_001.fastq_filtered
velvetg inputkmer -cov_cutoff auto -read_trkg yes -min_contig_lgth 200 -amos_file yes -ins_length auto  -exp_cov auto -ins_length_sd 50 -scaffolding yes

2. SPAdes =>

SPAdes-3.5.0-Linux/bin/spades.py -o SO_5216_BND11_S11_L001_1  -k 21,33,55,77 --careful --only-assembler -1 R1_001_150.fastq -2 R2_001_150.fastq -t 20

AND

spades.py -o S11_L001 -1 R1_001_150.fastq_filtered -2 R2_001_150.fastq_filtered -t 30 -k 41,43,45,47,49,51,53,55,57,59,61,63,65,67,69,71,73,75,77,79,81 --cov-cutoff auto

The output I am getting is a genome of around 6 Mb in ~700 scaffolds, which is far from the expected result.


My recommendations:

  • Have a look at the k-mer distribution to check for sequencing bias and contamination, and to see whether the coverage really is 550X. Expect a second, smaller peak at ~1100X, which represents inverted repeats of the plasmid.
  • Subsample to 100X (take only ~20% of your read data).
  • Get the latest SPAdes (3.6.x) and run it with default settings, including error correction.
  • Map the reads to the contigs and remove anything with low coverage, as it is likely contamination.
  • Plasmids usually contain one or more inverted repeats that are part of the replication mechanism. The assembler often cannot resolve these properly, but you can identify these regions/contigs by coverage as well: it should be double. You will probably have to join these contigs together by hand.
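The subsampling step in the list above can be sketched in Python as follows. This is only an illustration under stated assumptions: the two FASTQ files list mates in the same order, every record is exactly four lines, and the function names and the output file names in the comment are invented for this example. Dedicated tools exist for this job; the sketch just shows the idea of keeping each pair with probability target/current depth while keeping R1 and R2 in sync.

```python
import random


def subsample_fraction(current_depth, target_depth):
    """Fraction of read pairs to keep to go from current_depth to target_depth."""
    if current_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    return min(1.0, target_depth / current_depth)


def read_fastq(handle):
    """Yield 4-line FASTQ records (tuples of lines) from an open text handle."""
    while True:
        record = tuple(handle.readline() for _ in range(4))
        if not record[0]:  # empty string means end of file
            return
        yield record


def subsample_pairs(r1_in, r2_in, r1_out, r2_out, fraction, seed=1):
    """Keep each read pair with probability `fraction`; mates stay in sync."""
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    kept = 0
    for rec1, rec2 in zip(read_fastq(r1_in), read_fastq(r2_in)):
        if rng.random() < fraction:
            r1_out.writelines(rec1)
            r2_out.writelines(rec2)
            kept += 1
    return kept


# Example use (output file names are hypothetical):
# frac = subsample_fraction(550, 100)   # roughly 0.18, i.e. the "~20%" above
# with open("R1_001_150.fastq") as a, open("R2_001_150.fastq") as b, \
#         open("R1_sub.fastq", "w") as c, open("R2_sub.fastq", "w") as d:
#     subsample_pairs(a, b, c, d, frac)
```

One decision is drawn per pair rather than per read, so a pair is either kept whole or dropped whole, which is what a paired-end assembler expects.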

So there is something else being sequenced: the nuclear genome. If that genome is available, or if you can assemble it, I would try to filter out the reads that map to it (provided that the nuclear genome and the plasmid do not share regions of high identity). It is very odd that you get only 2 contigs, i.e. that you can assemble the nuclear genome so well, if not fully, and yet cannot assemble the plasmid.
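The read-filtering idea could look roughly like the Python sketch below. It assumes you have already mapped the reads to the chromosome with a mapper of your choice and have its output as a SAM text file; it relies only on the standard SAM columns (QNAME in column 1, FLAG in column 2, bit 0x4 meaning "unmapped"). The function names are hypothetical, and the FASTQ records are assumed to be 4-line tuples whose first line is "@name", optionally ending in /1 or /2.

```python
def names_mapping_to_reference(sam_lines):
    """Return names of reads with at least one aligned record (FLAG bit 0x4 unset)."""
    mapped = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # SAM header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip malformed or empty lines
        name, flag = fields[0], int(fields[1])
        if not flag & 0x4:
            mapped.add(name)
    return mapped


def read_name(fastq_header):
    """Extract the read name from a FASTQ header line, dropping a /1 or /2 suffix."""
    name = fastq_header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def drop_mapped(records, mapped_names):
    """Yield only the FASTQ records whose read name is not in mapped_names."""
    for record in records:
        if read_name(record[0]) not in mapped_names:
            yield record
```

Because both mates share one name, a pair is dropped as soon as either mate aligns to the chromosome. That is the conservative choice here; it will also discard genuine plasmid reads in regions shared with the chromosome, which is the caveat raised above.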


All of the suggestions given so far are good. I would add that you can try our tool Recycler. It takes into consideration some of the same features as suggested here - coverage, circularity of sequences, and paired end mapping. I posted more details here (and in the links therein): Recycler for plasmid assembly

8.9 years ago

I would generally recommend SPAdes as the best assembler for things like plasmids. But given that you have tried it, what specifically is the problem? Do you get too many contigs, or does it not assemble at all?


Yes, it is assembling. I am getting around 700 contigs and a 6 Mb genome, which is far from our expectations.


I would say this is a common problem with today's "short" read sequencing technology. A colleague of mine tried to sequence a small genome; it took him 7 years to complete it fully, and he eventually did so by using PacBio sequencing.

I think you need more than short reads. As in your case, you eventually find that the assembly does not improve even when you increase the coverage of what you are sequencing.

Mate-pair reads will help by giving you better scaffolding of your contigs. If your plasmid is a commercial one, a comparison with trusted, similar plasmids using programs like Mauve will help a lot with ordering the contigs. You can also combine several kinds of sequence data, such as regular Illumina, mate-pair, long Illumina reads and/or PacBio reads. Otherwise, I think you are facing a hard task.
