Dear colleagues, I need your advice. I have a set of Illumina reads (2x250) for a plasmid with a size of 110 kb.
First, the reads were trimmed with Trimmomatic (adapters were also cut):
LEADING:10 TRAILING:30 SLIDINGWINDOW:4:15 MINLEN:50
Then I tried to assemble them, with and without specifying the k-mer sizes:
spades.py \
--pe1-1 aaa/trimmed/lane1_forward_paired.fastq \
--pe1-2 aaa/trimmed/lane1_reverse_paired.fastq \
--pe1-s aaa/trimmed/lane1_forward_unpaired.fastq \
--pe1-s aaa/trimmed/lane1_reverse_unpaired.fastq \
--careful --plasmid -o aaa/wk
and had no success: I got a lot of very short contigs.
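I also wondered whether passing --pe1-s twice is OK; in case some SPAdes versions keep only the last value of a repeated option, one can merge the two singleton files first and pass --pe1-s once. A sketch (tiny demo files stand in for the real Trimmomatic singleton output):

```shell
# demo stand-ins for lane1_forward_unpaired.fastq / lane1_reverse_unpaired.fastq
printf '@f1\nACGT\n+\nIIII\n' > forward_unpaired.fastq
printf '@r1\nTGCA\n+\nIIII\n' > reverse_unpaired.fastq
# merge so that --pe1-s only has to be given once
cat forward_unpaired.fastq reverse_unpaired.fastq > unpaired.fastq
wc -l < unpaired.fastq   # 8 lines = two 4-line FASTQ records
```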
Then I did normalization (for all four Trimmomatic output files; the target was 20 or 50):
/home/bobii/bbmap/bbnorm.sh \
in=aaa/trimmed/lane1_forward_paired.fastq \
out=aaa/trimmed/20_norm_lane1_forward_paired.fastq \
target=20 min=5
and then repaired the pairing, because I got the SPAdes error message 'Pair of read files aaa/trimmed/20_norm_lane1_forward_paired.fastq and aaa/20_norm_lane1_reverse_paired.fastq contain unequal amount of reads'.
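The repair step can be sketched with BBMap's repair.sh (the output file names here are just placeholders; outs= collects reads whose mate was removed by normalization):

```shell
/home/bobii/bbmap/repair.sh \
  in1=aaa/trimmed/20_norm_lane1_forward_paired.fastq \
  in2=aaa/trimmed/20_norm_lane1_reverse_paired.fastq \
  out1=aaa/trimmed/20_rep_lane1_forward_paired.fastq \
  out2=aaa/trimmed/20_rep_lane1_reverse_paired.fastq \
  outs=aaa/trimmed/20_rep_lane1_singletons.fastq
```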
So eventually I ran the SPAdes assembly on the trimmed, normalized, repaired data, but again had no success: SPAdes doesn't make contigs and writes 'Skipping processing of contigs (empty file)'.
Any ideas? What am I doing wrong? Thank you.
You may need to discard reads which are known not to be plasmid-relevant (e.g. by aligning/mapping to the chromosome and discarding the mapped reads). This may help.
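In case it helps, that filtering step might look something like this (tool choice and file names are just one example; -f 12 keeps only pairs where both mates are unmapped):

```shell
# index the chromosome and map the read pairs to it
bwa index chromosome.fasta
bwa mem chromosome.fasta reads_1.fastq reads_2.fastq \
  | samtools fastq -f 12 -1 unmapped_1.fastq -2 unmapped_2.fastq -
```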
Also consider using
plasmidSPAdes
instead of regular SPAdes. I doubt the trimming is really the issue (errors relating to borked files notwithstanding).

I thought so, but I don't have the chromosome sequence; the plasmid was isolated and sequenced separately. Is there any other way to discard plasmid-irrelevant reads?
When I use SPAdes I pass the --plasmid flag; it is the same thing.
So there is no reference sequence data for this organism at all? (Plasmid or chromosome?)
The coverage is about 100x, which is not large enough to cause such assembly problems, I think.
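For reference, the back-of-the-envelope check behind that number: 110 kb at 100x coverage with 2x250 reads, where each pair contributes ~500 bp, corresponds to roughly this many read pairs:

```shell
# 110,000 bp * 100x coverage / 500 bp per pair
echo $(( 110000 * 100 / 500 ))   # 22000 pairs
```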
Following your advice, I took the sequence of the Pseudomonas putida KT2440 chromosome from GenBank; this strain is the carrier of our plasmid. I mapped the reads to the chromosome and saved the files with the unmapped reads. Then I trimmed these reads and tried:
only paired reads, SPAdes without normalization: 1000 contigs.
only paired reads, SPAdes after normalization (target 20): it iterates through the different k values and fails at k = 77; in the output for k = 55 there are 8 contigs with a total length of ~2000 bp.
Unicycler after normalization (20), only paired reads: 1 contig of 1000 bp.
Unicycler without normalization, only paired reads: 59 contigs of 1000-6500 bp, total length 89 kb.
You could also try out
tadpole.sh
which is part of the BBMap suite and good for assembling small genomes. A guide is here. Also consider the possibility that some parts of the plasmid may not have been sequenced, or have repeats etc. that can't be resolved using just short reads.
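A minimal tadpole.sh invocation could look like this (file names and the k value are just an example; you would want to try a range of k):

```shell
tadpole.sh in1=unmapped_1.fastq in2=unmapped_2.fastq \
  out=tadpole_contigs.fasta k=93
```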
If we assume the raw data is good and representative of the real sequence, do you know anything else about this plasmid at all?
Is it repetitive? Does it carry e.g. phage components?
If it is a sequence of low complexity and high repetition, it may be that you simply will not get a single contig from this, and you may need to do some long-read sequencing instead (but I don't think we need to conclude that just yet, though plasmids are notoriously difficult).
I also suggest that you try the assembly with just the properly paired reads first (do not use the singletons). This is a small plasmid and you may already have oversampled data.
Thanks, but I'm still confused. I did so (with the reads trimmed more strictly than in the starting post):
Unicycler with the reads after Trimmomatic (not normalized, not repaired): thousands of short contigs.
Unicycler with the reads normalized and repaired (without singletons): the resulting file contains 5 contigs with a total length of 11 kb.
SPAdes with the reads normalized and repaired (without singletons), as you suggest: at the moment this is the best result, 47 contigs, but still not what I want.
Do you have some idea of the depth of coverage? If you have several-hundred or even thousand-fold coverage, this can cause de Bruijn graph assemblers like SPAdes to choke. You may need to randomly downsample your data.
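If downsampling does turn out to be needed, BBMap's reformat.sh can randomly subsample pairs (file names and the rate here are just an example):

```shell
reformat.sh in1=reads_1.fastq in2=reads_2.fastq \
  out1=sub_1.fastq out2=sub_2.fastq samplerate=0.2
```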