Hi, I have sequenced a bacterial genome for which I have a reference genome (98% similarity).
I used bwa to map the reads to the reference genome: bwa mem reference.fa reads.R1.fq.gz reads.R2.fq.gz
I'm failing to recover the plasmid, although I know it's there. I ran an assembly with megahit and aligned the contigs to the plasmid, and I recover 88% of the plasmid.
What I don't understand is why the reads do not map to the plasmid. Output of samtools flagstat PLASMID.sorted.bam:
1435694 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
0 + 0 mapped (0.00% :N/A)
1435694 + 0 paired in sequencing
717847 + 0 read1
717847 + 0 read2
0 + 0 properly paired (0.00% : N/A)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
If I check the reads after the genome assembly, I get pretty good mapping:
1122036 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
574 + 0 supplementary
0 + 0 duplicates
1116358 + 0 mapped (99.49% : N/A)
1121462 + 0 paired in sequencing
560767 + 0 read1
560695 + 0 read2
1108556 + 0 properly paired (98.85% : N/A)
1110902 + 0 with itself and mate mapped
4882 + 0 singletons (0.44% : N/A)
1598 + 0 with mate mapped to a different chr
1210 + 0 with mate mapped to a different chr (mapQ>=5)
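To put the two flagstat reports side by side, the total and mapped counts can be pulled out with a few lines of awk. This is only a sketch: flagstat_summary is a hypothetical helper name, and it assumes the report layout shown above. Note that the totals differ (1435694 vs 1122036), so the two BAM files do not hold the same set of reads.

```shell
# Print "total<TAB>mapped<TAB>percent mapped" for a samtools flagstat report.
# flagstat_summary is a hypothetical helper, not part of samtools.
flagstat_summary() {
  awk '
    / in total \(/                { total = $1 }   # "N + 0 in total (...)"
    /^[0-9]+ \+ [0-9]+ mapped \(/ { mapped = $1 }  # only the plain "mapped" line
    END {
      if (total > 0) printf "%d\t%d\t%.2f\n", total, mapped, 100 * mapped / total
      else           printf "0\t0\tNA\n"
    }' "$@"
}

# Example use (reads the report from stdin when no file is given):
#   samtools flagstat PLASMID.sorted.bam | flagstat_summary
```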
Any idea why I'm missing the plasmid when aligning the clean reads directly to the plasmid?
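One way to see where the reads actually land is to map them against the chromosome and the plasmid together and then count reads per sequence. The following is a minimal sketch, assuming bwa and samtools are installed. The function names, the output prefix, and the file names in the usage comment are hypothetical, and this is not necessarily how the BAM files above were produced.

```shell
# Map paired reads against chromosome + plasmid in one reference,
# then report read counts per reference sequence.
# All names below are illustrative assumptions.
map_to_combined() (
  set -eu
  genome=$1; plasmid=$2; r1=$3; r2=$4; prefix=$5
  # Both FASTA files should end with a newline so records do not run together.
  cat "$genome" "$plasmid" > "$prefix.combined.fa"
  bwa index "$prefix.combined.fa"
  bwa mem "$prefix.combined.fa" "$r1" "$r2" \
    | samtools sort -o "$prefix.sorted.bam" -
  samtools index "$prefix.sorted.bam"
  # idxstats columns: name, length, mapped read-segments, unmapped read-segments
  samtools idxstats "$prefix.sorted.bam"
)

# Reduce idxstats output to: name, mapped reads, mapped reads per kb.
summarise_idxstats() {
  awk -F'\t' '$1 != "*" {
    printf "%s\t%d\t%.1f\n", $1, $3, ($2 > 0 ? 1000 * $3 / $2 : 0)
  }'
}

# Example use (hypothetical file names):
#   map_to_combined reference.fa plasmid.fa reads.R1.fq.gz reads.R2.fq.gz run1 \
#     | summarise_idxstats
```

If the plasmid line shows reads here but the plasmid-only BAM shows none, the problem is more likely in how that BAM was generated than in the reads themselves.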
Can you try using bbsplit.sh from the BBMap suite, with the plasmid and genome sequences at the same time, to bin the reads? You have not said what length your reads are (are they trimmed/cleaned of adapters?). Pay attention to the settings for reads that multi-map (across and within the genomes provided).
Thanks for your response, genomax. It's Illumina 2×250 bp sequencing of a single bacterial genome. It turns out that the average insert size is 300, which is not that good, but I have quality-trimmed all sequences and removed adapters and the phiX genome.
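For reference, the bbsplit.sh run suggested above could look roughly like the sketch below. It assumes BBMap is installed and on the PATH. bin_reads and all output file names are hypothetical, and the flags are worth checking against the help text printed by bbsplit.sh for the installed version.

```shell
# Sketch only: bin paired reads against chromosome and plasmid at once.
# bin_reads and the output names are illustrative assumptions.
bin_reads() (
  set -eu
  genome=$1; plasmid=$2; r1=$3; r2=$4
  # ambiguous  : what to do with reads that have several equally good sites
  # ambiguous2 : what to do with reads that fit more than one reference;
  #              "toss" discards them, "all" would write them to every bin
  bbsplit.sh in1="$r1" in2="$r2" \
    ref="$genome","$plasmid" \
    basename=binned_%.fq.gz \
    outu1=unbinned.R1.fq.gz outu2=unbinned.R2.fq.gz \
    ambiguous=best ambiguous2=toss \
    refstats=refstats.txt
)

# Example use (hypothetical file names):
#   bin_reads reference.fa plasmid.fa reads.R1.fq.gz reads.R2.fq.gz
```

The ambiguous2 setting matters here: if the plasmid shares sequence with the chromosome, tossing such reads will lower the plasmid count, so it may be worth comparing a run with ambiguous2=all.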
Here is the output from bbsplit
The idea behind sequencing that specific strain is that it's phenotypically different from the reference, so the aim is to look at the genome and find out whether there is a genomic event that might explain this phenotypic difference.