Hello all,
I have two closely related bacterial strains (mutants of the same wildtype strain for which I have a fully sequenced, closed and annotated genome) that exhibit an interesting phenotype. We have sequenced the genomes of the two strains of interest with MiSeq. I have been able to identify two variants in the sequences. One is a SNP that appears to have arisen in the parent of the two strains during manipulation in the lab and a known lab created deletion, neither of which account for the phenotype.
I'm wondering if a variant may not be hiding in the low or no coverage areas of the sequence. I'm wondering if there is some way to suss that out. I have used bwa mem, bbmap and bowtie alignments with their default settings. I'm thinking that perhaps by changing some of the settings I might get a bit more of the unmapped/poorer quality reads to map to these no coverage regions? I'm hoping someone can help me with some suggestions of what settings to adjust.
Thank you.
My alignment commands:
bwa mem -M -R 'RG data here' reference.fasta seq_R1_001.fastq seq_R2_001.fastq >aligned_reads.sam
bbmap.sh in1=seq_R1_001.fastq in2=seq_R2_001.fastq out=mapped.bam outu=unmapped.fq
bowtie2 -x reference -1 seq_trimmed1.fq -2 seq_trimmed2.fq -S align.sam
If you don't have coverage for certain regions in your data file then fiddling with parameters may not help. Doing more sequencing or additional libraries or using a different technology may be needed. While you could start playing fast and loose with parameters it may in turn affect regions where you do have good coverage/proper alignments.
What fraction of reads are remaining un-mapped with the three different alignments?
less than 1% of reads are going unmapped. I was thinking of just sequencing a few of the regions with Sanger
Have you considered indels or larger structural variants? You may have your variant hiding in plain sight if you're not looking for indels or SVs.
I would do an assembly - SPADES - then annotate roughly -prokka- then compare the nucleotide or aa sequences of the annotations - proteinortho.
You can also look at deleted genes in the two genomes by either manually scanning or using bedtools bamcoverage and filtering for 0 coverage regions.
Missing or gained genes might be more important than SNVs, especially if you only have 2 SNVs.
I've tried looking for indels specifically. I used pindel, which supposedly finds indels, as well as bbmap which was recommended to me here on biostars. I was able to identify an known insertion in another genome that I was working with, but nothing else popped up.
It's still possible that you have other types of SVs (e.g. inversions, duplications, translocations). Also, pindel can't detect insertions larger than ~100bp (it depends on the read length) so if you have a large insertion (e.g. of a new gene) you'll still miss it. A caller designed to detect complex rearrangements such as GRIDSS (displaimer: my software), would eliminate the other possibility but won't completely resolve the large insertion (new gene) issue. For that you'll want do to assembly. Since you genome is so small, assembly is not particularly problematic. You can either align the raw reads of one to the other, the contigs (e.g LASTZ), or a three-way to alignment to your reference genome.
Thanks for the suggestions re: SPADES, prokka etc. I'll give it a go!