Hi everyone! I'm quite new to this field, so I need help because I don't understand the whole pipeline for my task. My lab has sequenced (Illumina, paired-end) two strains of M. tuberculosis. One of them is expected to be the control; the second contains mutations that I have to find. These mutations could be SNPs, deletions, or large translocations. I tried to assemble the genomes de novo using SPAdes (within Unicycler), but there are a lot of contigs and it's difficult to compare them. Now I'm starting to think that I could use the M. tuberculosis genome from NCBI (my control strain should be very similar to that one), but I don't really understand whether that is correct in this case. If so, should I use reference-guided assembly and provide the M. tuberculosis genome as trusted contigs to SPAdes? Or should I just map my final contigs onto the reference genome? My second thought was to sequence my strains again, this time with Nanopore, to generate long reads and finish the assembly. Please tell me which pipeline I should use in my case. How can I find the differences without accidentally losing information during assembly? Thanks to all!
Have you checked your assemblies with QUAST (LINK)? If you can post stats on your assemblies here, it may be possible for us to give you some advice. How much data did you use for the assemblies? Contrary to popular belief, having too much data can be detrimental to getting good assemblies. If your assemblies can't be improved further, then sequencing again with long reads, as you suggest, may be what you need to do.
But even then it may still be possible to identify SNPs and other variations from the data you have in hand, as long as you can identify genes/regions with certainty. You may not get a complete answer, but at least a usable one.
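On the "how much data" question, a rough way to check is to estimate the mean coverage from the read count, read length, and genome size. The sketch below is only an illustration: the read numbers in the example are made up, the function names are hypothetical, and the 100x target is an assumption, not a recommendation from this thread.

```python
def mean_coverage(read_pairs, read_length, genome_size):
    """Approximate mean depth for paired-end data (two reads per pair)."""
    return read_pairs * 2 * read_length / genome_size


def subsample_fraction(current_coverage, target_coverage):
    """Fraction of read pairs to keep to reach the target depth (capped at 1.0)."""
    return min(1.0, target_coverage / current_coverage)


# Made-up example: 5 million pairs of 100 bp reads on a 4.4 Mb genome.
coverage = mean_coverage(5_000_000, 100, 4_400_000)
fraction = subsample_fraction(coverage, target_coverage=100)  # 100x is an assumed target
print(f"coverage ~{coverage:.0f}x, keep ~{fraction:.2f} of the read pairs")
```

If the estimate comes out far above what the assembler needs, randomly subsampling the read pairs to the computed fraction and reassembling is one thing to try.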
Thank you for your answer! Yes, here is the QUAST report for my two assemblies: (LINK). It seems to me that the assemblies are quite good: I have about 110 contigs for each sample (I used about 6 and 4 million pairs of 100 bp reads, respectively), and my N50 values are 125378 and 125378 (genome length 4.4 Mb). I don't completely understand how to interpret the N50 value; I know that bigger is better, but that's all. I have now tried to compare my assemblies with each other and with the reference genome using Mauve, but it doesn't look very nice: a lot of contigs appear misassembled.
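On interpreting N50: it is the contig length such that contigs of that length or longer together contain at least half of all assembled bases. A minimal Python sketch, using made-up contig lengths rather than the ones from these assemblies (the function name `n50` is just for illustration):

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Toy example: total = 290, half = 145; 80 + 70 = 150 >= 145, so N50 is 70.
print(n50([80, 70, 50, 40, 30, 20]))
```

So an N50 of about 125 kb means that half of the assembled bases sit in contigs of roughly 125 kb or longer.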
Something weird is going on: you have two good assemblies, so it shouldn't look that scrambled. Did you map the contigs onto the reference genome before aligning them with Mauve? It looks like you did not reorder the contigs of your assemblies against the reference, because they appear to be ordered by size (as they came out of SPAdes). 🤔
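To make the reordering idea concrete, here is a toy Python sketch. It assumes you already have, from some aligner, a placement for each contig on the reference (start position and strand); the `placements` input and the function names are hypothetical. A real contig-ordering tool also has to handle contigs that split across several reference locations, which this sketch does not.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A/C/G/T only; other symbols pass through)."""
    table = str.maketrans("ACGTacgt", "TGCAtgca")
    return seq.translate(table)[::-1]


def reorder_contigs(contigs, placements):
    """Order contigs by reference start; flip those placed on the minus strand.

    contigs:    dict of contig name -> sequence
    placements: dict of contig name -> (reference_start, strand), strand is "+" or "-"
    Contigs without a placement are appended at the end, unchanged.
    """
    placed = sorted((n for n in contigs if n in placements),
                    key=lambda n: placements[n][0])
    ordered = []
    for name in placed:
        seq = contigs[name]
        if placements[name][1] == "-":
            seq = reverse_complement(seq)
        ordered.append((name, seq))
    ordered.extend((n, contigs[n]) for n in contigs if n not in placements)
    return ordered


# Toy input: c2 maps upstream of c1 and on the minus strand; c3 is unplaced.
contigs = {"c1": "AAAA", "c2": "AACG", "c3": "GG"}
placements = {"c1": (5000, "+"), "c2": (100, "-")}
print(reorder_contigs(contigs, placements))
```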
Yes, thank you! Of course, that was stupid of me. Now it's much better.
Hahaha, don't say that, everybody makes mistakes. Now you can map your reads onto your assemblies with an aligner (like Bowtie2) to see whether those recombination blocks are real or misassemblies.
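A possible sequence of commands for that mapping step, written as a small Python helper that only builds the command lines and does not run them. The file names are placeholders, the helper name is hypothetical, and sorting and indexing with samtools is an assumption added here so the result can be opened in a genome browser; it was not part of the suggestion above.

```python
import shlex


def mapping_commands(assembly_fasta, reads_1, reads_2, prefix, threads=4):
    """Build (but do not run) commands to map paired reads back onto an assembly."""
    index = prefix + "_index"
    sam = prefix + ".sam"
    bam = prefix + ".sorted.bam"
    return [
        # Build a Bowtie2 index from the assembled contigs.
        ["bowtie2-build", assembly_fasta, index],
        # Align the paired-end reads against that index, writing SAM output.
        ["bowtie2", "-p", str(threads), "-x", index,
         "-1", reads_1, "-2", reads_2, "-S", sam],
        # Assumed extra steps: coordinate-sort and index the alignments.
        ["samtools", "sort", "-o", bam, sam],
        ["samtools", "index", bam],
    ]


# Placeholder file names; replace them with your own assembly and read files.
for cmd in mapping_commands("mutant_contigs.fasta", "mutant_R1.fastq.gz",
                            "mutant_R2.fastq.gz", "mutant"):
    print(shlex.join(cmd))
```

Around a suspected breakpoint, check whether read pairs span the junction with the expected orientation and insert size; a sharp drop in coverage or many discordant pairs at that position points towards a misassembly rather than a real rearrangement.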