Hi,
I have a bacterial sample that I want to align to a reference genome. However, no reference genome is available for this particular strain, so I need to generate one from my 2x150 paired-end FASTQ files. I have a few questions related to this:
1. From my understanding, the first step after trimming adapters is to run de novo assembly to get contigs, and then generate a consensus sequence from the assembly that represents the reference genome. Could you let me know whether this is correct?
2. If 1 above is correct: I know there are different assembly tools, for example Velvet, Trinity, and SPAdes. I checked SPAdes to see whether it outputs a consensus sequence and couldn't find this. Is there a tool that does both assembly and consensus generation? Or is there a separate tool for generating a reference genome, either from an assembly or from FASTQ files?
3. If 1 above is not correct, could you please help me work out what I need to do to generate a reference sequence from paired-end FASTQ files?
Thanks
Thank you for the suggestion. I ran SPAdes on the paired-end reads and got a contigs FASTA file. The generic reference genome is ~2 Mbp and my largest contig is 541,750 bp. Should I select the largest contigs until I reach 2 Mbp and call it good?
PacBio sequencing is a good approach, but we don't have access to this technology :)
No, you need more. I'd keep all contigs longer than 1 kb for mapping.
And what's the next step after mapping in your plan?
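In case it helps, the 1 kb length filter suggested above can be sketched in a few lines of plain Python. This is only a minimal sketch: the function names (`read_fasta`, `filter_contigs`, `filter_fasta_file`) and the output file name are made up for illustration, and it assumes the contigs are in an ordinary FASTA file such as the `contigs.fasta` that SPAdes writes.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(records, min_len: int = 1000) -> List[Tuple[str, str]]:
    """Keep only contigs strictly longer than min_len bases."""
    return [(h, s) for h, s in records if len(s) > min_len]


def filter_fasta_file(in_path: str, out_path: str, min_len: int = 1000) -> int:
    """Write the filtered contigs to out_path; return the total bases kept."""
    with open(in_path) as fin:
        kept = filter_contigs(read_fasta(fin), min_len)
    with open(out_path, "w") as fout:
        for header, seq in kept:
            fout.write(f">{header}\n")
            # Wrap sequence lines at 60 columns (a common FASTA convention).
            for i in range(0, len(seq), 60):
                fout.write(seq[i:i + 60] + "\n")
    return sum(len(seq) for _, seq in kept)
```

Calling `filter_fasta_file("contigs.fasta", "contigs.min1kb.fasta")` (hypothetical paths) returns the total number of bases kept, which you can compare against the expected ~2 Mbp genome size.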
Okay, thanks so much. The plan is to map samples to this reference and find mutations. Is there anything else that you think could help the analysis? :) I appreciate your help.
For the experiment, a control sample is very important. It should be either the original frozen strain or the group that was not exposed to the experimental conditions.
To find the closest reference, you can compare your contigs against RefSeq/GenBank sequences with BLAST (less recommended, because it is a local alignment) or with Mash/Sourmash (whole-genome distance).
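To give a feel for what "whole-genome distance" means here, below is a toy Python sketch of a MinHash-style comparison in the spirit of Mash. It is not the Mash or Sourmash implementation: the hash function, the default k-mer and sketch sizes, and all function names are assumptions chosen for illustration. Only the distance formula, D = -ln(2j / (1 + j)) / k, follows the published Mash definition. For real data you would run the actual tools.

```python
import hashlib
import math

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer: str) -> str:
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    rc = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, rc)


def sketch(seq: str, k: int = 21, size: int = 1000) -> list:
    """Bottom-s MinHash sketch: the `size` smallest hashes of canonical k-mers."""
    seq = seq.upper()
    hashes = set()
    for i in range(len(seq) - k + 1):
        kmer = canonical(seq[i:i + k])
        digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return sorted(hashes)[:size]


def jaccard_estimate(a: list, b: list, size: int = 1000) -> float:
    """Estimate the Jaccard index from two bottom-s sketches."""
    merged = sorted(set(a) | set(b))[:size]
    if not merged:
        return 0.0
    shared = set(a) & set(b)
    return sum(1 for h in merged if h in shared) / len(merged)


def mash_distance(j: float, k: int) -> float:
    """Mash distance from a Jaccard estimate j and k-mer size k."""
    if j <= 0:
        return 1.0
    return min(1.0, -math.log(2 * j / (1 + j)) / k)
```

A distance near 0 means the two genomes share almost all of their k-mers. The closest reference is simply the database genome with the smallest distance to your contigs.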
There will probably be many differences between your sample and the closest reference. Don't worry: I've written a tool, breseq-rm-bg, to remove background (control) mutations from the breseq results.
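The idea behind background removal can be illustrated with a short Python sketch. This is not the breseq-rm-bg code and it does not parse breseq output. The `Mutation` record and the `remove_background` function are hypothetical, and the sketch assumes each mutation has already been reduced to a (contig, position, ref, alt) tuple.

```python
from typing import Iterable, List, NamedTuple


class Mutation(NamedTuple):
    """Hypothetical minimal record for one called mutation."""
    contig: str
    position: int
    ref: str
    alt: str


def remove_background(sample: Iterable[Mutation],
                      control: Iterable[Mutation]) -> List[Mutation]:
    """Return sample mutations that are not also present in the control."""
    background = set(control)
    return [m for m in sample if m not in background]


# Toy data: one mutation is shared with the control, one is sample-specific.
control_calls = [Mutation("contig_1", 1200, "A", "G")]
sample_calls = [
    Mutation("contig_1", 1200, "A", "G"),  # also in control: background
    Mutation("contig_2", 530, "C", "T"),   # only in sample: candidate
]
candidates = remove_background(sample_calls, control_calls)
```

Mutations shared with the control most likely reflect differences between your strain and the reference rather than changes that arose during the experiment, so only the sample-specific ones are kept as candidates.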