Question

How can I find mutations in a bacterium without a reference genome, using paired-end short reads?

0

3.0 years ago

eli_bayat ▴ 90

Hi,

I have a bacterial sample that I want to align to a reference genome. However, no reference genome is available for this particular strain, so I need to generate one from my 2×150 bp paired-end FASTQ files. I have a few questions about this:

1. As I understand it, the first step after trimming adapters is to run a de novo assembly to get contigs, and then generate a consensus sequence from the assembly to serve as the reference genome. Is that correct?
2. If point 1 is correct: I know there are different assemblers, for example Velvet, Trinity, and SPAdes. I checked SPAdes to see whether it outputs a consensus sequence and couldn't find this. Is there a tool that does both assembly and consensus generation? Or is there a separate tool that generates a reference genome from either an assembly or FASTQ files?
3. If point 1 is not correct: could you please help me work out what I need to do to generate a reference sequence from paired-end FASTQ files?

Thanks

reference paired-end-reads consensus genome • 1.6k views

ADD COMMENT • link updated 2.9 years ago by shenwei356 8.7k • written 3.0 years ago by eli_bayat ▴ 90

score 2 · Answer 1 · 2021-12-10

2

3.0 years ago

shenwei356 8.7k

You can use SPAdes for de novo assembly. The result is contigs.fasta, from which you can keep the long sequences to use as the reference for mapping.

I'd recommend sequencing with PacBio or Nanopore to get a complete genome for better downstream analysis.

If you just want to find some mutations, you can map both the control and the experimental samples to a closely related public reference, call mutations with breseq, and compare the results between the control and experimental samples.

ADD COMMENT • link 3.0 years ago by shenwei356 8.7k

0

Thank you for the suggestion. I ran SPAdes on the paired-end reads and got a contigs FASTA file. The generic reference genome is ~2 Mbp and my largest contig is 541,750 bp. Should I select largest contigs until I hit 2Mbp and call it good?

PacBio sequencing is a good approach, but we don't have access to this technology :)

ADD REPLY • link 3.0 years ago by eli_bayat ▴ 90

0

Should I select largest contigs until I hit 2Mbp and call it good?

No, you need more. I'd keep all contigs longer than 1 kb for mapping.

And what's the next step after mapping in your plan?
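The length filter suggested here can be sketched in a few lines of Python. This is only a minimal illustration, not part of SPAdes: the helper names `read_fasta` and `filter_contigs` are made up for this example, and the 1000 bp default simply mirrors the "longer than 1 kb" rule of thumb.

```python
from typing import Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_contigs(lines: Iterable[str], min_length: int = 1000) -> List[Tuple[str, str]]:
    """Keep only contigs strictly longer than min_length bases."""
    return [(h, s) for h, s in read_fasta(lines) if len(s) > min_length]


# Tiny in-memory demo (hypothetical contigs, small threshold for readability).
demo = [">a", "ACGT", "ACGT", ">b", "AC", ">c", "ACGTAC"]
for name, seq in filter_contigs(demo, min_length=5):
    print(name, len(seq))
```

On a real assembly, the same function could be fed an open file handle, for example `with open("contigs.fasta") as fh: kept = filter_contigs(fh)`, and the kept records written back out as FASTA.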

ADD REPLY • link 3.0 years ago by shenwei356 8.7k

0

Okay, thanks so much. The plan is to map the samples to this reference and find mutations. Is there anything else that you think could help the analysis? :) I appreciate your help.

ADD REPLY • link 2.9 years ago by eli_bayat ▴ 90

1

If you just want to find some mutations, you can map both the control and the experimental samples to a closely related public reference, call mutations with breseq, and compare the results between the control and experimental samples.

A control sample is very important for the experiment. It should be either the original frozen strain or the group that was not exposed to the interfering conditions during the experiment.

To find the closest reference, you can compare your contigs against RefSeq/GenBank sequences using BLAST (less recommended, because it's a local alignment) or Mash/Sourmash (whole-genome distance).
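To make "whole-genome distance" concrete, here is a rough Python sketch of the idea behind Mash: compare the k-mer content of two sequences via the Jaccard index and convert it to a distance with the Mash formula D = -(1/k) ln(2j / (1 + j)). It is a simplification under stated assumptions: it uses exact k-mer sets rather than MinHash sketches, ignores reverse complements, and the function names and toy sequences are made up for illustration. For real genomes, use Mash or Sourmash themselves.

```python
import math
from typing import Dict, Set, Tuple


def kmers(seq: str, k: int) -> Set[str]:
    """Return the set of all k-mers in seq (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index: shared k-mers divided by all distinct k-mers."""
    union = a | b
    if not union:
        raise ValueError("both sequences are shorter than k")
    return len(a & b) / len(union)


def mash_distance(seq_a: str, seq_b: str, k: int = 21) -> float:
    """Mash-style distance computed from an exact (not sketched) Jaccard index."""
    j = jaccard(kmers(seq_a, k), kmers(seq_b, k))
    if j == 0.0:
        return 1.0  # no shared k-mers: maximum distance
    if j == 1.0:
        return 0.0  # identical k-mer content
    return min(1.0, -1.0 / k * math.log(2.0 * j / (1.0 + j)))


def closest_reference(query: str, references: Dict[str, str], k: int = 21) -> Tuple[str, float]:
    """Return (name, distance) of the reference with the smallest distance."""
    scored = ((name, mash_distance(query, seq, k)) for name, seq in references.items())
    return min(scored, key=lambda item: item[1])


# Toy example with hypothetical sequences and a small k.
refs = {"near": "ACGTACGTTC", "far": "GGGGGGGGGG"}
print(closest_reference("ACGTACGTAC", refs, k=4))
```

The reference with the smallest distance is the best candidate to map against. With exact k-mer sets this approach does not scale to whole databases, which is the problem MinHash sketching addresses.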

There will probably be many differences between your sample and the closest reference. Don't worry: I've written a tool, breseq-rm-bg, to remove background (control) mutations from the breseq results.
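Conceptually, removing background mutations is a set difference: any mutation that also appears in the control is assumed to reflect the distance between your strain and the public reference, not the experiment. The Python sketch below illustrates only that idea. It is not the implementation of breseq-rm-bg, and the `Mutation` record, its fields, and the example values are hypothetical; the real tool's input format and matching rules may differ.

```python
from typing import Iterable, List, NamedTuple


class Mutation(NamedTuple):
    """Hypothetical minimal description of one called mutation."""
    seq_id: str    # contig or chromosome name
    position: int  # 1-based position on that sequence
    change: str    # e.g. "A->G" or "+T"


def remove_background(experiment: Iterable[Mutation],
                      control: Iterable[Mutation]) -> List[Mutation]:
    """Return experiment mutations that are absent from the control."""
    background = set(control)
    return [m for m in experiment if m not in background]


# Made-up calls for one experimental sample and its control.
experiment_calls = [
    Mutation("contig_1", 100, "A->G"),   # also in the control: background
    Mutation("contig_1", 2500, "C->T"),
    Mutation("contig_2", 42, "+T"),
]
control_calls = [
    Mutation("contig_1", 100, "A->G"),
    Mutation("contig_3", 7, "G->A"),
]

for m in remove_background(experiment_calls, control_calls):
    print(m.seq_id, m.position, m.change)
```

Exact matching on sequence, position, and change is the simplest possible rule. It only works if both samples were called against the same reference.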

ADD REPLY • link 2.9 years ago by shenwei356 8.7k