How Do Researchers Choose A Reference Genome For A Novel Bacterial Strain Assembly?
12.6 years ago
tanjafiegel ▴ 40

Could someone please explain how to make a well-informed decision when choosing a reference genome for assembling a novel bacterial strain in the "real world" of bioinformatics?

Is it appropriate to assemble the raw sequence data into contigs, then run 'blastn' on one of the larger contigs to find a similar strain, and attempt a reference-guided assembly with that match?

Is it then informative to find the ORFs with Glimmer3, or will the assembled consensus sequence actually be uninformative because it will contain parts of the reference genome?

What about the unassembled contigs that are left over? What do people usually do with those? Chuck them in the recycling, or try to find some annotation for them?

Could I also ask whether people mostly run Glimmer3 on the finished consensus sequence or on the contigs assembled from the raw sequence reads?

Many thanks!

assembly • 5.0k views
12.6 years ago
Raquel Tobes ▴ 160

I think that, for bacterial genomes, de novo assembly is always better, since assembling against a reference genome inevitably biases the result toward that reference.

12.6 years ago
ALchEmiXt ★ 1.9k

What we usually (as in: not always) do is assemble the genome de novo, using various settings depending on the sequencing technology used (e.g. k-mer size for Illumina data).

  • If PE or mate-pair data are available, build scaffolds (de novo) from them.
  • If these are not available, we use the contigs by themselves:

    • We scaffold the contigs based on a closely related strain or species (which is dangerous, because the new genome could be organised differently). The related strain is either known from an expert biologist or can be identified by homology searching using a chronologically joined artificial chromosome. We BLAT or BLAST the contigs against the reference, or use MUMmer tiling; lay out the contigs in order and orientation; and simply add the non-mapped contigs at the end.
    • We then link the laid-out contigs using artificial linkers that clearly separate the contigs and also contain start and stop codons in all six frames.
  • Predict ORFs using genemarkHMMp and Prodigal, and compare the two sets of results to identify erroneous or missed calls.

  • If possible, confirm these CDSs using RNA-seq experiments.
  • Do further annotation and strain comparisons...
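
The layout-and-link steps above can be sketched roughly as follows in Python. This is only an illustration under assumptions: the `hits` dictionary (contig name to reference start and strand) stands in for whatever BLAT, BLAST or MUMmer reports, the function names are hypothetical, and the linker motif `TTAATTAATTAA` is an example picked here because it puts a TAA stop codon in all six frames. It does not add start codons and is not necessarily the linker described above.

```python
# Minimal sketch: order/orient contigs by reference hits, then join them
# with a linker that carries a stop codon in all six reading frames.
# All names and the linker motif are illustrative assumptions.

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

# N padding marks the gap; the motif is its own reverse complement,
# so both strands carry TAA in frames 0, 1 and 2.
LINKER = "NNNNN" + "TTAATTAATTAA" + "NNNNN"


def revcomp(seq):
    """Reverse-complement a DNA string (A, C, G, T, N only)."""
    return seq.translate(COMPLEMENT)[::-1]


def frames_with_stop(seq):
    """Return the set of (strand, frame) pairs that contain a stop codon."""
    found = set()
    for strand, s in (("+", seq), ("-", revcomp(seq))):
        for frame in range(3):
            codons = (s[i:i + 3] for i in range(frame, len(s) - 2, 3))
            if any(codon in STOPS for codon in codons):
                found.add((strand, frame))
    return found


def build_pseudochromosome(contigs, hits, linker=LINKER):
    """Lay out contigs along the reference and join them with the linker.

    contigs: dict of contig name -> sequence
    hits:    dict of contig name -> (reference start, strand "+" or "-")
             for contigs that mapped; unmapped contigs are simply absent.
    Returns (joined sequence, contig order used).
    """
    placed = sorted((name for name in contigs if name in hits),
                    key=lambda name: hits[name][0])
    unplaced = [name for name in contigs if name not in hits]

    pieces = []
    for name in placed:
        seq = contigs[name]
        # Flip contigs that matched the reverse strand of the reference.
        pieces.append(revcomp(seq) if hits[name][1] == "-" else seq)
    # Non-mapped contigs go at the end, in their original order.
    pieces.extend(contigs[name] for name in unplaced)

    return linker.join(pieces), placed + unplaced


if __name__ == "__main__":
    demo_contigs = {"a": "AAAC", "b": "GGGT", "c": "CCCC"}
    demo_hits = {"b": (10, "+"), "a": (500, "-")}
    sequence, order = build_pseudochromosome(demo_contigs, demo_hits)
    print(order)
    print(sequence)
```

Because the linker forces a stop in every frame, a gene caller run on the joined sequence should not extend an ORF across a contig boundary. The N padding also makes the junctions easy to spot afterwards.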

My 2ct.
