Is it a problem that the reference genome is not at the chromosome level?
5.3 years ago
beausoleilmo ▴ 600

I'm studying a species whose reference genome is assembled only at the scaffold level ("unplaced scaffolds"). See here: https://www.ncbi.nlm.nih.gov/genome/annotation_euk/Geospiza_fortis/101/.

My questions are:

  • Do people generally treat a scaffold-level reference genome as if each scaffold were a chromosome?
  • Should a scaffold vs chromosome level reference genome be treated differently?
  • What are the main challenges for using a reference genome that is only at the scaffold level?

Basically, I often read in population genetics textbooks that we have to study "chromosomes". But I have a hard time imagining how the theory applies differently when only scaffolds are available.

Chromosome reference genome scaffolds • 2.1k views
5.3 years ago
Brice Sarver ★ 3.8k

All bioinformatic applications, including mapping (to a FASTA), will proceed as usual. For annotation, you'll also be fine as long as the contig/scaffold names in the FASTA match those in the annotation file you're using. This isn't a major problem: the human reference, for example, has a number of unlocalized scaffolds and patch scaffolds that are relevant for annotation but aren't in the set of more thoroughly characterized chromosomes. Most applications treat contigs, scaffolds, and chromosomes the same way.
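The name-matching condition is easy to check before running anything else. Here is a minimal sketch, assuming a plain FASTA and a tab-separated GFF/GTF whose first column holds the sequence name; the function names and the toy records are made up for illustration.

```python
def fasta_names(fasta_lines):
    """Return the set of sequence names (first word after '>') in a FASTA."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].split()
    }


def gff_seqids(gff_lines):
    """Return the set of seqids (column 1) in a GFF/GTF, skipping comments."""
    return {
        line.split("\t", 1)[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }


def missing_from_fasta(fasta_lines, gff_lines):
    """Names used by the annotation that have no matching FASTA record."""
    return sorted(gff_seqids(gff_lines) - fasta_names(fasta_lines))


# Toy, made-up inputs; with real files, pass open(path) instead of a list.
fasta = [">scaffold_1 some description", "ACGT", ">scaffold_2", "GGCC"]
gff = [
    "##gff-version 3",
    "scaffold_1\tsrc\tgene\t1\t4\t.\t+\t.\tID=g1",
    "scaffold_3\tsrc\tgene\t1\t4\t.\t+\t.\tID=g2",
]
print(missing_from_fasta(fasta, gff))  # annotation names absent from the FASTA
```

An empty list means every annotated feature sits on a sequence that exists in the reference; anything else usually points to a naming mismatch (e.g., accession vs. scaffold ID) rather than a real assembly problem.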

So, to answer your questions:

  1. More-or-less.
  2. Not really, but realize that your analysis may be affected if not all of the scaffolds can be localized to true chromosomes (e.g., are two scaffolds in LD because they're next to each other in reality?).
  3. Outlined above, but they're generally captured under issues resulting from 'assembly uncertainty' sensu lato.
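
To make the LD point in item 2 concrete, here is a rough sketch of the kind of check one might run: the squared correlation (r²) between genotype dosages (0/1/2 copies of the alternate allele) at two SNPs that sit on different scaffolds. This is the usual approximation for unphased data, not a full haplotype-based estimate, and the function name and genotype vectors are invented for illustration. A high value between scaffolds is only a hint of physical proximity, since population structure can produce the same signal.

```python
def genotype_r2(x, y):
    """Squared Pearson correlation between two genotype dosage vectors."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("genotype vectors must be non-empty and equal length")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A monomorphic site carries no LD information.
        return float("nan")
    return cov * cov / (var_x * var_y)


# Made-up dosages for the same six individuals at SNPs on two scaffolds.
snp_on_scaffold_a = [0, 1, 2, 0, 1, 2]
snp_on_scaffold_b = [0, 1, 2, 0, 1, 1]
print(round(genotype_r2(snp_on_scaffold_a, snp_on_scaffold_b), 3))
```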

Thanks for the answer! I guess that one trick that could be used is LiftOver to try to match the scaffolds to chromosomes of a closely related species (like the more detailed reference genome of the Zebra finch). But probably, LiftOver comes with its own challenges.

You could also attempt to localize scaffolds against a closely related species with a more complete reference using BLAT, if LiftOver (or CrossMap or your favorite alternative) won't work because there are no good whole-genome alignments from which to create a chain file. This will work best if the scaffolds are short; otherwise you may need to attempt a whole-genome alignment anyway. From a (very) quick peek at UCSC, Geospiza doesn't appear to be available as a source species.
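
If you go the BLAT route, its default PSL output can be summarized per scaffold. The sketch below is a crude heuristic rather than an established method: it sums the `matches` column (column 1) for each query/target pair (columns 10 and 14 of the PSL format) and reports the target with the largest total. The function name and the toy records are hypothetical, and overlapping hits are counted more than once.

```python
from collections import defaultdict


def best_target_per_query(psl_lines):
    """Map each query name to (target name, summed matching bases)."""
    totals = defaultdict(lambda: defaultdict(int))
    for line in psl_lines:
        fields = line.rstrip("\n").split("\t")
        # Skip the optional PSL header and anything that is not a 21-column record.
        if len(fields) < 21 or not fields[0].isdigit():
            continue
        matches = int(fields[0])                  # column 1: matching bases
        q_name, t_name = fields[9], fields[13]    # columns 10 and 14
        totals[q_name][t_name] += matches
    return {
        q: max(per_target.items(), key=lambda kv: kv[1])
        for q, per_target in totals.items()
    }


def toy_psl(matches, q_name, t_name):
    """Build a made-up 21-column PSL record for the example below."""
    return "\t".join(
        [str(matches)] + ["0"] * 7
        + ["+", q_name, "1000", "0", "1000", t_name, "5000", "0", "1000",
           "1", "1000,", "0,", "0,"]
    )


hits = [
    "psLayout version 3",
    toy_psl(900, "scaffold_1", "chr1"),
    toy_psl(100, "scaffold_1", "chr2"),
    toy_psl(50, "scaffold_1", "chr1"),
    toy_psl(400, "scaffold_2", "chrZ"),
]
print(best_target_per_query(hits))
```

Scaffolds whose best and second-best targets have similar totals are the ones worth inspecting by hand or with a proper whole-genome alignment.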

A quick search reveals that these two species have a median divergence of 30 million years, so you're going to expect quite a few differences.

Thanks!! Cool! When you look here https://genome.ucsc.edu/cgi-bin/hgLiftOver, you have to look for "medium ground finch". I don't know why they kept the common English name, but that's how they recorded it...

May I ask how you found the divergence time so quickly, and roughly what divergence time would still make it worthwhile to try mapping the scaffolds onto chromosomes?

Thanks! Didn't see that.

This site provides quick estimates aggregated across a few studies. It generally works as a decent starting place if you don't have the data or computational power to do a full dating analysis, though you can always compare with your favorite paper.

What will map well is a function of the divergence of your sample relative to the reference. You can change mapping parameters from the defaults to let more mismatches through, but you're increasing the uncertainty in your results. You can get a sense of what will be tolerated if you have estimates of substitution rates for the class of loci you're looking at (e.g., intron, exon, UTR, etc.). This can be pretty tricky in practice, but you could try mapping your reads to the exome of a better-annotated species and see what comes out.
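
As a back-of-the-envelope illustration of the substitution-rate point, here is a small sketch. It assumes a linear approximation (pairwise divergence ≈ 2 × rate × time, ignoring multiple hits at the same site, so it overestimates at deep divergences). The 30-million-year figure is the one quoted earlier in this thread; the per-site rate and read length are placeholders, not estimates for these birds, so substitute values from the literature for your class of loci.

```python
def expected_pairwise_divergence(rate_per_site_per_year, divergence_time_years):
    """Approximate per-site divergence between two lineages (linear model)."""
    # Factor of 2: substitutions accumulate along both lineages since the split.
    return 2.0 * rate_per_site_per_year * divergence_time_years


def expected_mismatches(read_length, divergence):
    """Expected number of mismatching bases in a read of the given length."""
    return read_length * divergence


ASSUMED_RATE = 2e-9         # placeholder: substitutions per site per year
DIVERGENCE_TIME = 30e6      # years, the figure quoted in this thread
READ_LENGTH = 150           # placeholder: typical short-read length

d = expected_pairwise_divergence(ASSUMED_RATE, DIVERGENCE_TIME)
print(f"approx. divergence per site: {d:.3f}")
print(f"approx. mismatches per {READ_LENGTH} bp read: "
      f"{expected_mismatches(READ_LENGTH, d):.1f}")
```

Comparing that mismatch count with what your mapper tolerates by default gives a first idea of whether reads from that class of loci have any chance of aligning.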

For larger sequences with greater divergence, BLAT will be your friend. If your scaffolds are really large, you'll want to look into genome alignments to infer homology.

Amazing! Again, thanks a lot for your very nice answer! I appreciate you explained and went further to give me a new intuition on how to approach the problem!!
