[Experimental Design] Is it possible to determine completeness of a genome if you were given raw reads?
9.1 years ago
Tom ▴ 20

So here's the dilemma. I have raw Illumina reads from a new, previously undescribed species of bacteria, and I'd like to assemble them into a draft genome. However, I don't know whether the sequencing run covered 100% of the genome. I suspect it may be only 98% complete, with gaps and artifacts where regions were missed or could not be sequenced. I want an exact number, because this 98% is only a rough guess. All I have, though, is a set of raw reads. I think they cover the genome at an average of 25x, which is good. But is it even computationally possible to determine the quality/completeness of an assembly based on just raw reads? How should I change my approach to this problem?

sequencing genome theory coverage • 2.1k views
9.1 years ago
iraun 6.2k

The 'completeness' of a genome is an abstract concept that is not easy to check.

On the one hand, you can assemble your reads with a de novo approach and extract general statistics to get a rough idea of the assembly (number of scaffolds, mean length, N50...). On the other hand, you can compare the size of your de novo assembled genome to the size of a phylogenetically close bacterial species that has a well-assembled genome and see whether they are similar. I'd also try mapping the assembled scaffolds against the genome of that close relative and calculating the percentage of its genome that is covered.
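As a rough illustration of the statistics mentioned above, here is a minimal Python sketch. It assumes you already have the scaffold lengths and, for the reference comparison, a list of aligned intervals on the related genome. The function names (`n50`, `covered_fraction`) and all the numbers are made up for the example; a real pipeline would take them from your assembler and aligner output.

```python
def n50(lengths):
    """Return the N50: the length L such that scaffolds of length >= L
    together contain at least half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def covered_fraction(intervals, genome_length):
    """Fraction of a reference covered by the union of half-open
    [start, end) intervals, counting overlapping alignments once."""
    covered = 0
    cur_start, cur_end = None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            # Close the previous merged block and open a new one.
            if cur_end is not None:
                covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start
    return covered / genome_length


# Hypothetical numbers, for illustration only.
scaffold_lengths = [400_000, 300_000, 200_000, 100_000]
reference_length = 1_050_000  # size of the assumed close relative

print("scaffolds:", len(scaffold_lengths))
print("mean length:", sum(scaffold_lengths) / len(scaffold_lengths))
print("N50:", n50(scaffold_lengths))
print("size ratio vs reference:", sum(scaffold_lengths) / reference_length)
print("reference covered:",
      covered_fraction([(0, 100), (50, 200), (300, 400)], 1000))
```

Keep in mind that a size ratio or covered percentage measured against a relative is only a proxy: real differences in gene content between the two species will look the same as missing sequence.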

This is what I'd do in your case... but there are certainly other things you could do and, as I said, the completeness of an assembly is not something that is easy to know. It is also important to consider 'genome mappability', which depends on each genome and affects the assembly.


What software would you recommend for visualizing the contigs/reads lined up against the assembled genome scaffold? I was thinking that if I had a visual like this: http://www.dartergenomics.org/tallapoosa-darter-genome, or this: http://gcat.davidson.edu/phast/img/coverage.png, it would help me see which regions are likely missing.
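Alongside a visual, the same idea can be checked numerically. The sketch below is only an illustration under some assumptions: the reads have already been mapped to the scaffolds (or to a related reference), and per-base depth is available as tab-separated `name, position, depth` lines with zero-depth positions included, which is the layout `samtools depth -a` writes. The function name `low_coverage_regions` and the sample lines are hypothetical.

```python
def low_coverage_regions(depth_lines, min_depth=1):
    """Return (name, start, end) runs, 1-based inclusive, where the
    reported depth is below min_depth. Assumes every position is
    listed, including those with zero depth."""
    regions = []
    run = None  # current run as [name, start, end]
    for line in depth_lines:
        line = line.strip()
        if not line:
            continue
        name, pos, depth = line.split("\t")
        pos, depth = int(pos), int(depth)
        if depth < min_depth:
            if run and run[0] == name and pos == run[2] + 1:
                run[2] = pos  # extend the current low-depth run
            else:
                if run:
                    regions.append(tuple(run))
                run = [name, pos, pos]  # start a new run
        elif run:
            regions.append(tuple(run))
            run = None
    if run:
        regions.append(tuple(run))
    return regions


# Made-up depth records, for illustration only.
example = [
    "scaffold1\t1\t10",
    "scaffold1\t2\t0",
    "scaffold1\t3\t0",
    "scaffold1\t4\t5",
    "scaffold2\t1\t0",
]
print(low_coverage_regions(example))
```

Note that depth over your own scaffolds only flags weakly supported stretches of what was assembled; sequence that never made it into the assembly cannot show up this way, which is where mapping against a related genome helps.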
