I have two questions:
1) Can you reliably and robustly predict the absence of a gene (either missing entirely or being non-functional) from an organism simply by not finding it in an assembly based on whole genome sequencing from a bacterial cell culture taken from clinical samples?
2) If no, is the lack of a gene in an assembly still some sufficient degree of evidence for absence in the original biological context? Would you bet on the gene being absent in the organism if you did not find it in an assembly even if you knew it could not be robustly scientifically inferred?
I suspect the answer to these questions is no, because:
The sampling could have gone wrong (e.g. you sampled one clone from an infection that contains multiple clones, and this particular clone happens to lack the gene while the others carry it).
DNA extraction could have gone wrong, so even though the gene exists in the organism, it might not end up in the DNA that gets successfully extracted.
The kit used for converting the DNA into a form that can be sequenced on a specific sequencing platform might have been less than theoretically perfect.
The library might have been of low complexity.
The gene might be more difficult to sequence than other genes due to sequence biases.
The reads from that gene might be of too low quality and get filtered out in the quality filtering step.
The gene might have features that make it difficult to assemble, or exist in multiple copies so that the assembly collapses them, and the specific variant one is looking for might not be detected.
Due to the specific idiosyncrasies of the assembler, the gene might have been split among many contigs.
The algorithm used to detect the gene from the assembly might have limitations.
The database you were using might not even contain the gene you were looking for.
...or any number of other biological or bioinformatics reasons.
In other words, there are so many things that could theoretically have gone wrong that it is unwise to claim that the gene is not in the organism just because it is not in an assembly.
Is this largely accurate? Would you consider it obviously flawed to conclude absence of a gene in the organism from the mere observation that it is not found in an assembly?
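One way to frame the "would you bet on it" part of question 2 is as a Bayesian update. The sketch below is only an illustration: `prior_present`, `sensitivity` and `specificity` are hypothetical inputs that you would have to estimate for your own species, gene and workflow, and the numbers in the usage note are made up.

```python
def posterior_absent(prior_present, sensitivity, specificity=1.0):
    """P(gene absent in organism | gene not found in assembly), via Bayes' rule.

    prior_present: assumed probability that the organism carries the gene.
    sensitivity:   assumed probability that the whole workflow (sampling to
                   database search) finds the gene when it is present.
    specificity:   assumed probability that the workflow reports nothing
                   when the gene really is absent.
    """
    for name, value in (("prior_present", prior_present),
                        ("sensitivity", sensitivity),
                        ("specificity", specificity)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1")

    not_found_and_absent = specificity * (1.0 - prior_present)
    not_found_and_present = (1.0 - sensitivity) * prior_present
    total = not_found_and_absent + not_found_and_present
    if total == 0.0:
        raise ValueError("a 'not found' result is impossible with these inputs")
    return not_found_and_absent / total
```

With an assumed 50% prior and an assumed 95% end-to-end sensitivity, `posterior_absent(0.5, 0.95)` is about 0.95. If the gene is carried by 90% of isolates, `posterior_absent(0.9, 0.95)` drops to about 0.69. Each failure mode in the list above lowers the sensitivity and therefore the strength of the bet.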
IMO if you have at least 20X coverage for your contigs and the assembly is of decent quality (~100 contigs or fewer, N50 > 50k) and blastn returns nothing, then the conclusion is that the sequenced organism does not have that specific gene.
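The rule of thumb above can be written as a small check. This is a minimal sketch: the function names are made up for illustration, the thresholds are simply the ones quoted in the post, and the contig lengths and mean coverage are assumed to come from your own assembly statistics.

```python
def n50(contig_lengths):
    """Return the N50: the contig length at which the running total of
    contigs, sorted longest first, reaches half of the assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("no contigs given")
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half:
            return length


def passes_rule_of_thumb(contig_lengths, mean_coverage,
                         min_coverage=20, max_contigs=100, min_n50=50_000):
    """True if the assembly meets the coverage, contig-count and N50 thresholds."""
    return (mean_coverage >= min_coverage
            and len(contig_lengths) <= max_contigs
            and n50(contig_lengths) > min_n50)
```

For example, `passes_rule_of_thumb([2_000_000, 1_500_000, 500_000], mean_coverage=35)` returns `True`. The check only describes contiguity and depth of what was assembled; it does not measure how much of the genome is represented.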
This is disproved by cases where you have e.g. a low-complexity library. It is entirely possible to have high coverage, few contigs and a high N50, but still be missing a considerable part of the genome.
Just switching assembler could also change the gene content by hundreds of protein-encoding genes, according to this (admittedly somewhat old) paper: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0021400
I don't know about eukaryote genomes, but I disagree that you can have a high-coverage, few-contig, high-N50 prokaryote genome assembly (you specified bacteria in the OP) and still miss a considerable part of the genome. "Hundreds of genes" is less than 1% of all genes in the context of your typical mammalian genome.
The scenario that I outlined was that, due to a low-complexity library, you have sequenced only (let us say for the sake of argument) half of the genome, over and over, at approximately twice the depth. Your N50 would be large (good coverage), you would have few contigs, and coverage would be high.
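The arithmetic behind this scenario can be sketched as follows. The function names and all numbers are hypothetical, and `expected_genome_size` is an assumption that may be shaky when genome size varies a lot within the species.

```python
def depth_on_represented_fraction(total_sequenced_bases, genome_size,
                                  represented_fraction):
    """Mean depth over the part of the genome that made it into the library."""
    if not 0 < represented_fraction <= 1:
        raise ValueError("represented_fraction must be in (0, 1]")
    return total_sequenced_bases / (genome_size * represented_fraction)


def assembly_size_ratio(contig_lengths, expected_genome_size):
    """Total assembly length divided by the expected genome size."""
    if expected_genome_size <= 0:
        raise ValueError("expected_genome_size must be positive")
    return sum(contig_lengths) / expected_genome_size


# Hypothetical 5 Mb genome with 100 Mb of sequenced bases.
full = depth_on_represented_fraction(100_000_000, 5_000_000, 1.0)   # 20.0
half = depth_on_represented_fraction(100_000_000, 5_000_000, 0.5)   # 40.0

# Ten contigs of 250 kb look tidy, but cover only half the expected size.
ratio = assembly_size_ratio([250_000] * 10, 5_000_000)              # 0.5
```

Coverage, contig count and N50 would all look fine in the second case; only the comparison against an expected genome size hints that something is missing.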
That was only an example of the impact of assembler choice. Hundreds of genes might be small as a proportion of all total genes, but it could impact analyses.
I haven't done any wetlab stuff for a very long time, but IMO, starting from a cell culture, it would take extraordinary skill to somehow manage to extract DNA covering just 50% of some prokaryotic genome.
Dear bioinfo2345, you're stretching all theoretical possible problems to their very extreme.
In the extreme case, there is truth in your statement. I wouldn't use "unwise", but there is no absolute guarantee that the absence of a gene is real. There almost never is, just as the absence of a PCR band doesn't prove the absence of the amplification target.
BTW, if you know the sequence of your gene of interest (which you must, otherwise you couldn't make assumptions about its presence in the first place), all you need is a PCR. That is, if you don't have the extraordinary skills 5heikki mentioned. As mentioned above, no band doesn't prove absence, but it adds evidence.
Finally, much depends on the effort you put into it. A lousy DNA prep is the foundation of a lousy assembly, and a sloppily designed primer set decreases the chances of a successful PCR. On the other hand, you can do targeted sequencing to focus on your gene, up to the point of primer walking ;-)
There's no guarantee, but there are so many options.
PCR can handle the absence of evidence problem by also running the sample with other primers in such a way that you will get another predictable product if the gene is missing.
There are many cases where you would like to identify thousands of genes per sample in a rapid and high-throughput way. As you can probably imagine, saying "all you need is a PCR" is not adequate here.
I am sure you are aware that there are many experimental papers out there that are based on errors in the lab of various kinds and there are many steps that can go wrong or be incompletely carried out.
Another thing you can do is to pick the reference genome that is closest to your sequenced genome. Then you map the reads from your genome onto the reference genome and see what is missing (other than your gene of interest, that is).
It is a good idea, but unfortunately genome size varies substantially within the species of interest. It is also still an evidence of absence argument.
Do you think synteny could be robustly used to estimate gene absence? Assuming that the order of genes is:
gene A - gene B - gene C
then is finding a contig with gene A and gene C, but no contig with all three sufficient to argue for gene loss?
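To make the question concrete, here is a minimal sketch of such a synteny check. The data layout is hypothetical: `contigs` maps a contig name to the ordered list of annotated gene names on it, and each gene name is assumed to occur at most once per contig. The returned labels are illustrative, not an established classification.

```python
def synteny_evidence(contigs, left, target, right):
    """Classify the evidence for loss of `target`, expected between
    `left` and `right`, from per-contig ordered gene lists."""
    # The target annotated anywhere in the assembly rules out "not found".
    if any(target in genes for genes in contigs.values()):
        return "target found"

    for genes in contigs.values():
        if left in genes and right in genes:
            if abs(genes.index(left) - genes.index(right)) == 1:
                # Both neighbours assembled side by side with nothing between.
                return "neighbours adjacent, target missing"
            return "neighbours on one contig, not adjacent"

    # A contig break between the neighbours could be hiding the target.
    return "neighbours on different contigs, uninformative"
```

For example, `synteny_evidence({"contig1": ["geneX", "geneA", "geneC", "geneY"]}, "geneA", "geneB", "geneC")` returns `"neighbours adjacent, target missing"`, whereas gene A and gene C on separate contigs returns the uninformative label. Even the adjacent case depends on the annotation step being able to recognise the target wherever it occurs.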
Extract your gene of interest +/- 10k bp region from a large number of reference genomes of your species. Do they align nicely? If yes, that region is conserved across your species. Now map your reads against this region. In case you see something like "|||||||||||||.....|||||||||||||||" (with pipes indicating good coverage and dots indicating absence of coverage), the conclusion is that your sequenced organism includes the conserved region, but it is missing a part of it. Then you have the 0.00001% chance that your organism has this region but it has been moved into another part of the genome and somehow magically in DNA lib preparation/sequencing/whatever this particular region was left out. You're always going to have uncertainty.
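The pipes-and-dots pattern can be detected automatically from per-base depth. The sketch below assumes you already have a list of depths for the conserved region, one value per position (for instance parsed from `samtools depth -a` output). The function name and the `min_depth` default are made up for illustration.

```python
def internal_coverage_gaps(depths, min_depth=1):
    """Return half-open (start, end) intervals where depth < min_depth
    and covered positions exist on both sides of the interval."""
    gaps = []
    start = None
    for pos, depth in enumerate(depths):
        if depth < min_depth:
            if start is None:
                start = pos          # a low-coverage run begins here
        else:
            # The run ends; keep it only if it did not begin at position 0,
            # i.e. it has covered sequence on its left as well.
            if start is not None and start > 0:
                gaps.append((start, pos))
            start = None
    # A run that reaches the end of the region is never closed, so it is
    # not reported: it has no covered flank on the right.
    return gaps
```

With `depths = [30] * 13 + [0] * 5 + [30] * 15`, the function returns `[(13, 18)]`: a gap with well-covered flanks on both sides, which is the situation described above. Runs touching either end of the region are ignored because they lack a covered flank.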