9.9 years ago
chrisclarkson100
I am reading a paper that describes the identification of "single copy genes" in plant species.
I'm trying to understand why the below described process is useful:
To establish a useful criterion for declaring a gene as single copy, each of the five data sets was blasted against itself using BLASTN
If the gene has been duplicated, there will naturally be two copies of it, so how does blasting it against itself tell you that there are two or more copies?
When referring to a paper, you should provide a link or PMID to the article.
Because if you have a single copy of a gene, you won't find more than one hit against it. Basically, if there's only one gene X in a genome and I BLAST gene X against that genome, I would expect to find only one hit for it. If there are two copies of X, you'd expect to find two BLAST hits for it, and so on.
If I have a bag of colored balls and there's only one green ball, when I look in the bag for green balls I should only find one.
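To make the counting idea concrete, here is a minimal sketch. It assumes the BLAST hits have already been reduced to (gene, hit location) pairs; the function name `classify_by_hit_count` and the toy data are made up for illustration and do not come from the paper.

```python
from collections import Counter

def classify_by_hit_count(hits):
    """Label each query gene by how many genomic hits it produced.

    `hits` is an iterable of (gene, location) pairs, one pair per hit
    of that gene against the genome (a hypothetical, simplified input).
    """
    counts = Counter(gene for gene, _location in hits)
    return {
        gene: "single copy" if n == 1 else "multi-copy"
        for gene, n in counts.items()
    }

# Toy data: gene X hits one locus, gene Y hits two loci.
example_hits = [("X", "chr1:1000"), ("Y", "chr2:500"), ("Y", "chr4:7200")]
print(classify_by_hit_count(example_hits))
```

Real BLAST output would need filtering first (e-value, alignment length), so treat the one-hit rule here as the idealized case.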
Please note this refers to the reference genome, which is a theoretical construct, not the real genome of any particular cell. In humans, a 'single copy' gene will probably have two copies if it is on an autosome, or one or two if it is on a sex chromosome. And many genes will be multicopy in the individual regardless of the reference genome. The article chrisclarkson20 found is talking about artificial maps.
Yes of course, this would be the 'monoploid' genome (which may or may not be the biological reality), however I was always under the impression that this is what the term "genome" meant.
http://ghr.nlm.nih.gov/handbook/hgp/genome:
I guess it depends on how you interpret "set". In a strict sense it means (imo) that the additional copies contributed by polyploidy (n>1) wouldn't be included. However, duplicates of a gene within a chromosome would be included, since they're distinct genetic elements.
For what it's worth wikipedia has this to say:
Although, as long as the ploidy is both known and known to be consistent for the cells/tissue, you should be able to apply the same method by dividing the number of hits by the ploidy number. E.g. if you have diploid data, a gene with two copies per chromosome set should give 4 BLAST hits.
Using the example above, it is similar to having two bags filled with the same mix of balls. If there's one green ball per bag, you expect to find two in total, but still only one per bag. If you have two red balls per bag, you'd expect to find four after looking in the two bags, but that still means there are two per bag.
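The arithmetic behind the two-bags picture can be written out as a small sketch. It assumes the ploidy is known and uniform and that every copy yields exactly one hit; `copies_per_chromosome_set` is a hypothetical helper name.

```python
def copies_per_chromosome_set(total_hits, ploidy):
    """Divide the total hit count by the ploidy (number of 'bags')."""
    if ploidy < 1:
        raise ValueError("ploidy must be a positive integer")
    return total_hits / ploidy

# Diploid data (two bags):
print(copies_per_chromosome_set(2, 2))  # one green ball per bag
print(copies_per_chromosome_set(4, 2))  # two red balls per bag
```

A non-integer result would suggest noise or inconsistent ploidy in the source material rather than a real fractional copy number.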
Barring noise and issues that may come up from isoforms (which could confuse the BLAST process), you should be able to use this approach with just about any sequencing data from any kind of organism as long as you know the ploidy of your source material and know that it is consistent enough to average out noise from any weirdness.
I also realize that many classes of genes (e.g. transcription factors) will have more than one copy and indeed it seems in plants to be a very large portion. The point of the post was to try and demonstrate the general concept, not to get into details of the biology that can vary widely depending on the species in question.
I believe this is the paper you are referring to. I suspect you are misunderstanding the "itself" part. It refers to the data set, not the individual gene. An all-against-all BLASTN search was performed for all the genes in a data set (e.g. Arabidopsis) with an e-value threshold of 1e-10. Genes that do not have any BLAST hits (other than the trivial hit to themselves) are considered single-copy genes.
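As a rough illustration of that filtering step, the sketch below reads all-against-all BLASTN results in the standard 12-column tabular format (`-outfmt 6`, where column 1 is the query ID, column 2 the subject ID, and column 11 the e-value) and reports genes with no non-self hit at or below the threshold. The function name, the toy gene IDs, and the choice to flag both query and subject are my assumptions, not details taken from the paper.

```python
import csv

def single_copy_genes(all_gene_ids, blast_tab_lines, max_evalue=1e-10):
    """Return the genes with no significant hit to any *other* gene.

    `blast_tab_lines` is an iterable of tab-separated lines in BLAST
    tabular format (outfmt 6); self-hits are ignored.
    """
    has_similar_gene = set()
    for row in csv.reader(blast_tab_lines, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        if query != subject and evalue <= max_evalue:
            # Assumption: a significant hit marks both genes as non-single-copy.
            has_similar_gene.update((query, subject))
    return sorted(set(all_gene_ids) - has_similar_gene)

# Toy results: geneA and geneB match each other strongly; geneC only
# has a self-hit and one weak hit that fails the e-value threshold.
toy_lines = [
    "geneA\tgeneA\t100.0\t500\t0\t0\t1\t500\t1\t500\t0.0\t900",
    "geneA\tgeneB\t92.0\t480\t38\t0\t1\t480\t1\t480\t1e-150\t600",
    "geneB\tgeneB\t100.0\t480\t0\t0\t1\t480\t1\t480\t0.0\t880",
    "geneC\tgeneC\t100.0\t300\t0\t0\t1\t300\t1\t300\t1e-160\t550",
    "geneC\tgeneA\t80.0\t30\t6\t0\t1\t30\t1\t30\t0.5\t20",
]
print(single_copy_genes(["geneA", "geneB", "geneC"], toy_lines))
```

The full list of gene IDs is passed in separately because a gene with no reported hits at all would otherwise never appear in the output file.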
Yes, of course, sorry, that is indeed the paper. Much better, thank you.