I am trying to extract a gene marker, say COII (cytochrome oxidase subunit II; e.g., NCBI: JQ319797), from a sequenced but unassembled insect genome, Musca domestica (house fly; NCBI: AQPM00000000). I downloaded all the scaffolds, created a local BLAST database, and ran blastn with a COII homolog as the query against this database. The result includes multiple hits, each of which looks like a real COII, but they differ slightly from one another in sequence. I thought one genome should contain only one version of a gene, but the result seems to be a mixture. Is this an artifact of the incompleteness of the genome sequencing project, did I do something wrong, or, since it is a mitochondrial gene, are there supposed to be different versions within one insect? And how should I correctly extract this gene marker from the genome? Thank you!
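For anyone reproducing the search described above, here is a minimal sketch of how the hits could be tabulated per scaffold so the variants can be compared side by side. It assumes blastn was run with tabular output (`-outfmt 6`, default columns). The file names in the comments, the `min_length` cutoff, and the example rows are made up for illustration and are not results from the actual Musca domestica search.

```python
# Hypothetical commands that would produce the tabular input (file names are assumptions):
#   makeblastdb -in scaffolds.fasta -dbtype nucl -out musca_db
#   blastn -query coii.fasta -db musca_db -outfmt 6 -out hits.tsv

# Default column order of BLAST tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]
INT_FIELDS = ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send")
FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def parse_blast_tabular(lines):
    """Yield one dict per hit from BLAST -outfmt 6 lines."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, line.split("\t")))
        for key in INT_FIELDS:
            hit[key] = int(hit[key])
        for key in FLOAT_FIELDS:
            hit[key] = float(hit[key])
        yield hit


def best_hit_per_scaffold(hits, min_length=300):
    """Keep the highest-scoring hit per scaffold, ignoring short alignments.

    min_length is an arbitrary illustrative cutoff, not a recommended value.
    """
    best = {}
    for hit in hits:
        if hit["length"] < min_length:
            continue
        current = best.get(hit["sseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["sseqid"]] = hit
    return sorted(best.values(), key=lambda h: h["bitscore"], reverse=True)


# Made-up example rows, for illustration only.
EXAMPLE = [
    "JQ319797\tscaffold_A\t99.12\t680\t6\t0\t1\t680\t1500\t2179\t0.0\t1220",
    "JQ319797\tscaffold_B\t97.50\t680\t15\t2\t1\t680\t300\t977\t0.0\t1120",
    "JQ319797\tscaffold_B\t90.00\t120\t12\t0\t1\t120\t5000\t5119\t1e-30\t150",
]

if __name__ == "__main__":
    for hit in best_hit_per_scaffold(parse_blast_tabular(EXAMPLE)):
        print(hit["sseqid"], hit["pident"], hit["length"],
              hit["mismatch"], hit["gapopen"])
```

Listing identity, mismatches, and gap openings per scaffold makes it easier to see how many distinct COII-like copies there are and how far each one is from the query.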
Were the read counts for the different variants comparable? Also, I would suggest changing the title of the question.
Yes, they are comparable. They align well, with several to a dozen polymorphisms and sometimes short gaps. What would you suggest for the new title?
First, resolve the mitochondrial reads, then try to assemble them and annotate the resulting mitochondrial genome with a mitochondrial annotation server such as MITOS or DOGMA; it will annotate the COII gene.
Or just design primers specific to this gene for insects and sequence the gene.
Thanks for your idea! However, I also need some nuclear genomes. And the genomic data does not indicate which reads are from mitochondria. Doing sequencing (you mean experimentally, right?) is not an option for me now, as we don't have those insect samples.
Simply map your reads against the closest reference mitochondrial genome; this is how you can resolve the mitochondrial reads.
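A rough sketch of the mapping suggestion above, assuming the reads were aligned with a short-read mapper that writes SAM output. The comment does not name a mapper, so the choice of bwa and all file names in the comments are assumptions. The Python part only shows the filtering step: it keeps the names of reads whose SAM flag does not have the "unmapped" bit (0x4) set, which are the candidate mitochondrial reads. The example SAM lines are made up.

```python
# Hypothetical commands that would produce the SAM input (tool choice and file names are assumptions):
#   bwa index mito_ref.fasta
#   bwa mem mito_ref.fasta reads_1.fastq reads_2.fastq > mapped.sam
# A comparable filter with samtools would be:
#   samtools view -b -F 4 mapped.sam > mito_reads.bam

UNMAPPED_FLAG = 0x4  # SAM flag bit meaning "segment unmapped"


def mapped_read_names(sam_lines):
    """Return the names of reads that aligned to the reference."""
    names = set()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        if not flag & UNMAPPED_FLAG:
            names.add(fields[0])
    return names


# Made-up SAM records, for illustration only.
EXAMPLE_SAM = [
    "@SQ\tSN:mito_ref\tLN:16000",
    "read1\t0\tmito_ref\t100\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "read2\t4\t*\t0\t0\t*\t*\t0\t0\tTTTTGGGG\tIIIIIIII",
    "read3\t16\tmito_ref\t250\t60\t8M\t*\t0\t0\tGGCCAATT\tIIIIIIII",
]

if __name__ == "__main__":
    print(sorted(mapped_read_names(EXAMPLE_SAM)))
```

The retained reads could then be passed to an assembler and the assembly to an annotation server, as suggested earlier in the thread.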