Dear all,
This question might be seen as an extension of earlier questions (e.g. https://www.biostars.org/p/5801/).
I’m currently having trouble screening sequence data (both fully assembled genomes and “short” sequences, e.g. from the NCBI nucleotide database) for the presence of gene family members (in this case P450). The results I get are very disappointing and I am wondering what I might be doing wrong, which is why I am asking for your expertise.
The steps I take now are roughly:
1. Downloading sequence data from NCBI for specific clades.
2. Extracting potential protein-coding genes from the raw nucleotide sequences using getORF. This of course results in a lot of rubbish, and to minimize this I set “minimal size” to 450 bp.
3. Running pHMMs and subsequent blastP searches, based on a selection of P450 protein sequences from the Pfam database, to find potential P450 gene family members in the downloaded sequence data.
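To make step 2 concrete, here is a minimal Python sketch of what a six-frame ORF extraction with a minimum length does. It is an illustration only, not getORF's actual implementation: it assumes an ORF is any stretch between two stop codons in one reading frame, and the function names and the standard stop-codon set are my own choices.

```python
# Illustrative sketch only: NOT getORF's implementation.
# Assumption: an ORF is a stop-free stretch within a single reading frame.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_nt):
    """Yield (start, end) of stop-free stretches of at least min_nt bases."""
    start = frame
    for pos in range(frame, len(seq) - 2, 3):
        if seq[pos:pos + 3].upper() in STOPS:
            if pos - start >= min_nt:
                yield start, pos
            start = pos + 3  # next candidate begins after the stop codon
    # Stretch that runs to the end of the sequence without hitting a stop.
    end = frame + ((len(seq) - frame) // 3) * 3
    if end - start >= min_nt:
        yield start, end


def six_frame_orfs(seq, min_nt=450):
    """Return (strand, frame, orf_sequence) tuples for all six frames."""
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for a, b in orfs_in_frame(s, frame, min_nt):
                found.append((strand, frame, s[a:b]))
    return found
```

Note that any stretch shorter than `min_nt` is discarded, so with a 450 bp cutoff only uninterrupted open stretches of at least 150 codons are reported.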
As an ideal test case, I applied this method to the raw nucleotide GenBank genome assembly of Helicoverpa armigera (NCBI ID = GCA_002156985.1), for which the annotated proteins are known (e.g. for P450, https://www.ncbi.nlm.nih.gov/genome/proteins/13316?genome_assembly_id=319039). I downloaded all the genome scaffolds and used getORF to extract all ORFs (in theory this includes all the putative protein-coding genes). I then ran the HMM and blastP searches to extract the potential P450 protein-coding genes from this ORF dataset. This resulted in about 50 HMM hits. However, 116 P450 proteins are known from this genome! So my method does not seem to find the correct number of potential P450 genes.
Just to make sure that my HMM and blastP method was OK, I checked whether it would “hit” the P450 proteins in the already annotated protein dataset, and it did: HMM and blastP nicely found all P450 protein sequences in that dataset. But then why do I only find 50 potential P450 genes in the ORF dataset? Is getORF not the right method to extract potential protein-coding genes from raw nucleotide sequences (along with all the non-coding sequence that is extracted with them)?
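For anyone who wants to reproduce the comparison, this is roughly how the hit count can be tallied. It is a sketch under assumptions: it supposes the HMM search was run with HMMER's hmmsearch and its `--tblout` option (the tool is not named above), and the function name and the 1e-5 E-value cutoff are hypothetical choices for illustration.

```python
def unique_targets(tblout_lines, max_evalue=1e-5):
    """Collect unique target names from hmmsearch --tblout style lines.

    Assumes whitespace-separated columns where column 1 is the target
    name and column 5 is the full-sequence E-value; lines starting with
    '#' are comments. The default cutoff is a hypothetical example.
    """
    hits = set()
    for line in tblout_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if float(fields[4]) <= max_evalue:
            hits.add(fields[0])
    return hits
```

Running the same function on the ORF-based search and on the search against the annotated protein set gives two numbers that can be compared directly, counting each sequence only once even if it is hit more than once.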
Thanks in advance for all your help here!