Hi,
Can anyone suggest a method for recovering the complete sequence of a protein that is reported only as a low-quality predicted sequence in the RefSeq or NCBI Protein database?
1) I have whole-genome data at more than 50X coverage. When I run a BLAST search (using the human ortholog as the query) against the SRA data, I get many hits, because my gene of interest has 4 other similar protein sequences with approximately 40% sequence identity.
2) The available assembly has missing residues in the exon regions.
My aim is to obtain the cDNA sequence so I can clone it and characterize the protein experimentally.
Thank you for your help.
Kumar
You can use something like backtranseq from EMBOSS. Here is a link to the web interface for the tool. You can also run it from the command line by installing EMBOSS.

Thanks for your suggestion. I actually used tblastn to search for the sequences. The problem is the missing residues in the sequence. I am 100% sure that my gene of interest is present in the other species. Out of 650 amino acids, I mostly get regions covering 600 amino acids, but this is not sufficient for generating the clone. What I don't understand is why, even at 50X coverage, there are still "NNNNNNNN" regions in the assemblies.
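
To check whether those N runs actually fall inside the exons that are missing, one option is to list the gap coordinates in the scaffold that carries the gene and compare them with the tblastn hit positions. Below is a minimal Python sketch of that idea, assuming the scaffold has been saved as a plain FASTA file. The function names `read_fasta` and `find_gap_runs` and the file name `scaffold.fasta` are hypothetical and not part of any existing tool.

```python
import re


def read_fasta(path):
    """Yield (name, sequence) pairs from a plain-text FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if name is not None:
                    yield name, "".join(chunks)
                header = line[1:].split()
                name, chunks = (header[0] if header else ""), []
            else:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def find_gap_runs(sequence, min_len=1):
    """Return (start, end, length) for each run of N/n in the sequence.

    Coordinates are 0-based and the end is exclusive.
    Runs shorter than min_len are skipped.
    """
    gaps = []
    for match in re.finditer(r"[Nn]+", sequence):
        length = match.end() - match.start()
        if length >= min_len:
            gaps.append((match.start(), match.end(), length))
    return gaps


# Example use (assumes a hypothetical file named scaffold.fasta):
# for name, seq in read_fasta("scaffold.fasta"):
#     for start, end, length in find_gap_runs(seq, min_len=10):
#         print(name, start, end, length)
```

If a gap overlaps the region where the tblastn alignment breaks off, the missing ~50 residues are most likely in unassembled sequence rather than absent from the genome, and the raw reads around that gap are the place to look.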