5.1 years ago
kkumarreddy
Hello everyone,
What is the systematic approach to finding protein-coding regions in whole-genome data when the RefSeq protein data is incomplete?
Thank You
kumar
Depending on the kind of genome you are working with, you would either need to run a gene prediction tool and then translate the predicted genes (eukaryotic), or find open reading frames and translate them (prokaryotic).
How did you decide that the RefSeq protein data is incomplete? Are you looking at RefSeq genomes or individual RefSeq protein records?
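For the prokaryotic case, "find open reading frames and translate them" can be sketched in a few lines of plain Python. This is only an illustration, assuming the standard genetic code, ATG as the only start codon, and a hypothetical `min_aa` length cutoff; the function names are made up for this example. It does not handle introns, so it is not a substitute for a gene prediction tool on eukaryotic genomes.

```python
# Minimal six-frame ORF scan (standard genetic code, ATG starts only).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate codon by codon; codons with N or other symbols become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(seq, min_aa=30):
    """Return (strand, frame, protein) for every ORF of at least min_aa residues.

    An ORF here runs from the first M after a stop codon (or the frame start)
    up to the next stop codon. Stretches that run off the end of the sequence
    without a stop codon are ignored.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            segments = translate(s[frame:]).split("*")
            for segment in segments[:-1]:  # last piece has no stop codon
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_aa:
                    orfs.append((strand, frame, segment[start:]))
    return orfs


if __name__ == "__main__":
    toy = "CCATGGCTGCTAAATAAG"  # made-up sequence for illustration
    for orf in find_orfs(toy, min_aa=3):
        print(orf)
```

Frames are reported as 0, 1, or 2 relative to the start of the strand being scanned.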
Thank you for the reply. I am working with eukaryotic organisms. What would be the input for gene prediction tools (SRA data or assembly data), along with our protein of interest?
I did a BLAST search for my gene against the RefSeq database, selecting my organism of interest. I checked the RefSeq genomes as well as the individual RefSeq protein records. The protein sequences have a few amino acids missing, and the genome files have "NNNNN" at the respective positions.
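To see exactly where the "NNNNN" stretches fall within a gene region, the runs of N in the extracted nucleotide sequence can be listed with a regular expression. A minimal sketch; the function name and the `min_len` parameter are hypothetical, and the example sequence is made up.

```python
import re


def find_gap_runs(seq, min_len=1):
    """Return (start, end) for each run of N/n at least min_len long.

    Coordinates are 0-based and end-exclusive, like Python slices.
    """
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", seq)
        if m.end() - m.start() >= min_len
    ]


if __name__ == "__main__":
    region = "ACGTNNNNNACGTNAC"  # toy sequence, not real data
    for start, end in find_gap_runs(region, min_len=2):
        print(f"gap at {start}-{end} ({end - start} bp)")
```

Note that the length of an N run in an assembly often reflects a placeholder gap size rather than the true number of missing bases.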
What are you trying to achieve? It sounds like the protein is present but the sequence might be truncated/missing due to incomplete assembly.
Do you have a WGS assembly with a complete nucleotide sequence of the gene of interest?
Thank you for the response. Yes, I do have a WGS assembly; however, a few regions of my gene of interest are missing due to incomplete assembly. To find those residues, I took the RefSeq protein sequence of interest (the truncated version) and BLASTed it against the whole-genome SRA reads using BLAST+, after processing them with sratoolkit. The problem is that I am not getting hits for a few of the protein regions. I get similar results even when I use the full sequence from a closely related organism as the query.
My aim is to find out the complete protein/cDNA sequence for my gene of interest in different organisms using WGS data.
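One way to pin down which parts of the protein get no read hits is to collect the query coordinates from BLAST tabular output and list the uncovered intervals. The sketch below assumes the default 12-column tabular format (`-outfmt 6`), where columns 7 and 8 are `qstart` and `qend`; the function names and the example lines are made up for illustration.

```python
def parse_query_intervals(lines):
    """Extract (qstart, qend) from default BLAST -outfmt 6 lines."""
    intervals = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        intervals.append((int(fields[6]), int(fields[7])))
    return intervals


def uncovered_regions(query_len, hits):
    """Return 1-based inclusive query intervals not covered by any hit."""
    gaps = []
    next_pos = 1  # first query position not yet covered
    for start, end in sorted((min(s, e), max(s, e)) for s, e in hits):
        if start > next_pos:
            gaps.append((next_pos, start - 1))
        next_pos = max(next_pos, end + 1)
    if next_pos <= query_len:
        gaps.append((next_pos, query_len))
    return gaps


if __name__ == "__main__":
    # Made-up hits for a 100-residue query protein.
    example = [
        "prot1\tread_a\t98.0\t30\t0\t0\t1\t30\t5\t94\t1e-10\t60.0",
        "prot1\tread_b\t95.0\t16\t0\t0\t25\t40\t3\t50\t1e-05\t35.0",
        "prot1\tread_c\t99.0\t30\t0\t0\t61\t90\t10\t99\t1e-12\t65.0",
    ]
    print(uncovered_regions(100, parse_query_intervals(example)))
```

With a protein query, the intervals are in amino acid coordinates, so the uncovered stretches can be compared directly against the positions missing from the truncated RefSeq protein.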
Right. Okay, that makes sense. From a very brief look, I think that gene is hard to assemble. You should consider long-read data, or maybe Sanger sequencing (custom primers to amplify the sequence, followed by Sanger sequencing), to try to get a better picture of the nucleotide composition in the regions that are missing. It doesn't sound like there is any more info on your gene of interest in public repos.
Thank you. Do you think there are any other approaches that I can explore? For a few of the organisms, there is close to 100X coverage. I am asking because I have a large list of organisms. Yes, I am exploring the option of long reads; for one organism I am using PacBio sequencing data, but I have yet to finish it. No, the full sequence of the gene is available for many organisms, and most of the RefSeq data for different organisms contains a truncated version of the protein sequence. Custom primers to amplify the sequence are my last option because of the number of organisms and the procedures involved in collecting biological samples.