I have a list of several hundred bacterial protein identifiers (e.g., WP_040242412.1), and I need to find which ORF lies immediately downstream of each one. Using Entrez Direct I can get the genome identifier, coordinates and strand orientation of the ORF for each protein (e.g., WP_040242412.1: NZ_CDGZ01000012.1 109108 110151 -), but I do not know how to proceed from there and query the databases for the downstream ORF.
Any suggestions on how to solve this will be appreciated.
I would do the following:
1. Get the entire mRNA sequence for each protein of interest; it must include the 5'UTR, CDS and 3'UTR.
2. Then, using deep sequencing data, check whether you get any coverage in the 3'UTR.
3. If you get any coverage, extract the covered regions and validate them computationally (BLAST, InterPro analysis, database searching, etc.).
The above steps are not the gold standard, but they can be a good starting point.
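As a rough illustration of steps 2 and 3, here is a minimal Python sketch. It assumes the coordinates from Entrez Direct are 1-based and inclusive, that you know the contig length, and that you already have per-base read depth for the region (for example, exported from your alignment). The function names, the 500 bp window and the depth/length thresholds are all hypothetical choices for the example, not values from any tool.

```python
from typing import List, Optional, Tuple


def downstream_window(start: int, end: int, strand: str,
                      length: int, contig_len: int) -> Optional[Tuple[int, int]]:
    """Return 1-based inclusive coordinates of the region 3' of an ORF.

    On the '+' strand, downstream means higher coordinates; on the '-'
    strand, it means lower coordinates. Returns None if the ORF sits at
    the contig edge and no downstream sequence is left.
    """
    if strand == "+":
        lo, hi = end + 1, min(end + length, contig_len)
    elif strand == "-":
        lo, hi = max(start - length, 1), start - 1
    else:
        raise ValueError(f"strand must be '+' or '-', got {strand!r}")
    return (lo, hi) if lo <= hi else None


def covered_intervals(depth: List[int], offset: int,
                      min_depth: int = 5, min_len: int = 30) -> List[Tuple[int, int]]:
    """Find contiguous runs with read depth >= min_depth.

    depth[i] is the depth at genomic position offset + i. Runs shorter
    than min_len are discarded. Returns 1-based inclusive intervals.
    """
    intervals = []
    run_start = None  # index where the current covered run began
    for i, d in enumerate(depth):
        if d >= min_depth:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            if i - run_start >= min_len:
                intervals.append((offset + run_start, offset + i - 1))
            run_start = None
    # Close a run that extends to the end of the depth array.
    if run_start is not None and len(depth) - run_start >= min_len:
        intervals.append((offset + run_start, offset + len(depth) - 1))
    return intervals


# ORF from the question: minus strand, so downstream is below 109108.
# contig_len=200000 is a placeholder; use the real contig length.
window = downstream_window(109108, 110151, "-", length=500, contig_len=200000)
print(window)
```

The intervals returned by `covered_intervals` are the regions you would then extract and check with BLAST or a domain search. Keep in mind that for a minus-strand ORF the extracted sequence has to be reverse-complemented before translation.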