Hello everyone,
As stated in the post's title, I've been trying to retrieve the accessions of all GenBank (PRI division) RNA molecules associated with a particular human gene. In the case of the TP53 gene, I've tried the following command:
esearch -db gene -query '(TP53[Gene Name]) AND "homo sapiens"[Organism]' | elink -target nuccore | efilter -division pri | efetch -format docsum | xtract -Pattern DocumentSummary -if MolType -equals rna -element Caption
This search returns a list of 183 accessions. Nevertheless, some other accessions that obviously fulfill the above criteria fail to be retrieved. For example, all of the accessions below are part of PRI, are associated with TP53 in the "gene" section of the gb file, and are of 'rna' MolType:
first example: OL856015.1
LOCUS OL856015 1182 bp mRNA linear PRI 18-JAN-2022
DEFINITION Homo sapiens isolate TWH-2683-0-1 truncated mutant tumor protein
p53 (TP53) mRNA, complete cds.
ACCESSION OL856015
VERSION OL856015.1
KEYWORDS .
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 1182)
AUTHORS Kwong,A., Ho,C.Y.S., Shin,V.Y., Law,F.B.F., Chan,C.T.L. and
Ma,E.S.K.
TITLE Mutation spectrum in Hong Kong
JOURNAL Unpublished
REFERENCE 2 (bases 1 to 1182)
AUTHORS Kwong,A., Ho,C.Y.S., Shin,V.Y., Law,F.B.F., Chan,C.T.L. and
Ma,E.S.K.
TITLE Direct Submission
JOURNAL Submitted (14-DEC-2021) Dept of Surgery, Breast Division, The
University of Hong Kong, Rm1401, Queen Mary Hospital, Pokfulam,
Hong Kong NA, Hong Kong
COMMENT ##Assembly-Data-START##
Assembly Method :: BWA-MEM v. v.0.7.7
Sequencing Technology :: Illumina; Sanger dideoxy sequencing
##Assembly-Data-END##
FEATURES Location/Qualifiers
source 1..1182
/organism="Homo sapiens"
/mol_type="mRNA"
/isolate="TWH-2683-0-1"
/db_xref="taxon:9606"
/chromosome="17"
gene 1..1182
/gene="TP53"
CDS 1..1182
/gene="TP53"
/note="c.848G>A"
/codon_start=1
...
Same thing for MG595988.1 and at least 10 others
Any idea of what might be going wrong?
I've tried searching nucleotide directly (without linking from gene, using the Gene Name and rna_only filter), but this search does not return all the accessions either.
I'm trying to generalize this procedure for all protein-coding human genes. I haven't found any reliable solution so far; all the search combinations I've tried seem to be missing some accessions. Any suggestion would be welcome.
Thanks
Your query as posted above does not work for me. Are you sure it is correct?
What is your aim? Are you looking to get the coding sequence, hence the RNA selection? Are you only interested in primate sequences?
As for OL856015.1, that is a truncated version of the TP53 protein. MG595988 appears to be annotated as a mutant protein, so they may not be directly linked to the full TP53 records where you started your search.
If I do this then I can get all of them:
Thanks a lot for your input.
You're absolutely right, there was a typo in the initial query, I just corrected the original post. I'm sorry for that.
As for your approach, it still does not return all the desired accessions because, I assume, there are other types of RNA as well. In the case of TP53 it returns 171 accessions, which is 12 fewer than the original 183.
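To pin down exactly which accessions a given query misses, one option is to save each result list to a file and take the set difference. This is only a minimal sketch: it assumes each file holds one accession per line, and the function name `missing_accessions` is made up for illustration.

```shell
# Print accessions that appear in the first file but not in the second.
# Version suffixes (".1", ".2", ...) are stripped before comparing, so
# OL856015 and OL856015.1 count as the same accession.
missing_accessions() {
    awk '
        { acc = $1; sub(/\.[0-9]+$/, "", acc) }        # drop version suffix
        FILENAME == ARGV[1] { seen[acc] = 1; next }    # awk reads the second list first
        acc != "" && !(acc in seen) { print acc }      # then reports what it lacks
    ' "$2" "$1"
}
```

For example, `missing_accessions expected.txt found.txt` (both file names hypothetical) would list the accessions present in the larger set but absent from the smaller one.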
In the example of the HRK gene, your search wouldn't return the non-coding RNA accession NR_073189.3 and maybe other non-mRNA RNAs as well. Surprisingly, "efilter -molecule ncrna" doesn't work either.
That's true, but other truncated/mutated versions are retrieved normally (DQ485152, for example, is one of the 183 accessions of the original post).
I would like to retrieve all RNA sequences stored in Nucleotide (reference or not, coding or not, complete or not, mutated or not, but not ESTs - that's why I chose PRI) corresponding to the human TP53 gene.
See below. You will need to filter out NC* accessions (from any searches you do), since they are entire chromosomes.
That NR accession is simply annotated as mol type RNA.
PRI means Primate division, not primary entry.
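Dropping the chromosome records can be done on the accession list itself. A minimal sketch, assuming one accession per line on standard input and that the chromosome records carry the RefSeq `NC_` prefix; the function name `drop_chromosome_accessions` is hypothetical.

```shell
# Remove whole-chromosome RefSeq records (accessions starting with "NC_")
# from a list of accessions read on stdin, keeping everything else.
drop_chromosome_accessions() {
    awk '!/^NC_/'
}
```

It can be appended to the end of any pipeline that prints accessions, one per line.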
Thank you for your response.
Unfortunately, this approach does not work either. There are two problems:
a) "Gene name" does not always correspond to the HGNC name. For example, your search applied to the BACH1 gene, using "BACH1" as gene name, would return various accessions of the BRIP1 gene RNAs (XM_047436901.1, for example), which happens to share BACH1 as a secondary, non-HGNC name (ugh!)
b) I've already tested this approach for the DDC gene and it still misses many related PRI RNA accessions (AK298321.1, for example)
As I'm trying to automate the procedure for every protein-coding human gene, I need an approach that will work every time for every gene tested (maybe that's too ambitious!)
(The main reason I'm searching the PRI division specifically is to filter out EST accessions. Just figured out I might need to include the HTC division as well)
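If the division filter has to cover more than one division, it could also be applied after download by reading the LOCUS line of each GenBank flat-file record. The sketch below assumes the modern LOCUS layout shown in the example record above (molecule type in the fifth field, division in the second-to-last field); the function name `rna_loci_pri_htc` is invented here.

```shell
# Read GenBank flat-file text on stdin and print the locus name of every
# record whose division is PRI or HTC and whose molecule type contains "RNA"
# (mRNA, ncRNA, RNA, ...). Assumes the LOCUS line layout:
#   LOCUS name length bp moltype topology division date
rna_loci_pri_htc() {
    awk '
        $1 == "LOCUS" {
            division = $(NF - 1)
            moltype  = $5
            if ((division == "PRI" || division == "HTC") && moltype ~ /RNA/)
                print $2
        }
    '
}
```

Piping the output of `efetch -format gb` into it would be one way to use it, at the cost of downloading full records rather than summaries.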
Searches are only as good as the underlying data so you may not always get what you wanted. You could start with all human RNA and then whittle those down further.
Second column here reflects nomenclature symbols so that would be one way to focus on HGNC symbols.
vkkodali_ncbi generally has some good insights on EntrezDirect and may swing by to provide a useful answer. Good luck with your quest.
Thanks a lot for your recommendations. Much appreciated.
How is https://www.ncbi.nlm.nih.gov/nuccore/AK298321 related to DDC?
This accession was retrieved from the following search:
The only parts of the gb file which (indirectly?) refer to DDC (the protein product) are:
...
There is no direct association with the DDC gene in the features section
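A quick way to check which gene symbols a record's feature table actually names is to pull out its /gene qualifiers. This is a rough sketch that assumes GenBank flat-file text on standard input; `list_gene_qualifiers` is a hypothetical name.

```shell
# Print the distinct values of /gene="..." qualifiers found in GenBank
# flat-file text read on stdin. No output means no feature names a gene.
list_gene_qualifiers() {
    sed -n 's|.*/gene="\([^"]*\)".*|\1|p' | sort -u
}
```

Run on the OL856015 record quoted at the top of the thread it would print TP53, whereas a record with no /gene qualifier in its features yields nothing.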