Question

Retrieve all GenBank-PRI RNA molecules associated with a particular gene via entrez-direct

2.6 years ago

singlecell • 0

Hello everyone,

As stated in the post's title, I've been trying to retrieve the accessions of all GenBank (PRI division) RNA molecules associated with a particular human gene. For the TP53 gene, I tried the following command:

esearch -db gene -query '(TP53[Gene Name]) AND "homo sapiens"[Organism]' | elink -target nuccore | efilter -division pri | efetch -format docsum | xtract -Pattern DocumentSummary -if MolType -equals rna -element Caption

This search returns a list of 183 accessions. However, some other accessions that clearly fulfill the above criteria are not retrieved. For example, all of the accessions below are in the PRI division, are associated with TP53 in the "gene" feature of the GenBank file, and have the 'rna' MolType:

first example: OL856015.1

LOCUS       OL856015                1182 bp    mRNA    linear   PRI 18-JAN-2022
DEFINITION  Homo sapiens isolate TWH-2683-0-1 truncated mutant tumor protein
            p53 (TP53) mRNA, complete cds.
ACCESSION   OL856015
VERSION     OL856015.1
KEYWORDS    .
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1182)
  AUTHORS   Kwong,A., Ho,C.Y.S., Shin,V.Y., Law,F.B.F., Chan,C.T.L. and
            Ma,E.S.K.
  TITLE     Mutation spectrum in Hong Kong
  JOURNAL   Unpublished
REFERENCE   2  (bases 1 to 1182)
  AUTHORS   Kwong,A., Ho,C.Y.S., Shin,V.Y., Law,F.B.F., Chan,C.T.L. and
            Ma,E.S.K.
  TITLE     Direct Submission
  JOURNAL   Submitted (14-DEC-2021) Dept of Surgery, Breast Division, The
            University of Hong Kong, Rm1401, Queen Mary Hospital, Pokfulam,
            Hong Kong NA, Hong Kong
COMMENT     ##Assembly-Data-START##
            Assembly Method       :: BWA-MEM v. v.0.7.7
            Sequencing Technology :: Illumina; Sanger dideoxy sequencing
            ##Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..1182
                     /organism="Homo sapiens"
                     /mol_type="mRNA"
                     /isolate="TWH-2683-0-1"
                     /db_xref="taxon:9606"
                     /chromosome="17"
     gene            1..1182
                     /gene="TP53"
     CDS             1..1182
                     /gene="TP53"
                     /note="c.848G>A"
                     /codon_start=1
...

The same applies to MG595988.1 and at least 10 others.

Any idea of what might be going wrong?

I've also tried searching nucleotide directly (without linking from gene, using the Gene Name field and the rna_only filter), but that search does not return all the accessions either.

I'm trying to generalize this procedure for all protein-coding human genes. I haven't found any reliable solution so far; all the search combinations I've tried seem to miss some accessions. Any suggestions would be welcome.

Thanks

entrez-direct entrez genbank • 1.1k views

ADD COMMENT • link 2.6 years ago by singlecell • 0

Your query as posted above does not work for me. Are you sure it is correct?

What is your aim? Are you looking to get coding sequences, and hence the RNA selection? Are you only interested in primate sequences?

As for OL856015.1, that is a truncated version of the TP53 protein. MG595988 appears to be annotated as a mutant protein, so they may not be directly linked to the full TP53 records where you started your search.

If I do this then I can get all of them:

$ esearch -db nuccore -query '"TP53"[Gene Name] AND "homo sapiens"[Organism]' | efilter -molecule mrna | efetch -format acc | grep -e OL -e MG
OL856015.1
MG595994.1
MG595993.1
MG595992.1
MG595991.1
MG595990.1
MG595989.1
MG595988.1
MG595987.1
MG595986.1
MG595985.1
MG595984.1
MG595983.1
MG595982.1
MG595981.1
MG595980.1
MG595979.1

ADD REPLY • link 2.6 years ago by GenoMax 147k

Thanks a lot for your input.

You're absolutely right, there was a typo in the initial query; I have just corrected the original post. I'm sorry about that.

As for your approach, it still does not return all the desired accessions, I assume because there are other types of RNA as well. In the case of TP53 it returns 171 accessions, which is 12 fewer than the original search.

In the example of the HRK gene, your search wouldn't return the non-coding RNA accession NR_073189.3, and maybe other non-mRNA RNAs as well. Surprisingly, "efilter -molecule ncrna" doesn't work either.
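To see exactly which accessions differ between two searches, it can help to save each result to a file and compare the files. Here is a minimal sketch, assuming one accession per line. The function name `missing_from`, the file names and the `TOY…` accessions are placeholders for illustration, not real records.

```shell
missing_from() {
    # Print accessions listed in file $1 but absent from file $2.
    # comm requires sorted input, so both lists are sorted (and de-duplicated) first.
    _a=$(mktemp)
    _b=$(mktemp)
    LC_ALL=C sort -u "$1" > "$_a"
    LC_ALL=C sort -u "$2" > "$_b"
    LC_ALL=C comm -23 "$_a" "$_b"
    rm -f "$_a" "$_b"
}

# Placeholder data standing in for two saved search results
demo=$(mktemp -d)
printf '%s\n' TOY00003.1 TOY00001.1 TOY00002.1 > "$demo/gene_link.txt"
printf '%s\n' TOY00002.1 > "$demo/nuccore_direct.txt"

# Accessions found through the gene link but not by the direct nuccore search
missing_from "$demo/gene_link.txt" "$demo/nuccore_direct.txt"
rm -rf "$demo"
```

With real data, each file would be the output of one pipeline ending in `efetch -format acc`, redirected to disk.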

"As for OL856015.1, that is a truncated version of the TP53 protein. MG595988 appears to be annotated as a mutant protein, so they may not be directly linked to the full TP53 records where you started your search."

That's true, but other truncated/mutated versions are retrieved normally (DQ485152, for example, is one of the 183 accessions of the original post).

"What is your aim? Are you looking to get coding sequences, and hence the RNA selection? Are you only interested in primate sequences?"

I would like to retrieve all RNA sequences stored in Nucleotide (reference or not, coding or not, complete or not, mutated or not, but not ESTs - that's why I chose PRI) corresponding to the human TP53 gene.

ADD REPLY • link 2.6 years ago by singlecell • 0

See below. Whatever search you run, you will need to filter out NC_* accessions, since those are entire chromosomes.

$ esearch -db nuccore -query '"HRK"[Gene Name] AND "homo sapiens"[Organism]' | efetch -format acc
NC_060936.1
NC_000012.12
NR_073189.3
NM_003806.4
CM000263.1
CH471054.1
BC111923.1
U76376.1
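The NC_ filtering can be done on the accession list itself. Here is a minimal sketch, using the list above as input. The function name `drop_chromosomes` is made up for illustration, and the filter looks only at the accession prefix, so it does not check whether the remaining records are actually RNA.

```shell
drop_chromosomes() {
    # Read accessions on stdin and drop RefSeq chromosome records (NC_ prefix).
    grep -v '^NC_'
}

# The HRK accession list from the search above, piped through the filter
printf '%s\n' NC_060936.1 NC_000012.12 NR_073189.3 NM_003806.4 \
    CM000263.1 CH471054.1 BC111923.1 U76376.1 | drop_chromosomes
```

In a real pipeline the function would sit after `efetch -format acc`. Anything that survives the prefix filter would still need its MolType checked in the docsum.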

That NR accession is simply annotated as mol type RNA:

$ efetch -db nuccore -id NR_073189 -format docsum | xtract -pattern DocumentSummary -element Biomol,MolType
transcribed-RNA rna

PRI means the primate division, not primary entry.

ADD REPLY • link 2.6 years ago by GenoMax 147k

Thank you for your response.

Unfortunately, this approach does not work either. There are two problems:

a) "Gene Name" does not always correspond to the HGNC symbol. For example, your search applied to the BACH1 gene, using "BACH1" as the gene name, also returns accessions for RNAs of the BRIP1 gene (XM_047436901.1, for example), which happens to have BACH1 as a secondary, non-HGNC name (ugh!).

b) I've already tested this approach for the DDC gene and it still misses many related PRI RNA accessions (AK298321.1, for example).

As I'm trying to automate the procedure for every protein-coding human gene, I need an approach that works every time for every gene tested (maybe that's too ambitious!).

(The main reason I'm searching the PRI division specifically is to filter out EST accessions. I just realized I might need to include the HTC division as well.)
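Whichever query ends up being used, the per-gene automation itself can be sketched as a small loop. This is only an illustration: the function names `gene_query` and `fetch_rna_accessions` are made up, the query and pipeline are copied from the nuccore search posted earlier in this thread, and both limitations a) and b) above still apply to it.

```shell
gene_query() {
    # Build the Entrez query string for one gene symbol ($1), following the
    # '"TP53"[Gene Name] AND "homo sapiens"[Organism]' form used in this thread.
    printf '"%s"[Gene Name] AND "homo sapiens"[Organism]\n' "$1"
}

fetch_rna_accessions() {
    # Needs EDirect on PATH and network access; defined here but not called.
    esearch -db nuccore -query "$(gene_query "$1")" |
        efilter -molecule mrna |
        efetch -format acc
}

# Print the query that would be sent for each symbol in a short example list
for gene in TP53 HRK DDC; do
    gene_query "$gene"
done
```

To actually run the searches, the loop body would call `fetch_rna_accessions "$gene" > "$gene.acc.txt"` instead (the output file name is again just an example).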

ADD REPLY • link 2.6 years ago by singlecell • 0

Searches are only as good as the underlying data so you may not always get what you wanted. You could start with all human RNA and then whittle those down further.

$ esearch -db nuccore -query "rna [MolType] AND human [orgn]" 
<ENTREZ_DIRECT>
  <Db>nuccore</Db>
  <WebEnv>MCID_62584e41eb94fb1cf726f5f4</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>1591470</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

The second column of the output below holds the nomenclature symbol, so filtering on that column would be one way to focus on HGNC symbols.

$ esearch -db nuccore -query "rna [MolType] AND human [orgn]" | elink -target gene | esummary | xtract -pattern DocumentSummary -element Name,NomenclatureSymbol,OtherAliases,Accession | grep BACH1
BRIP1   BRIP1   BACH1, FANCJ, OF
BACH1   BACH1   BACH-1, BTBD24
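Because `grep BACH1` also matches the alias column, the BRIP1 row shows up too. Here is a minimal sketch of an exact match on the second column instead, assuming the xtract output is tab-separated. The function name `by_symbol` is made up for illustration, and the sample rows are the two lines shown above.

```shell
by_symbol() {
    # $1 = official symbol; stdin = tab-separated rows whose second column
    # is the NomenclatureSymbol. Keep only rows where that column equals $1.
    awk -F '\t' -v sym="$1" '$2 == sym'
}

# The two rows from the output above; only the real BACH1 row is kept
printf 'BRIP1\tBRIP1\tBACH1, FANCJ, OF\nBACH1\tBACH1\tBACH-1, BTBD24\n' | by_symbol BACH1
```

In the pipeline above, `by_symbol BACH1` would take the place of `grep BACH1`.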

vkkodali_ncbi generally has some good insights on EntrezDirect and may swing by to provide a useful answer. Good luck with your quest.

ADD REPLY • link 2.6 years ago by GenoMax 147k

Thanks a lot for your recommendations. Much appreciated.

ADD REPLY • link 2.6 years ago by singlecell • 0

How is https://www.ncbi.nlm.nih.gov/nuccore/AK298321 related to DDC?

ADD REPLY • link 2.6 years ago by GenoMax 147k

This accession was retrieved from the following search:

esearch -db gene -query 'DDC[Gene Name] AND Homo sapiens[Organism]' | elink -target nuccore | efilter -molecule mrna -division pri | efetch -format acc

The only parts of the gb file which (indirectly?) refer to DDC (the protein product) are:

DEFINITION  Homo sapiens cDNA FLJ51516 complete cds, highly similar to
            Aromatic-L-amino-acid decarboxylase

...

/note="unnamed protein product; highly similar to
                     Aromatic-L-amino-acid decarboxylase (EC 4.1.1.28)"

There is no direct association with the DDC gene in the features section.

ADD REPLY • link 2.6 years ago by singlecell • 0