Question

esearch discrepancies using bash commands vs direct esearch command line

0

Entering edit mode

4.8 years ago

frcamacho ▴ 210

Hi there, I am using esearch to query symbols within a file against NCBI. I am using the following command

cat test-file.txt | while read p; do echo $p; esearch -db gene -query '"$p"[Gene name] AND Homo sapiens[ORGN] AND alive[Prop]' | efetch -format tabular > results.txt; done;

In the test-file.txt I just have the id LAMB2P1.

The results I receive in the results.txt is:

> tax_id    Org_name    GeneID  CurrentID   Status  **Symbol**  Aliases description other_designations  map_location    chromosome  genomic_nucleotide_accession.version    start_position_on_the_genomic_accession end_position_on_the_genomic_accession   orientation exon_count  OMIM    
> 9606  Homo sapiens    4948    0   live    **OCA2**    BEY, BEY1, BEY2, BOCA, D15S12,
> EYCL, EYCL2, EYCL3, HCL3, P, PED, SHEP1   OCA2 melanosomal transmembrane
> protein   P protein|P-protein|eye color 2 (central brown)|eye color 3
> (brown)|hair color 3 (brown)|melanocyte-specific transporter
> protein|oculocutaneous albinism II (pink-eye dilution homolog,
> mouse)|pink-eyed dilution protein homolog|total brown iris
> pigmentation  15q12-q13.1 15  NC_000015.10    27719008    28099342    minus   30  611409  
> 9606  Homo sapiens    8706    0   live    **B3GALNT1**    B3GALT3, GLCT3, GLOB, Gb4Cer,
> P, P1, beta3Gal-T3, galT3 beta-1,3-N-acetylgalactosaminyltransferase 1
> (globoside blood
> group)    UDP-GalNAc:beta-1,3-N-acetylgalactosaminyltransferase 1|P
> antigen synthase|P blood group globoside|UDP-Gal:betaGlcNAc beta
> 1,3-galactosyltransferase, polypeptide 3 (Globoside blood
> group)|UDP-GalNAc:betaGlcNAc beta-1,3-galactosaminyltransferase,
> polypeptide 1 (Globoside blood
> group)|UDP-N-acetylgalactosamine:globotriaosylceramide
> beta-1,3-N-acetylgalactosaminyltransferase|b3Gal-T3|beta-1,3-GalNAc-T1|beta-1,3-GalTase
> 3|beta-1,3-galactosyltransferase
> 3|beta-3-Gx-T3|brainiac1|galactosylgalactosylglucosylceramide
> beta-D-acetyl-galactosaminyltransferase|globoside
> synthase|globotriaosylceramide
> 3-beta-N-acetylgalactosaminyltransferase  3q26.1  3   NC_000003.12    161083883   161105469   minus   10  603094

This is not correct because when I search the NCBI gene database I get a completely different record (https://www.ncbi.nlm.nih.gov/gene/?term=LAMB2P1) which is correct when I just run the esearch on the command line as follows:

esearch -db gene -query '"LAMB2P1"[Gene name] AND Homo sapiens[ORGN] AND alive[Prop]' | efetch -format tabular

Which gives me:

> tax_id    Org_name    GeneID  CurrentID   Status  **Symbol** Aliases  description other_designations  map_location    chromosome  genomic_nucleotide_accession.version    start_position_on_the_genomic_accession end_position_on_the_genomic_accession   orientation exon_count  OMIM
> 9606  Homo sapiens    22973   0   live    **LAMB2P1** LAMB2L  laminin subunit beta 2
> pseudogene 1  laminin, beta 2 pseudogene
> 1 3p21.31 3   NC_000003.12    49152859    49154401    minus   1

Any thoughts or ways that I can correct this would be much appreciated. I have to query 1k symbols and I would like to use bash commands to do so instead of querying one by one. Thanks!

esearch entrezdirect NCBI • 1.3k views

ADD COMMENT • link 4.8 years ago by frcamacho ▴ 210

1

Entering edit mode

This is also a heads up for those using esearch. The esearch command consumed the entire file (stdin), hence I got output only for the first line. I used IFS=$'\n'; for next in $(cat file); do ..; done to get it to work with a file as input.

ADD REPLY • link 4.8 years ago by frcamacho ▴ 210

1

Entering edit mode

Good catch - this post has a bit more detail on what's going on: Bash loop using Edirect (esearch|efetch)

TL;DR: The problem is with the while loop, not with esearch.

ADD REPLY • link 4.8 years ago by Ram 44k

Ram · Accepted Answer · 2020-02-04

3

Entering edit mode

4.8 years ago

Pierre Lindenbaum 164k

your using single quotes in your loop.

'"$p"[Gene name] AND Homo sapiens[ORGN] AND alive[Prop]'

variables like '$p' WON'T be replaced by their values.

see : https://stackoverflow.com/questions/6697753/difference-between-single-and-double-quotes-in-bash

ADD COMMENT • link 4.8 years ago by Pierre Lindenbaum 164k

2

Entering edit mode

Beat me to it. I was just running this test on bash:

bash-5.0$ p=LAMB2P1
bash-5.0$ echo $p
LAMB2P1
bash-5.0$ echo "$p"
LAMB2P1
bash-5.0$ echo '"$p"'
"$p"

ADD REPLY • link 4.8 years ago by Ram 44k

1

Entering edit mode

Thank you Pierre! Updated code and it works!

cat test-file.txt | while read p; do echo $p; esearch -db gene -query "$p[Gene name] AND Homo sapiens[ORGN] AND alive[Prop]" | efetch -format tabular > results.txt; done;

ADD REPLY • link updated 4.8 years ago by Ram 44k • written 4.8 years ago by frcamacho ▴ 210