Hi there, I am using esearch to query symbols within a file against NCBI. I am using the following command
cat test-file.txt | while read p; do echo $p; esearch -db gene -query '"$p"[Gene name] AND Homo sapiens[ORGN] AND alive[Prop]' | efetch -format tabular > results.txt; done;
In the test-file.txt I just have the id LAMB2P1.
The results I receive in the results.txt is:
> tax_id Org_name GeneID CurrentID Status **Symbol** Aliases description other_designations map_location chromosome genomic_nucleotide_accession.version start_position_on_the_genomic_accession end_position_on_the_genomic_accession orientation exon_count OMIM
> 9606 Homo sapiens 4948 0 live **OCA2** BEY, BEY1, BEY2, BOCA, D15S12,
> EYCL, EYCL2, EYCL3, HCL3, P, PED, SHEP1 OCA2 melanosomal transmembrane
> protein P protein|P-protein|eye color 2 (central brown)|eye color 3
> (brown)|hair color 3 (brown)|melanocyte-specific transporter
> protein|oculocutaneous albinism II (pink-eye dilution homolog,
> mouse)|pink-eyed dilution protein homolog|total brown iris
> pigmentation 15q12-q13.1 15 NC_000015.10 27719008 28099342 minus 30 611409
> 9606 Homo sapiens 8706 0 live **B3GALNT1** B3GALT3, GLCT3, GLOB, Gb4Cer,
> P, P1, beta3Gal-T3, galT3 beta-1,3-N-acetylgalactosaminyltransferase 1
> (globoside blood
> group) UDP-GalNAc:beta-1,3-N-acetylgalactosaminyltransferase 1|P
> antigen synthase|P blood group globoside|UDP-Gal:betaGlcNAc beta
> 1,3-galactosyltransferase, polypeptide 3 (Globoside blood
> group)|UDP-GalNAc:betaGlcNAc beta-1,3-galactosaminyltransferase,
> polypeptide 1 (Globoside blood
> group)|UDP-N-acetylgalactosamine:globotriaosylceramide
> beta-1,3-N-acetylgalactosaminyltransferase|b3Gal-T3|beta-1,3-GalNAc-T1|beta-1,3-GalTase
> 3|beta-1,3-galactosyltransferase
> 3|beta-3-Gx-T3|brainiac1|galactosylgalactosylglucosylceramide
> beta-D-acetyl-galactosaminyltransferase|globoside
> synthase|globotriaosylceramide
> 3-beta-N-acetylgalactosaminyltransferase 3q26.1 3 NC_000003.12 161083883 161105469 minus 10 603094
This is not correct because when I search the NCBI gene database I get a completely different record (https://www.ncbi.nlm.nih.gov/gene/?term=LAMB2P1) which is correct when I just run the esearch on the command line as follows:
esearch -db gene -query '"LAMB2P1"[Gene name] AND Homo sapiens[ORGN] AND alive[Prop]' | efetch -format tabular
Which gives me:
> tax_id Org_name GeneID CurrentID Status **Symbol** Aliases description other_designations map_location chromosome genomic_nucleotide_accession.version start_position_on_the_genomic_accession end_position_on_the_genomic_accession orientation exon_count OMIM
> 9606 Homo sapiens 22973 0 live **LAMB2P1** LAMB2L laminin subunit beta 2
> pseudogene 1 laminin, beta 2 pseudogene
> 1 3p21.31 3 NC_000003.12 49152859 49154401 minus 1
Any thoughts or ways that I can correct this would be much appreciated. I have to query 1k symbols and I would like to use bash commands to do so instead of querying one by one. Thanks!
This is also a heads up for those using esearch. The esearch command consumed the entire file (stdin), hence I got output only for the first line. I used
IFS=$'\n'; for next in $(cat file); do ..; done
to get it to work with a file as input.Good catch - this post has a bit more detail on what's going on: Bash loop using Edirect (esearch|efetch)
TL;DR: The problem is with the
while
loop, not withesearch
.