question for dowloading Protein fasta sequence by edirect using while
2
0
Entering edit mode
1 day ago
诗友 • 0

hello everyone,

I'm a student major in immunolgy.

I want to download some protein fasta sequence based on some symbol, like LOC8030992, the species is Ixodes scapularis.

the chatGPT provide a method for me, like

cat gene_list.txt | while read gene; 
do     
     esearch -db gene -query "$gene [GENE] AND Ixodes scapularis [ORGN]" |   \
     elink -target protein |  \
     efetch -format fasta >> Ixodes_proteins.fasta; 
done

and my gene_list.txt looks like :

$ cat gene_list.txt 
LOC8030992
LOC8033311
LOC121835630
LOC121835700
LOC121999999
LOC8033311

the question is the script just run using the first line LOC8030992 and stop.

my files is ok, because

$ cat gene_list.txt | while read gene; do  echo $gene;  done
LOC8030992
LOC8033311
LOC121835630
LOC121835700
LOC121999999
LOC8033311

the top2 symbols LOC8030992,LOC8033311 are protein coding genes but the 3rd to 5th LOC121835630,LOC121835700,LOC121999999 are ncRNA.

why the circulation is not executed correctly?

edirect • 176 views
ADD COMMENT
1
Entering edit mode
22 hours ago
GenoMax 149k

LOC121835630,LOC121835700,LOC121999999 are ncRNA.

Since these are ncRNA they are not going to be found in the protein database. You will need to use nuccore database for those.

There is an additional issue, based on what I see below this is a predicted gene (so computationally predicted, XR* accession) as a result it will not be in the genes database.

$ esearch -db nuccore -query LOC121999999 | efetch -format fasta | grep ">"
>XR_006116882.1 PREDICTED: Zingiber officinale uncharacterized LOC121999999 (LOC121999999), ncRNA
>NC_055998.1 Zingiber officinale cultivar Zhangliang chromosome 7A, Zo_v1.1, whole genome shotgun sequence
ADD COMMENT
1
Entering edit mode

Two nuances --

  1. When you know that the query string is a gene symbol, it's better to search nuccore database as -query 'LOC121999999[GENE]' so that you look for the string in the gene symbol field. This is not an issue here because LOC121999999 string would not be present anywhere else other than the gene symbol field. But if it were something like 'HexA' then the search would look for this string in every possible field.
  2. When you run the search against nuccore with a gene symbol as the query, it will return both genomic and transcript accessions. In this case, the NC_055998.1 is the chromosome on which the gene is annotated and XR_006116882.1 is the transcript that represents this gene. If you want only the transcript accessions, you can add biomol_rna[PROPERTIES] to the query.
    $ esearch -db nuccore -query 'LOC121999999[GENE]' | efetch -format acc 
    XR_006116882.1
    NC_055998.1
    $ esearch -db nuccore -query 'LOC121999999[GENE] biomol_rna[PROPERTIES]' | efetch -format acc 
    XR_006116882.1
    
    You still need to be careful with this. Because the esearch -db nuccore command will return both RefSeq and GenBank accessions. If you want only RefSeq data, you will want to either use a RefSeq filter or elink as suggested by ChatGPT.
ADD REPLY
1
Entering edit mode
12 hours ago
vkkodali_ncbi ★ 3.8k

the question is the script just run using the first line LOC8030992 and stop.

You need to use a </dev/null with your while loop. See EntrezDirect page (search for "While Loop") to see a little bit more information.

That said, I recommend using NCBI Datasets for this. Download the command line tool (see documentation) and do the following:

$ datasets download gene symbol --inputfile gene_list.txt --taxon 'Ixodes scapularis' --include protein 
Collecting 4 gene records [================================================] 100% 4/4
Downloading: ncbi_dataset.zip    4.79kB valid zip structure -- files not checked
Validating package [================================================] 100% 5/5
$ unzip -v ncbi_dataset.zip 
Archive:  ncbi_dataset.zip
 Length   Method    Size  Cmpr    Date    Time   CRC-32   Name
--------  ------  ------- ---- ---------- ----- --------  ----
    1593  Defl:N      764  52% 2025-03-12 20:24 c5d88f1b  README.md
    3163  Defl:N     1907  40% 2025-03-12 20:24 b8a7086d  ncbi_dataset/data/protein.faa
    3066  Defl:N      835  73% 2025-03-12 20:24 6427788f  ncbi_dataset/data/data_report.jsonl
     328  Defl:N      183  44% 2025-03-12 20:24 2e55327b  ncbi_dataset/data/dataset_catalog.json
     207  Defl:N      133  36% 2025-03-12 20:24 dd6160ee  md5sum.txt
--------          -------  ---                            -------
    8357             3822  54%                            5 files
$ unzip ncbi_dataset.zip ncbi_dataset/data/protein.faa
Archive:  ncbi_dataset.zip
  inflating: ncbi_dataset/data/protein.faa  
$ seqkit stats ncbi_dataset/data/protein.faa
file                           format  type     num_seqs  sum_len  min_len  avg_len  max_len
ncbi_dataset/data/protein.faa  FASTA   Protein         2    2,979      464  1,489.5    2,515

Why are there only two proteins in the output file? The input gene_list.txt file you provided has LOC8033311 present twice. LOC121999999 is not an Ixodes gene, so that did not make it to the final output. Out of the remaining 4 genes, only two are protein coding.

ADD COMMENT

Login before adding your answer.

Traffic: 2629 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6