Hello everyone,
I'm a student majoring in immunology.
I want to download protein FASTA sequences for a list of gene symbols such as LOC8030992; the species is Ixodes scapularis.
ChatGPT suggested the following method:
cat gene_list.txt | while read gene;
do
esearch -db gene -query "$gene [GENE] AND Ixodes scapularis [ORGN]" | \
elink -target protein | \
efetch -format fasta >> Ixodes_proteins.fasta;
done
My gene_list.txt looks like this:
$ cat gene_list.txt
LOC8030992
LOC8033311
LOC121835630
LOC121835700
LOC121999999
LOC8033311
The problem is that the script only processes the first line, LOC8030992, and then stops.
The file itself is fine, because a plain loop prints every line:
$ cat gene_list.txt | while read gene; do echo $gene; done
LOC8030992
LOC8033311
LOC121835630
LOC121835700
LOC121999999
LOC8033311
The first two symbols, LOC8030992 and LOC8033311, are protein-coding genes, but the third to fifth (LOC121835630, LOC121835700, LOC121999999) are ncRNA.
Why does the loop not run over every line?
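One likely cause, offered as a sketch rather than a confirmed diagnosis: a command inside a `while read` loop that itself reads standard input will swallow the remaining lines of the list, so the loop ends after the first iteration. EDirect commands such as esearch can behave this way. The example below reproduces the effect without any network access; `consume_stdin` is a hypothetical stand-in for such a command and is not part of EDirect.

```shell
# Hypothetical stand-in for a tool that reads standard input.
consume_stdin() {
    cat > /dev/null          # drains whatever is left on stdin
    echo "processed $1"
}

# Without a redirect, the stand-in eats the rest of the list,
# so the loop body runs only once.
broken=$(printf 'LOC8030992\nLOC8033311\nLOC121835630\n' |
    while read -r gene; do
        consume_stdin "$gene"
    done | wc -l | tr -d ' ')

# With stdin redirected from /dev/null, the inner command has nothing
# to consume and the loop sees every line.
fixed=$(printf 'LOC8030992\nLOC8033311\nLOC121835630\n' |
    while read -r gene; do
        consume_stdin "$gene" < /dev/null
    done | wc -l | tr -d ' ')

echo "without redirect: $broken iteration(s); with < /dev/null: $fixed iteration(s)"
```

If this is indeed what is happening, appending `< /dev/null` to the esearch line of the original loop should let it reach every gene in the list. A gene with no linked protein, such as an ncRNA, would then simply contribute nothing to the output file.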
Two nuances --
Use
-query 'LOC121999999[GENE]'
so that you look for the string in the gene symbol field. This is not an issue here because the string LOC121999999 would not be present anywhere other than the gene symbol field. But if it were something like 'HexA', then without the field qualifier the search would look for this string in every possible field.
Add
biomol_rna[PROPERTIES]
to the query. You still need to be careful with this, because the esearch -db nuccore command will return both RefSeq and GenBank accessions. If you want only RefSeq data, you will want to either use a RefSeq filter or elink as suggested by ChatGPT.
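As a small illustration of the field-qualifier point, the query string can be assembled in one place so that every term carries its field tag. `build_query` is a hypothetical helper written for this sketch, not an EDirect command.

```shell
# Hypothetical helper: builds a field-qualified Entrez query string.
build_query() {
    # $1 = gene symbol, $2 = organism name
    printf '%s[GENE] AND %s[ORGN]' "$1" "$2"
}

query=$(build_query "LOC8030992" "Ixodes scapularis")
echo "$query"

# Possible use (needs EDirect and network access, so it is left commented out):
# esearch -db gene -query "$query" | elink -target protein | efetch -format fasta
```

Keeping the tags inside the helper means a symbol like 'HexA' cannot accidentally be searched across all fields.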