I want to extract a list of sequences from NCBI. I am using the esearch command mentioned here. For a single gene symbol, I can do it like this:
esearch -db nuccore -q 'SS1G_01676[gene]' | efilter -source refseq -molecule genomic | efetch -format gene_fasta | awk -v RS='(^|\n)>' '/SS1G_01676/{print RT $0}'
I want to use a bash loop to extract a list of sequences. Below is what I have tried, but it doesn't yield any results. What am I missing here?
declare -a arr=("SS1G_03709" "SS1G_07286" "SS1G_04907")
for i in "${arr[@]}"
do
myquery="'${i}[gene]'"
echo "myid :" ${i}
echo "my query :" ${myquery}
esearch -db nuccore -q ${myquery} | efilter -source refseq -molecule genomic | efetch -format gene_fasta | awk -v RS='(^|\n)>' '/${i}/{print RT $0}' >>text.fasta
done
This is not a solution to the OP's issue. However, the code can be simplified. (I removed the awk part; that can be simplified too if the OP describes what he/she wants to do with the FASTA output.)
For getting fasta:
Input:
Thank you! With the awk part, I just wanted to grab the FASTA sequence (the reference gene sequence) for the corresponding gene symbol.
awk doesn't expand bash variables inside a single-quoted awk program. You need to pass the bash variable to awk as an awk variable. Try this (input remains the same as above): MAPK
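To illustrate the point about awk variables without touching NCBI, here is a minimal sketch. The helper names `filter_fasta_by_gene` and `sample_fasta` and the two FASTA records are made up for the demonstration; only the gene IDs come from the question. It also builds the query string without literal single quotes inside the value, since `"'${i}[gene]'"` stores the quote characters as part of the query.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print only the FASTA records whose header contains
# the given gene ID. The ID reaches awk through -v, because ${i} inside a
# single-quoted awk program is never expanded by bash.
filter_fasta_by_gene() {
    awk -v gene="$1" '
        /^>/ { keep = (index($0, gene) > 0) }  # header line: does it mention the ID?
        keep                                   # print header and sequence lines of kept records
    '
}

# Made-up two-record FASTA standing in for the efetch output.
sample_fasta() {
    printf '%s\n' \
        '>seq1 [gene=SS1G_03709]' \
        'ATGCATGC' \
        '>seq2 [gene=SS1G_07286]' \
        'GGGTTTAA'
}

for i in SS1G_03709 SS1G_07286; do
    query="${i}[gene]"   # no quote characters stored in the value
    echo "query: ${query}"
    sample_fasta | filter_fasta_by_gene "$i"
done
```

In the loop from the question, the same idea would presumably mean passing `"${query}"` (double-quoted) to esearch and piping the efetch output into the filter in place of the original awk call. That is untested against NCBI here.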
@cpad0112 Thank you, but this does not give me the FASTA for the rest of the genes (I only get it for the first gene).
Sometimes an alternative approach is better. Have you considered NCBI Batch Entrez?