How do I loop over a list of gene symbols to extract sequences from NCBI?
1
2
Entering edit mode
6.1 years ago
MAPK ★ 2.1k

I want to extract a list of sequences from NCBI. I am esearch command mentioned here. For one gene symbol, I could do it like this:

esearch -db nuccore -q 'SS1G_01676[gene]' | efilter -source refseq -molecule genomic | efetch -format gene_fasta |  awk -v RS='(^|\n)>' '/SS1G_01676/{print RT $0}'

I want to use a bash loop to extract a list of sequences and below is what I have tried, but wouldn't yield any results. What am I missing here?

declare -a arr=("SS1G_03709" "SS1G_07286" "SS1G_04907")
for i in "${arr[@]}"
do
myquery="'${i}[gene]'"
echo "myid :" ${i}
echo "my query :" ${myquery}
esearch -db nuccore -q ${myquery} | efilter -source refseq -molecule genomic | efetch -format gene_fasta |  awk -v RS='(^|\n)>' '/${i}/{print RT $0}' >>text.fasta
done
ncbi eutils bash • 3.0k views
ADD COMMENT
3
Entering edit mode

This is not a solution to OP issue. However code can be simplified ( I removed awk part of it and that can be simplified if OP describes what he/she wants to do with fasta output:)

For getting fasta:

$ cat test.txt | while read line; do esearch -db nuccore -q $line| efilter -source refseq -molecule genomic | efetch -format gene_fasta ; done

Input:

$ cat test.txt 
SS1G_03709
SS1G_07286
SS1G_04907
ADD REPLY
0
Entering edit mode

Thank you! With the awk part, I just wanted to grab fasta sequence (reference gene sequence) for the corresponding gene symbol.

ADD REPLY
1
Entering edit mode

awk doesn't take bash variable. You need to convert bash variable to awk variable. Try this (input remains the same as above): MAPK

$ while read line; do esearch -db nuccore -q $line| efilter -source refseq -molecule genomic | efetch -format gene_fasta | awk -v r="$line"  'BEGIN {RS="(^|\n)>"} $0 ~ r {print ">" $0}'  ; done < test.txt

>lcl|NW_001820832.1_gene_616 [locus_tag=SS1G_03709] [db_xref=GeneID:5491341] [partial=5',3'] [location=<1605066..>1605750] [gbkey=Gene]
ATGCATTTCTCAACTGCAAAAACGCTTCTTCCTCTCGCAGTTCTAGTTTCCTATACCACCGCTCAAACAA
CAGCTGCAGCACCACCTGTTGCTAGTGCTCCTACAGGCGGCACTTCTAGTACTTGTCTCGGACAAAATGT
TCTAGATGCTTGTTTGAGTTCTACGACGGCTATTGCTCAGGCATGTCAACCTACCGATTATAATTGTTTA
TGCACAAGTTGGAATGCAGTTTTAACGTACGCTCCTTCCTTATTCACTATCCTTCTAATCCCTTCTCTCC
TCCATCAAATCCCCTTCATTATACTAATTCTCCCCTCCCTTTCTCAGATGCTACAACCAATGTCCCAACG
ACGCAGGTTACGCGGGCGCTCTATCTAATAAACAAACTTATTGTAATAATGCATCAGTTTATACATCTAC
TTCCTCATCAGCTATCTCGAGAGATTGGCCTACGTCTGCTACTGCTACTGCGGCGGGTGGTGCTACTGGT
GCAGGTGCCGGTGCAAGAGCAGGTGCGACAAGTACAGCAGCGAGAGCGAGTTCTATGGGCAGAAATGAGA
GTAAGACGGCTAGTGCGACGGGGGCGGAGAGTAGTGATAAGAGTAGTGGGGCTGGGGAGAGAGGGGTTAG
TGGATTGGTGGGTGTTATTGGGGGTGTGGGGGTGGTGGTGGTGGGGTTTTTGTAG
ADD REPLY
0
Entering edit mode

@cpad0112 Thank you, but this would not give me the fasta for the rest of the genes (you get only for the first gene).

ADD REPLY
1
Entering edit mode

Sometimes an alternative approach is better. Have you considered NCBI Batch Entrez?

ADD REPLY
4
Entering edit mode
6.1 years ago
vkkodali_ncbi ★ 3.8k

You are right, it stops after the first gene. I am not sure why. (EDIT: see explanation below) This works though:

for line in `cat temp.txt`; do esearch -db nuccore -q $line| efilter -source refseq -molecule genomic | efetch -format gene_fasta | awk -v r="$line"  'BEGIN {RS="(^|\n)>"} $0 ~ r {print ">" $0}'  ; done

Apparently, for while loops involving esearch one should add < /dev/null to prevent esearch from "draining" the remaining lines from stdin. See documentation here: https://www.ncbi.nlm.nih.gov/books/NBK179288/#chapter6.While_Loop

ADD COMMENT

Login before adding your answer.

Traffic: 2488 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6