Hello Everyone,
I have a total of ~130,000 SRA accessions from which I need to retrieve the isolation source and the location.
$ head -n 10 SRAyk.txt
DRR095581
SRR11035504
SRR9016627
SRR5826819
SRR11032323
SRR6801753
SRR10144785
SRR12961276
SRR5927939
ERR2563030
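Here is the bash loop:
# Read one accession per line; esearch takes its stdin from /dev/null
# so it cannot consume the rest of the accession list.
while read -r i
do
location=$(esearch -db sra -query "$i" < /dev/null |
elink -db sra -target biosample -name sra_biosample |
esummary |
xtract -pattern DocumentSummary -group Attribute -if Attribute@harmonized_name -equals "isolation_source" -element Attribute -group Attribute -if Attribute@harmonized_name -equals "geo_loc_name" -element Attribute)
# Print the accession and whatever was retrieved (empty when the lookup failed)
printf '%s\t%s\n' "$i" "$location"
done < SRAyk.txt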
The problem I am facing is that esearch|elink|esummary|xtract skips some SRA accessions, and this behavior seems to be completely random. The same happens if I use epost instead of esearch.
Is there anything I can do to solve this problem?
The second problem I am facing is that I have too many accessions, and the job will probably take days to complete. The SRA accessions were recovered from MicrobeAtlas and for each of them I already have the latitude and longitude, but not the name of the location. From this huge list of SRA accessions, I am only interested in those coming from the USA.
I can probably reduce the number of SRA accessions by keeping only those with latitude values between 0<x<90 and longitude values between -180<x<0. Does that make sense?
Thank you!
P.S. I have already set up NCBI_API_KEY as an environment variable.
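The coordinate pre-filter described above could be sketched roughly as follows. This is only a sketch under assumptions: the input is assumed to be a tab-separated table with accession, latitude and longitude in columns 1 to 3 as plain decimal numbers (the MicrobeAtlas export may be laid out differently), and filter_by_box is a made-up helper name.

```bash
# Print the accession (column 1) of every row whose coordinates fall strictly
# inside the box. Assumed input on stdin: accession<TAB>latitude<TAB>longitude.
# Usage: filter_by_box LAT_MIN LAT_MAX LON_MIN LON_MAX < table.tsv
filter_by_box() {
    awk -F'\t' -v lat_lo="$1" -v lat_hi="$2" -v lon_lo="$3" -v lon_hi="$4" '
        # Skip headers and rows with missing or non-numeric coordinates
        $2 ~ /^-?[0-9.]+$/ && $3 ~ /^-?[0-9.]+$/ &&
        $2 + 0 > lat_lo + 0 && $2 + 0 < lat_hi + 0 &&
        $3 + 0 > lon_lo + 0 && $3 + 0 < lon_hi + 0 { print $1 }'
}
```

With the ranges from the question this would be called as filter_by_box 0 90 -180 0. That box covers the whole north-western quadrant, so it still keeps Canada, Mexico and Central America; a tighter, approximate box for the contiguous USA would be something like filter_by_box 24 50 -125 -66, which leaves out Alaska and Hawaii. Either way the location name still has to be checked afterwards, the box only shrinks the list.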
Doing 130,000 searches is probably running afoul of some search limits. Add some kind of wait between blocks.
This information may also be in SRA metadata files. It may be better to search those.
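The "wait between blocks" idea could look roughly like the sketch below. It is generic on purpose: run_in_blocks is a made-up helper that only splits the list and pauses, and the command it runs on each block is whatever pipeline you supply, so nothing here is specific to EDirect.

```bash
# Feed a list to a command in fixed-size blocks, pausing between blocks.
# Usage: run_in_blocks LIST BLOCK_SIZE PAUSE_SECONDS COMMAND [ARGS...]
# Each block is handed to COMMAND on its standard input.
run_in_blocks() {
    local list="$1" size="$2" pause="$3"
    shift 3
    local tmpdir block first=1
    tmpdir=$(mktemp -d) || return 1
    # split writes BLOCK_SIZE lines per file: blk.aaaa, blk.aaab, ...
    split -a 4 -l "$size" "$list" "$tmpdir/blk."
    for block in "$tmpdir"/blk.*; do
        [ -e "$block" ] || continue          # nothing to do for an empty list
        if [ "$first" -eq 0 ]; then
            sleep "$pause"                   # wait before every block but the first
        fi
        first=0
        "$@" < "$block"
    done
    rm -rf "$tmpdir"
}
```

For example, run_in_blocks SRAyk.txt 200 5 my_lookup would call a (hypothetical) function my_lookup on 200 accessions at a time with 5 seconds in between; the block size and the pause are guesses that would need tuning.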
Hi GenoMax,
I did a bunch of tries with just 100 SRA accessions and with sleep 3s at the end of the loop. It did not really solve the problem. Using the same accessions, the performance seems to get worse after the first try: 1) 10 missed; 2) 30 missed; 3) 28 missed; 4) 24 missed; 5) 26 missed.
I also checked the SRA metadata file, and the location doesn't seem to be reported in those files.
Because the behavior looks totally random, I should probably contact the e-utilities help desk.
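Since the misses change from run to run, one workaround is to collect the accessions that came back empty and run only those again. A minimal sketch, assuming the loop output was saved to a file with the accession in column 1 and the retrieved values after a tab (which is what the loop above prints); missed_accessions is a made-up helper name.

```bash
# List the accessions that have no non-empty result yet.
# Usage: missed_accessions LIST RESULTS
#   LIST    one accession per line
#   RESULTS tab-separated loop output: accession<TAB>values...
missed_accessions() {
    local list="$1" results="$2"
    awk -F'\t' '
        # First file (RESULTS): remember accessions that returned something
        FILENAME == ARGV[1] { if ($2 != "") seen[$1] = 1; next }
        # Second file (LIST): print accessions that were never remembered
        !($1 in seen)
    ' "$results" "$list"
}
```

Something like missed_accessions SRAyk.txt results.tsv > retry.txt (file names are placeholders) gives a smaller list that can go through the same loop again, repeating until the list stops shrinking. Accessions that stay empty after several rounds may simply have no isolation_source or geo_loc_name attribute.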
There is no harm in asking the help desk.
Using an API key allows a maximum of 10 requests per second, but since you are doing a complicated search, the results are likely taking longer to come back, so sleep 3 is probably not helping much.
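If the slow responses are the issue, retrying an empty lookup with a growing pause may work better than a fixed sleep. A rough sketch, not tested against NCBI: retry_nonempty is a made-up helper, and it treats "non-zero exit status or empty output" as a failure, which is an assumption about how the pipeline behaves when it skips an accession.

```bash
# Run a command until it succeeds with non-empty output, doubling the wait
# after each failed attempt.
# Usage: retry_nonempty MAX_ATTEMPTS FIRST_DELAY_SECONDS COMMAND [ARGS...]
# FIRST_DELAY_SECONDS must be a whole number (e.g. 3, not 3s).
retry_nonempty() {
    local max="$1" delay="$2" attempt=1 out
    shift 2
    while [ "$attempt" -le "$max" ]; do
        if out=$("$@") && [ -n "$out" ]; then
            printf '%s\n' "$out"
            return 0
        fi
        if [ "$attempt" -lt "$max" ]; then
            sleep "$delay"
            delay=$((delay * 2))             # 3, 6, 12, ... seconds
        fi
        attempt=$((attempt + 1))
    done
    return 1                                 # every attempt failed or was empty
}
```

Inside the loop this would replace the direct call, e.g. location=$(retry_nonempty 4 3 lookup_one "$i"), where lookup_one is a hypothetical function wrapping the esearch|elink|esummary|xtract pipeline for a single accession.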