Why is asn2fasta so slow compared to efetch (when the sequence is not part of asn.1)?
4.9 years ago

Hi everybody,

I need to download sequences (FASTA) together with their annotation data (GFF3) from NCBI, based on their accession numbers. I've used Entrez efetch for that job: I retrieve the records as ASN.1 and convert them to FASTA and GFF3 with asn2fasta and annotwriter from the NCBI C++ Toolkit. For some RefSeq records, however, the raw sequence is not part of the ASN.1 record, so asn2fasta has to download it from some NCBI web service, and that takes ages compared to plain efetch.

For example, efetch takes 1.3 seconds to download the FASTA sequences for the two RefSeq accessions "NW_003726435.1, NW_003729148.1", while asn2fasta, with the ASN.1 records already saved in a file, takes about 40 seconds (about 37 seconds for a single sequence).

Does anybody have any idea why asn2fasta is so slow, and/or how to make it run faster?

Best regards

ncbi entrez asn2fasta

This really is a question for the NCBI help desk. It may take 2-3 business days to get an answer from them, so be patient. Come back and post the official response here when you get one.

4.9 years ago
tdmurphy ▴ 230

It probably has to do with efetch being able to resolve the sequence information locally on NCBI's servers, whereas asn2fasta is having to resolve each component sequence remotely. Not sure there's anything you can do about it.

If you need large numbers of sequences from specific genomes, you'd be better off downloading the existing GFF3 files from their FTP site. You can get the path from Assembly:

esearch -db nuccore -query 'NW_003726435.1, NW_003729148.1' | elink -target assembly -name nuccore_assembly | esummary | xtract -pattern DocumentSummary -element FtpPath_RefSeq
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1
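Once you have the FtpPath_RefSeq value, the location of the GFF3 file can be derived from it. Below is a minimal Python sketch. It assumes the usual layout of NCBI assembly directories, where the annotation file is named after the directory with a `_genomic.gff.gz` suffix; the function name `gff_url_from_ftp_path` is made up for this example, and the assumption about the file name is worth checking against the directory listing before relying on it.

```python
from urllib.parse import urlsplit


def gff_url_from_ftp_path(ftp_path: str, scheme: str = "https") -> str:
    """Build the URL of the genomic GFF3 file inside an assembly directory.

    Assumption: the file is called <directory name>_genomic.gff.gz.
    """
    parts = urlsplit(ftp_path.strip().rstrip("/"))
    # The last path component is the assembly directory name,
    # e.g. GCF_000002285.3_CanFam3.1
    basename = parts.path.rsplit("/", 1)[-1]
    return f"{scheme}://{parts.netloc}{parts.path}/{basename}_genomic.gff.gz"


ftp_path = "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/285/GCF_000002285.3_CanFam3.1"
print(gff_url_from_ftp_path(ftp_path))
```

The scheme is swapped to https by default because the same host also serves these directories over HTTPS; pass `scheme="ftp"` to keep the original protocol.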

If you're just trying to get a few sequences from each of many genomes, you can also take a look at Retrieve GFF3 file from ncbi for a way to get GFF3 for specific sequences. However, note the output won't be identical to what you would get from the FTP files. Notably, there's a risk of two features on two separate sequences winding up with the same GFF3 ID. That's more an issue for prokaryote genomes, or complicated annotations like human where the same gene may be annotated on multiple sequences (e.g. on both the X & Y in the PAR region, or on many ALT or PATCH scaffolds). For dog you're probably safe. If it's a problem, you could tack the sequence accession from column 1 onto the beginning of your IDs and that would ensure a unique ID.
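The ID fix suggested above could look roughly like this in Python. It is only a sketch: the function name `prefix_gff3_ids` and the `:` separator are choices made for this example, and it assumes that Parent attributes must be rewritten the same way as ID attributes so that parent/child links stay intact.

```python
def prefix_gff3_ids(line: str, sep: str = ":") -> str:
    """Prefix ID and Parent values with the sequence accession (column 1)."""
    # Leave comments, directives, blank lines and non-feature lines untouched.
    if line.startswith("#") or not line.strip():
        return line
    newline = "\n" if line.endswith("\n") else ""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        return line
    seqid = cols[0]
    attrs = []
    for attr in cols[8].split(";"):
        key, eq, value = attr.partition("=")
        if eq and key in ("ID", "Parent"):
            # Parent may hold several comma-separated IDs.
            value = ",".join(f"{seqid}{sep}{v}" for v in value.split(","))
        attrs.append(f"{key}{eq}{value}")
    cols[8] = ";".join(attrs)
    return "\t".join(cols) + newline


example = "NW_003726435.1\tGnomon\texon\t1\t100\t.\t+\t.\tID=exon1;Parent=rna0,rna1\n"
print(prefix_gff3_ids(example), end="")
```

Apply it to every line of each per-sequence GFF3 file before concatenating them, so that identical IDs coming from different sequences no longer collide.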


Thank you for the answer. I'm aware that I can download preprocessed GFF files from NCBI (and that is the path I'm using for the bulk of the data).

The route I described is intended as a complementary one, for accessions that were not part of the initial data fetch (and are not present locally). The problem with some ASN.1 files is that the sequence is not part of the ASN.1 record, which causes asn2fasta to query it from some (I assume NCBI) server; otherwise the performance is OK.

I use efetch regularly, and I was not impressed with asn2fasta's performance, since the job it needs to do is basically an efetch: the accession number and other identifiers are present in the ASN.1 file.

It appears that I'll need to code some check if the sequence is part of the asn record and if not, just do plain efetch.

But thanks anyway


I will second genomax's recommendation to contact NCBI's help desk. They'll at least be interested in what you're doing, and may have some suggestions on alternate approaches.

Are you getting acceptable performance from annotwriter? It also has to do some remote lookups from NCBI servers, although not the same type as asn2fasta.

> It appears that I'll need to code some check if the sequence is part of the asn record and if not, just do plain efetch.

Note most RefSeq genomic records are CON records and don't contain sequence in the ASN.1. The exceptions are: https://www.ncbi.nlm.nih.gov/nuccore/?term=refseq%5Bfilter%5D+AND+biomol_genomic%5BPROP%5D+NOT+gbdiv_con%5Bprop%5D+NOT+wgs_master%5Bprop%5D
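A rough Python sketch of such a check follows. It is a heuristic under stated assumptions, not an official test: it assumes text-format ASN.1, treats a record as self-contained only if it has a `seq-data` block and no `repr delta` instance, and sends everything else to efetch. The function names and the toy snippets are invented for this example; only the E-utilities efetch endpoint and its `db`, `id`, `rettype` and `retmode` parameters are real.

```python
import re
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def asn1_has_inline_sequence(asn1_text: str) -> bool:
    """Heuristic: residues are present and the record is not a delta (CON) record."""
    has_residues = re.search(r"\bseq-data\b", asn1_text) is not None
    is_delta = re.search(r"\brepr\s+delta\b", asn1_text) is not None
    return has_residues and not is_delta


def efetch_fasta_url(accessions) -> str:
    """URL that asks efetch for FASTA of the given nuccore accessions."""
    params = {
        "db": "nuccore",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fasta_plan(asn1_text: str, accession: str):
    """Return ("local", None) to run asn2fasta, or ("efetch", url) to fetch FASTA."""
    if asn1_has_inline_sequence(asn1_text):
        return ("local", None)
    return ("efetch", efetch_fasta_url([accession]))


# Toy snippets for illustration only, not real records.
RAW_TOY = "inst { repr raw, mol dna, length 8, seq-data ncbi2na 'C2F1'H }"
CON_TOY = "inst { repr delta, mol dna, length 100, ext delta { loc int { from 0, to 99, id gi 1 } } }"

print(fasta_plan(RAW_TOY, "NW_003726435.1"))
print(fasta_plan(CON_TOY, "NW_003726435.1"))
```

The check is deliberately conservative: a delta record built only from literal segments would also be sent to efetch, which costs one extra request but never produces a slow asn2fasta run.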


Yes, annotwriter gives me acceptable performance; I didn't even notice that it does remote lookups.

I'll contact the NCBI help desk and see if they have any insight.

Thank you
