Sometimes I need to download, let's say around 3k sequences, using for this purpose Gi list. I know how to download them, the problem lies elsewhere. This gi list is generated by another script and sometimes among gi's related to protein sequences it contains also gi's of DNA (or perhaps RNA) sequences. Since I am interested only in protein sequences - is there a way to download only protein sequences?
I am using biopython but I haven't seen in SeqIO any variable which could describe type of sequence. Another option could be filtering fasta after download. Ofc I could iterate through whole sequence and check if it contain letters different than ATGCU, but this is not so elegant, and I think it will work slow. Do you know any better solutions?
Instead of downloading the FASTA file, you could first download the XML (TinySeq) descriptions from the NCBI. The TSeq_seqtype tag contains the "sequence type" attribute.
http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeqSet>
<TSeq>
<TSeq_seqtype value="nucleotide"/><TSeq_gi>25</TSeq_gi><TSeq_accver>X53813.1</TSeq_accver><TSeq_taxid>9771</TSeq_taxid><TSeq_orgname>Balaenoptera musculus</TSeq_orgname><TSeq_defline>Blue Whale heavy satellite DNA</TSeq_defline><TSeq_length>422</TSeq_length><TSeq_sequence>TAGTTATT(...)</TSeq></TSeqSet>
you can then transform your sequences back to fasta using XSLT:
Seems to be pretty good solution. I assume that I could download it (without saving on hdd), check type, and if protein transform to fasta in the fly and save on hdd.
You could fetch XML instead of fasta (contain detailed sequence description and sequence) using Biopython, but xml is a few magnitudes of size bigger! Have a look:
[?]
So I think, it's reasonable to get fasta and write some tiny function to recognize if you deal with dna or protein, let's say by looking at first 20-30 positions.
It's going to be faster than getting 100x more data for sure;)
I don't know if that is to risky, but it should be efficient to check only first letter, since I am downloading whole sequences it should start with "M".
No problem. In fact, this is how Multiple Sequence Aligners guess whether input is dna or protein sequence (i.e. Muscle). btw: are you using Entrez.EPost? it's much faster. have a look: A: Speed Of Efetch In Biopython
ADD REPLY
• link
updated 5.4 years ago by
Ram
45k
•
written 13.6 years ago by
Leszek
4.2k
Seems to be pretty good solution. I assume that I could download it (without saving on hdd), check type, and if protein transform to fasta in the fly and save on hdd.
Thanks. :)