Question

How To Check Sequence Type While Downloading

0

Entering edit mode

13.2 years ago

Mateusz ▴ 70

Hi,

Sometimes I need to download, let's say around 3k sequences, using for this purpose Gi list. I know how to download them, the problem lies elsewhere. This gi list is generated by another script and sometimes among gi's related to protein sequences it contains also gi's of DNA (or perhaps RNA) sequences. Since I am interested only in protein sequences - is there a way to download only protein sequences?

I am using biopython but I haven't seen in SeqIO any variable which could describe type of sequence. Another option could be filtering fasta after download. Ofc I could iterate through whole sequence and check if it contain letters different than ATGCU, but this is not so elegant, and I think it will work slow. Do you know any better solutions?

Thanks in advance!

Cheers, Mateusz

sequence scripting biopython python • 2.6k views

ADD COMMENT • link updated 13.2 years ago by Leszek 4.2k • written 13.2 years ago by Mateusz ▴ 70

Ram · Answer 1 · 2011-09-06

Instead of downloading the FASTA file, you could first download the XML (TinySeq) descriptions from the NCBI. The TSeq_seqtype tag contains the "sequence type" attribute.


 http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
 <TSeqSet>
    <TSeq>
        <TSeq_seqtype value="nucleotide"/>
        <TSeq_gi>25</TSeq_gi>
        <TSeq_accver>X53813.1</TSeq_accver>
        <TSeq_taxid>9771</TSeq_taxid>
        <TSeq_orgname>Balaenoptera musculus</TSeq_orgname>
        <TSeq_defline>Blue Whale heavy satellite DNA</TSeq_defline>
        <TSeq_length>422</TSeq_length>
        <TSeq_sequence>TAGTTATT(...)
    </TSeq>
</TSeqSet>

you can then transform your sequences back to fasta using XSLT:


<xsl:stylesheet xmlns:xsl="&lt;a href=" http:="" www.w3.org="" 1999="" XSL="" Transform"="" rel="nofollow">http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <xsl:for-each select="//TSeq[TSeq_seqtype/@value='nucleotide']">
      <xsl:value-of select="concat('&gt;gi|',TSeq_gi,'|',TSeq_accver,'|',TSeq_defline)"/>
      <xsl:text>
</xsl:text>
      <xsl:value-of select="TSeq_sequence"/>
      <xsl:text>
</xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>

$ xsltproc stylesheet.xsl sequences.xml
>gi|25|X53813.1|Blue Whale heavy satellite DNA
TAGTTATTCAACCTATCCCACTCTCTAGATACCCCTTAGCACGTAAAGGAATATTATTTGGGGGTCCAGCCATGGAGAATAGTTTAGACACTAGGATGAGATAAGGAACACACCCATTCTAAAGAAATCACATTAGGATTCTCTTTTTAAGCTGTTCCTTAAAACACTAGAGTCTTAGAAATCTATTGGAGGCAGAAGCAGTCAAGGGTAGCCTAGGGTTAGGGTTAGGCTTAGGGTTAGGGTTAGGGTACGGCTTAGGGTACTGTTTCGGGGAGGGGTTCAGGTACGGCGTAGGGTATGGGTTAGGGTTAGGGTTAGGGTTAGTGTTAGGGTTAGGGCTCGGTTTAGGGTACGGGTTAGGATTAGGGTACGTGTTAGGGTTAGGGTAGGGCTTAGGGTTAGGGTACGTGTTAGGGTTAGGG

Ram · Answer 2 · 2011-09-06

3

Entering edit mode

13.2 years ago

Leszek 4.2k

You could fetch XML instead of fasta (contain detailed sequence description and sequence) using Biopython, but xml is a few magnitudes of size bigger! Have a look:

[?]

So I think, it's reasonable to get fasta and write some tiny function to recognize if you deal with dna or protein, let's say by looking at first 20-30 positions.

It's going to be faster than getting 100x more data for sure;)

ADD COMMENT • link 13.2 years ago by Leszek 4.2k

0

Entering edit mode

I don't know if that is to risky, but it should be efficient to check only first letter, since I am downloading whole sequences it should start with "M".

ADD REPLY • link 13.2 years ago by Mateusz ▴ 70

0

Entering edit mode

Checking only first letter is risky, as some proteins are post-processed and leading M is removed.

ADD REPLY • link 13.2 years ago by Leszek 4.2k

0

Entering edit mode

I've implemented check for first 20 residues (faster than playing around with XML) and it seems to work fine. Thanks!

ADD REPLY • link 13.2 years ago by Mateusz ▴ 70

0

Entering edit mode

No problem. In fact, this is how Multiple Sequence Aligners guess whether input is dna or protein sequence (i.e. Muscle). btw: are you using Entrez.EPost? it's much faster. have a look: A: Speed Of Efetch In Biopython

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 13.2 years ago by Leszek 4.2k