Format Uniref90.Xml To Database For Blast
11.2 years ago
lam ▴ 20

We want to build a formatted BLAST database from UniRef90. The article we are following used an early UniRef90 release, version 10.0, and ftp.uniprot.org only provides that version in XML format (ftp://ftp.uniprot.org/pub/databases/uniprot/previous_releases/release10.0/uniref/uniref10.0.tar.gz). How can we convert uniref90.xml to uniref90.fasta? Or can formatdb take an XML file as input? Thanks!

blast • 5.8k views
11.2 years ago
Hamish ★ 3.3k

This rather quick and dirty Perl script will convert the UniRef XML into fasta sequence format:

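#!/usr/bin/env perl
use strict;
use warnings;

my ($id, $des);        # ID and name of the entry currently being read
my $seqsection = 0;    # true while inside a <sequence> element

while (<>) {
    if (m/<entry id="(.*?)" /) {
        $id = $1;
    }
    elsif (m/<name>(.*?)<\/name>/) {
        $des = $1;
    }
    elsif (m/<sequence /) {
        # Start of the sequence: emit the FASTA header line
        print '>', $id, ' ', $des, "\n";
        $seqsection = 1;
    }
    elsif (m/<\/sequence>/) {
        $seqsection = 0;
    }
    elsif ($seqsection && m/^[A-Z]+$/) {
        # Sequence lines are copied through unchanged
        print $_;
    }
}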

You can then use the resulting fasta sequence format data to build your NCBI BLAST database.

BTW you probably want to use the current version of NCBI BLAST+ rather than the legacy NCBI BLAST software, so you will need to build your database with 'makeblastdb' rather than 'formatdb' (see http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=Download).
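
As a rough sketch of that last step, the snippet below assembles a 'makeblastdb' call for a protein FASTA file. The helper names (makeblastdb_command, build_blast_db) and the file names in the usage note are made up for illustration; only the -in, -dbtype, -out and -title options of makeblastdb are assumed, and BLAST+ has to be installed and on the PATH for the second function to do anything.

```python
import shutil
import subprocess


def makeblastdb_command(fasta_path, db_name, title=None):
    """Return the argument list for a protein makeblastdb run (hypothetical helper)."""
    cmd = ["makeblastdb", "-in", fasta_path, "-dbtype", "prot", "-out", db_name]
    if title:
        cmd += ["-title", title]
    return cmd


def build_blast_db(fasta_path, db_name, title=None):
    """Run makeblastdb, failing early if the executable cannot be found."""
    if shutil.which("makeblastdb") is None:
        raise RuntimeError("makeblastdb not found on PATH; install NCBI BLAST+ first")
    # check=True raises CalledProcessError if makeblastdb exits with an error
    subprocess.run(makeblastdb_command(fasta_path, db_name, title), check=True)
```

For example, build_blast_db("uniref90.fasta", "uniref90") would run 'makeblastdb -in uniref90.fasta -dbtype prot -out uniref90'.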

Alternatively you could contact UniProt (see http://www.uniprot.org/contact) and ask if a fasta sequence format version of UniRef90 is available for UniProt 10.0 (March 2007). They may have a copy, or be able to produce one for you.

8.6 years ago

Previous releases of UniRef in FASTA format can be made available upon request. We also have a script that can be used to generate FASTA from UniRef XML. Don't hesitate to contact the UniProt helpdesk.

8.6 years ago

Hi Shyam, yes, we are considering this. The script is brand new and was written in response to your request! Thanks for your suggestion.

11.2 years ago
lam ▴ 20

We tried Biopython's SeqIO.convert:

>>> from Bio import SeqIO
>>> count = SeqIO.convert("uniref90.xml", "uniprot-xml", "uniref90converted.fasta", "fasta")
>>> print("Converted %i records" % count)
Converted 0 records

We checked uniref90.xml:

more uniref90.xml

<UniRef90 xmlns="http://uniprot.org/uniref"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://uniprot.org/uniref http://www.uniprot.org/support/docs/uniref.xsd"
releaseDate="2007-03-06" version="10.0">
<entry id="UniRef90_Q3ASY8" updated="2007-03-06">
<name>Cluster: Parallel beta-helix repeat</name>
<property type="member count" value="1"/>
<property type="common taxon" value="Chlorobium chlorochromatii CaD3"/>
<property type="common taxon ID" value="340177"/>
<representativeMember>
<dbReference type="UniProtKB ID" id="Q3ASY8_CHLCH">
<property type="UniProtKB accession" value="Q3ASY8"/>
<property type="UniParc ID" value="UPI00005D5563"/>
<property type="UniRef100 ID" value="UniRef100_Q3ASY8"/>
<property type="UniRef50 ID" value="UniRef50_Q3ASY8"/>
<property type="protein name" value="Parallel beta-helix repeat"/>
<property type="source organism" value="Chlorobium chlorochromatii (strain CaD3)"/>
<property type="NCBI taxonomy" value="340177"/>
<property type="length" value="36805"/>
<property type="isSeed" value="true"/>
</dbReference>
<sequence length="36805" checksum="A7A8EA21B9345FF9">
MKPRFYIEQLEPRILLSGDILSELVPLLSSREASQMQSDYLLEHPEARRVAPLSAVEAAR
....

Could you help us, thanks.
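
One possible workaround, sketched here with only the Python standard library: Biopython's "uniprot-xml" format is probably looking for UniProtKB entries rather than UniRef clusters, which would explain the zero count. The sketch assumes the layout shown in the excerpt above, i.e. each <entry> has an id attribute, a <name> child, and a <sequence> inside <representativeMember>, all in the http://uniprot.org/uniref namespace. The function name uniref_xml_to_fasta and its width parameter are illustrative and not part of Biopython or any UniProt tool.

```python
import xml.etree.ElementTree as ET

# Namespace copied from the xmlns attribute in the file header (assumed constant)
NS = "{http://uniprot.org/uniref}"


def uniref_xml_to_fasta(xml_source, fasta_handle, width=60):
    """Stream UniRef XML from a path or binary file object, write FASTA, return the record count."""
    count = 0
    context = ET.iterparse(xml_source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or elem.tag != NS + "entry":
            continue
        name = elem.findtext(NS + "name", default="").strip()
        seq = elem.findtext(f"{NS}representativeMember/{NS}sequence", default="")
        seq = "".join(seq.split())  # drop line breaks and other whitespace
        if seq:
            fasta_handle.write(f">{elem.get('id')} {name}\n")
            for i in range(0, len(seq), width):
                fasta_handle.write(seq[i:i + width] + "\n")
            count += 1
        root.clear()  # release finished entries so memory stays flat on large files
    return count
```

Calling it with the path "uniref90.xml" and an output file opened for writing should produce one FASTA record per cluster, using the representative member's sequence.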

8.6 years ago
Shyam ▴ 150

Hi, were you able to extract the FASTA sequences from the XML file? If yes, could you please share how you did it? Thank you.


UniRef can be downloaded in FASTA format, so there is no need to convert the XML file into FASTA: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/


But for the previous releases, only the XML file is available. I wrote to UniProt tech support to ask whether they have a way of doing the conversion, and I am waiting for their reply.

8.6 years ago
Shyam ▴ 150

Thank you Elisabeth for your help in getting the fasta files for the previous releases.

8.6 years ago
Shyam ▴ 150

@Elisabeth One more thing: could the help desk post the script somewhere so that anyone who needs it can use it? If I understood correctly, the UniProt XML schema is different from the regular XML schema.


In case anyone is looking for this UniRef XML parser, it can be found at https://proteininformationresource.org/download/uniref/xml2fasta/
