Question

Sequence Database Without Splice Variants

0

11.3 years ago

Pappu ★ 2.1k

I am looking for a sequence database that contains neither fragments nor shorter versions of the same protein (splice variants with >95% identity). I also want the FASTA database to include the NCBI taxid of each species. Please let me know if you have any suggestions for building it from TrEMBL or nr. Thanks.

database • 1.9k views

ADD COMMENT • link updated 11.3 years ago by Hamish ★ 3.3k • written 11.3 years ago by Pappu ★ 2.1k

score 1 · Answer 1 · 2013-07-23

1

11.3 years ago

Hamish ★ 3.3k

That sounds like you want something like the UniProt Reference Clusters (UniRef) databases. See http://www.uniprot.org/help/uniref.

The UniRef databases are derived using CD-HIT to merge splice variant (isoform) and fragment sequences at three different levels of identity:

UniRef100: 100% identity
UniRef90: 90% identity
UniRef50: 50% identity

For downloads of all the UniProt databases, including the UniRef databases, see http://www.uniprot.org/downloads

ADD COMMENT • link 11.3 years ago by Hamish ★ 3.3k

0

I actually downloaded UniRef90. It still contains entries that are labelled as fragments in UniProt.

ADD REPLY • link 11.3 years ago by Pappu ★ 2.1k

0

In cases where no full-length sequence shares the threshold level of identity for the clustering, you will get clusters of fragments. Since these fragments are distinct from the available full-length sequences they are informative, and depending on your requirements you will likely want to keep them. Otherwise, since they will always have a description containing the "(fragment)" keyword, you can filter them out, either by processing the downloaded data set or by using a query on UniProt.org to get only the non-fragment clusters. For example:

http://www.uniprot.org/uniref/?query=NOT+name%3A%22%28fragment%29%22+AND+identity%3A0.9
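
For the "processing the downloaded data" route, a minimal Python sketch might look like the following. It assumes the UniRef FASTA header layout `>ID ClusterName n=... Tax=... TaxID=... RepID=...` and matches the "(fragment)" keyword case-insensitively; check both assumptions against your own download. The function names and the demo records are made up for illustration.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Assumption: UniRef FASTA headers carry the NCBI taxid as "TaxID=<digits>".
TAXID_RE = re.compile(r"\bTaxID=(\d+)")


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def is_fragment(header: str) -> bool:
    """True if the description contains the "(fragment)" keyword (any case)."""
    return "(fragment)" in header.lower()


def taxid_of(header: str) -> Optional[str]:
    """Return the NCBI taxid from the header, or None if it is absent."""
    match = TAXID_RE.search(header)
    return match.group(1) if match else None


def drop_fragments(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield only the records whose header is not marked as a fragment."""
    for header, seq in read_fasta(lines):
        if not is_fragment(header):
            yield header, seq


if __name__ == "__main__":
    # Hypothetical records, not real UniRef entries.
    demo = [
        ">UniRef90_EXAMPLE1 Example protein n=2 Tax=Homo sapiens TaxID=9606 RepID=EXAMPLE1_HUMAN",
        "MKTAYIAKQR",
        "QISFVKSHFS",
        ">UniRef90_EXAMPLE2 Example protein (Fragment) n=1 Tax=Mus musculus TaxID=10090 RepID=EXAMPLE2_MOUSE",
        "MKTAYIA",
    ]
    for header, seq in drop_fragments(demo):
        print(taxid_of(header), header.split()[0], len(seq))
```

For a real download you would pass an open file handle instead of the `demo` list, for example one from `gzip.open(path, "rt")` for a compressed file, since the reader only needs an iterable of text lines.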

ADD REPLY • link 11.3 years ago by Hamish ★ 3.3k