I am looking for a sequence database that contains neither fragments nor shorter versions of the same protein (splice variants with >95% identity). I also want the FASTA database to include the NCBI taxid of each species. Please let me know if you have any suggestions for building it from TrEMBL or nr. Thanks.
I actually downloaded UniRef90. It still contains entries that are labelled as fragments in UniProt.
In cases where no full-length sequence meets the identity threshold used for the clustering, you will get clusters consisting only of fragments. Since these fragments are distinct from the available full-length sequences, they are informative, and depending on your requirements you will likely want to keep them. Otherwise, since they will always have a description containing the "(fragment)" keyword, you can filter them out of the data set, either by processing the downloaded file or by using a query on UniProt.org to get only the non-fragment clusters. For example:
http://www.uniprot.org/uniref/?query=NOT+name%3A%22%28fragment%29%22+AND+identity%3A0.9
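If you would rather filter the file you already downloaded, something along these lines might work. It is only a rough sketch: it assumes the UniRef FASTA header layout (`>UniRef90_ID Cluster name n=... Tax=... TaxID=... RepID=...`), matches the "(fragment)" keyword case-insensitively, and pulls the NCBI taxid from the `TaxID=` field. The function names `read_fasta` and `filter_non_fragments` are my own, not part of any library.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Assumed header fields; adjust if your file uses a different layout.
TAXID_RE = re.compile(r"\bTaxID=(\d+)")
FRAGMENT_RE = re.compile(r"\(fragments?\)", re.IGNORECASE)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_non_fragments(
    lines: Iterable[str],
) -> Iterator[Tuple[str, Optional[str], str]]:
    """Yield (header, taxid, sequence) for records not flagged as fragments."""
    for header, seq in read_fasta(lines):
        if FRAGMENT_RE.search(header):
            continue  # skip clusters whose name carries the fragment keyword
        match = TAXID_RE.search(header)
        taxid = match.group(1) if match else None
        yield header, taxid, seq
```

To use it, open the uncompressed file (the name `uniref90.fasta` is just a placeholder) and pass the file handle to `filter_non_fragments`, then write out each record with the taxid wherever you need it in the header. Because the records are streamed one at a time, the whole database never has to fit in memory.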