The entry name in UniProtKB/Swiss-Prot is composed of two parts, which indicate the gene symbol and the species. The first part, the gene mnemonic, is not guaranteed to always refer to the same gene, or to be the same for all instances of a gene, so I am not sure why you would want to cluster on it.
For what it is worth, you might find this easier if you use the fasta sequence format files provided by UniProt (see http://www.uniprot.org/downloads) instead of the NCBI nr version, since these use a cleaner version of the fasta header, which makes it easier to extract the gene symbol using something like:
zcat uniprot_sprot.fasta.gz | perl -ne 'print $1, "\t", $2, "\t", $3, "\n" if(m/^>\S+\|(\w+)\|(\w+)_\w+\s+.*? GN=([^ ]+)/);'
If you don't actually need to keep the FASTA file, you could do this by streaming the data from the UniProt FTP site:
wget -q -O - ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz | zcat | perl -ne 'print $1, "\t", $2, "\t", $3, "\n" if(m/^>\S+\|(\w+)\|(\w+)_\w+\s+.*? GN=([^ ]+)/);'
This gives a three-column tab-delimited table containing the UniProtKB accession, the gene mnemonic from the entry name, and the gene symbol, for example:
Q6GZX4 001R FV3-001R
Q6GZX3 002L FV3-002L
Q197F8 002R IIV3-002R
Q197F7 003L IIV3-003L
Q6GZX2 003R FV3-003R
Q6GZX1 004R FV3-004R
Q197F5 005L IIV3-005L
Q6GZX0 005R FV3-005R
Q91G88 006L IIV6-006L
Q6GZW9 006R FV3-006R
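For anyone who prefers Python to a Perl one-liner, the same extraction can be sketched roughly as follows. This assumes UniProt-style headers of the form ">db|ACCESSION|NAME_SPECIES description ... GN=symbol ...". The function name parse_header is made up for illustration, and the sample header was written to match the first row of the table above (its description text is illustrative, not copied from the file).

```python
import re

# Same idea as the Perl pattern: capture the accession, the mnemonic part of
# the entry name (before the underscore), and the value of the GN= field.
HEADER_RE = re.compile(r"^>\S+\|(\w+)\|(\w+)_\w+\s+.*? GN=(\S+)")


def parse_header(line):
    """Return (accession, mnemonic, gene_symbol), or None if the line does not match."""
    match = HEADER_RE.match(line)
    return match.groups() if match else None


# Illustrative header (assumption: description text is made up for this sketch).
example = (
    ">sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R "
    "OS=Frog virus 3 (isolate Goorha) OX=654924 GN=FV3-001R PE=4 SV=1"
)

fields = parse_header(example)
if fields is not None:
    print("\t".join(fields))
```

Headers without a GN= field simply return None, which mirrors the Perl version skipping lines that do not match.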
In any case, from your description it sounds like Pfam or UniRef is what you are looking for, since these already incorporate clustering and are not limited by the peculiarities of the UniProtKB entry names.
Why is the annotation format not straightforward?
Also, clustering based on the accession number seems a bit odd. To my knowledge, accession numbers are assigned more or less arbitrarily, depending on when the entry was added to the database. I might be wrong, but I have never seen a claim to the contrary, nor could I find any documentation describing this.
(By Swissprot I assume you're referring to the UniProt database)
Yes, but the Swiss-Prot part, not the TrEMBL part.
A bit more explanation. The Swiss-Prot annotation has an accession number, a short code, and a longer code:
In this example the short code is C75A3, which is directly related to the protein's biochemical function, so it is a fairly good feature for a preliminary classification. The problem is that the amount of data before the short code differs between entries; otherwise I would simply paste the FASTA file into a spreadsheet using the pipe as a separator, then sort, copy and download one by one (still work, but feasible).
I see. By short code I thought you were referring to the non-human-readable accession number.
I would just match that with a regex. As I recall, the "C75A3_PETHY" part of the annotation always comes right after the last pipe.
Doing it in Python would be something along the lines of
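A minimal sketch, assuming each header is a single line and the entry name (e.g. C75A3_PETHY) is the first token after the last pipe. The function names and the sample headers are made up for illustration, with placeholder IDs rather than real accessions:

```python
from collections import defaultdict


def short_code(header):
    """Return the part of the entry name before the underscore, e.g. 'C75A3'."""
    # Everything after the last pipe, split on whitespace.
    tokens = header.rsplit("|", 1)[-1].split()
    if not tokens:
        return None  # nothing follows the last pipe
    # The first token is the entry name; drop the species suffix ('_PETHY').
    return tokens[0].split("_", 1)[0]


def group_by_short_code(headers):
    """Map each short code to the list of headers that carry it."""
    groups = defaultdict(list)
    for header in headers:
        groups[short_code(header)].append(header)
    return dict(groups)


# Made-up headers with differing amounts of data before the last pipe.
headers = [
    ">gi|0000|sp|X00000|C75A3_PETHY made-up description",
    ">sp|X00000|C75A3_PETHY made-up description",
    ">sp|X00001|ABCD1_XXXXX another made-up description",
]

for code, members in sorted(group_by_short_code(headers).items()):
    print(code, len(members))
```

Because only the text after the last pipe is used, it does not matter how many fields precede the entry name.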
Great clue, the last pipe. I should be able to script this in Perl. Thanks!