Searching and filtering uniclust databases
5.6 years ago
max_19 ▴ 170

Hi there,

Does anyone have experience with searching or filtering the uniclust databases (http://gwdu111.gwdg.de/~compbiol/uniclust/2018_08/)?

For example, how can I search for a particular organism, or filter for only eukaryotes, in the uniclust30 database?

I tried doing this with the supplied mapping file (uniclust_uniprot_mapping.tsv.gz), which has a UniProt accession and a uniclust ID for each protein. However, I'm not sure how to use that ID to search the actual uniclust database or to filter for particular organisms.

Thanks for your help and ideas!

uniclust protein databases • 2.0k views
5.6 years ago
AK ★ 2.2k

Hi max_19,

I think you can first get the list of UniProt IDs for the organisms you are interested in, and then search for those IDs in the headers of uniclust30_2018_08/uniclust30_2018_08_consensus.fasta. A header looks like this (the Members field contains the information you need):

uc30-1808-83688326|Representative=A0A0D6LSX8 n=28 Descriptions=[Uncharacterized protein|Twk-43 (Fragment)|TWiK family of potassium channels protein 9|Twk-9|Protein CBR-TWK-9|Ion channel] Members=A0A2G5TZA1,A0A2A6BWN3,E3N9Z5,H3EBY7,A0A061AD18,A0A182E8X8,A0A2A2JAF3,A0A0B2UVL7,A0A2A6CBY6,A0A016U7K0,A8Y2T1,A0A1I8AN73,A0A2A2LWD0,A0A0D6LSX8,A0A0C2GMZ3,A0A1I8AAQ9,E3N9Z7,A0A0C2CU15,A0A2P4W1B0,A0A016U896,A0A2P4W1B3,A0A0B1TTC8,A0A016U8H3,H3F3P7,Q23435,A0A2K6W7A5,A0A2H2IN74,A0A0R3S4C4
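To make the header layout concrete, here is a minimal sketch that splits one header into its cluster ID and its member accessions using plain shell parameter expansion. The header is a shortened copy of the example above, and the variable and file names (header, cluster_id, toy_members.txt) are made up for illustration.

```shell
# Shortened copy of the example header (illustration only)
header='uc30-1808-83688326|Representative=A0A0D6LSX8 n=28 Descriptions=[Uncharacterized protein|Ion channel] Members=A0A2G5TZA1,A0A2A6BWN3,E3N9Z5'

# Cluster ID: everything before the first "|"
cluster_id="${header%%|*}"
printf 'cluster: %s\n' "$cluster_id"

# Member accessions: everything after "Members=", written one per line
printf '%s\n' "${header##*Members=}" | tr ',' '\n' > toy_members.txt
cat toy_members.txt
```

The one-accession-per-line output has the same shape as an ID list that grep -f can read.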

For instance:

# From https://www.uniprot.org/taxonomy/2759 we know that the "Taxon identifier" is 2759 for Eukaryota
# Here we take the first 10 as an example
curl -s "https://www.uniprot.org/uniprot/?query=taxonomy:2759&format=tab&columns=id" \
  | grep -v '^Entry' \
  | head \
  > eukaryota_head10.txt

# Get the whole list of IDs from uniclust30
seqkit fx2tab --name uniclust30_2018_08/uniclust30_2018_08_consensus.fasta \
  > uniclust30_2018_08_consensus_IDs.txt

# Search for exact whole-word matches of the desired IDs (here the IDs from Eukaryota) and extract the matching cluster IDs
# -F treats each ID as a fixed string instead of a regular expression
grep -w -F -f eukaryota_head10.txt uniclust30_2018_08_consensus_IDs.txt \
  | cut -d" " -f1 \
  | sort -u \
  > uniclust30_2018_08_consensus_IDs_eukaryota_head10.txt

# Subset uniclust30 using the list
seqkit grep --delete-matched -f uniclust30_2018_08_consensus_IDs_eukaryota_head10.txt uniclust30_2018_08/uniclust30_2018_08_consensus.fasta \
  > uniclust30_2018_08_consensus_eukaryota_head10.fasta
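If seqkit is not available, the same idea can probably be done in a single awk pass. The sketch below is an alternative, not the approach above: it loads the accession list into a lookup table and keeps every record whose Members field contains at least one listed accession. The file names and the two toy records are hypothetical, and it assumes every header ends with a Members= field, as in the example header.

```shell
# Toy accession list (hypothetical, stands in for eukaryota_head10.txt)
cat > toy_ids.txt <<'EOF'
E3N9Z5
Q23435
EOF

# Toy consensus FASTA with made-up cluster IDs and sequences
cat > toy_consensus.fasta <<'EOF'
>uc30-0000-1|Representative=A0A0D6LSX8 n=2 Descriptions=[Ion channel] Members=A0A0D6LSX8,E3N9Z5
MKTAYIAK
QRQISFVK
>uc30-0000-2|Representative=P12345 n=1 Descriptions=[Other protein] Members=P12345
MSSHEGGK
EOF

awk '
  NR == FNR { wanted[$1] = 1; next }     # first file: load the accession list
  /^>/ {                                 # header line: decide whether to keep the record
    keep = 0
    members = $0
    sub(/^.*Members=/, "", members)      # drop everything up to the member list
    n = split(members, acc, ",")
    for (i = 1; i <= n; i++)
      if (acc[i] in wanted) { keep = 1; break }
  }
  keep                                   # print header and sequence lines of kept records
' toy_ids.txt toy_consensus.fasta > toy_subset.fasta

cat toy_subset.fasta
```

Because the lookup is a hash table, this should scale better than grep -f when the ID list is very large, at the cost of relying on the header layout staying the same.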

Hope it helps.


Thanks for the helpful information! The uniclust download that I am using does not contain the uniclust30_2018_08_consensus.fasta file. I downloaded uniclust30_2018_08_hhsuite.tar.gz instead, because I will eventually be using the database with HHsuite.

Here are the files that I have after extracting the database:

uniclust30_2018_08_a3m_db         uniclust30_2018_08_cs219.ffdata   uniclust30_2018_08_hhm.ffdata
uniclust30_2018_08_a3m_db.index   uniclust30_2018_08_cs219.ffindex  uniclust30_2018_08_hhm.ffindex
uniclust30_2018_08_a3m.ffdata     uniclust30_2018_08.cs219.sizes    uniclust30_2018_08_md5sum
uniclust30_2018_08_a3m.ffindex    uniclust30_2018_08_hhm_db         
uniclust30_2018_08.cs219          uniclust30_2018_08_hhm_db.index

Do you know which file here is the equivalent, or which file I can use to subset?

Thanks!
