Hi max_19,
I think you can first get the list of IDs for the organisms you are interested in, and then search for those IDs in the headers of uniclust30_2018_08/uniclust30_2018_08_consensus.fasta.
A header looks like this (the Members field contains the information you need here):
uc30-1808-83688326|Representative=A0A0D6LSX8 n=28 Descriptions=[Uncharacterized protein|Twk-43 (Fragment)|TWiK family of potassium channels protein 9|Twk-9|Protein CBR-TWK-9|Ion channel] Members=A0A2G5TZA1,A0A2A6BWN3,E3N9Z5,H3EBY7,A0A061AD18,A0A182E8X8,A0A2A2JAF3,A0A0B2UVL7,A0A2A6CBY6,A0A016U7K0,A8Y2T1,A0A1I8AN73,A0A2A2LWD0,A0A0D6LSX8,A0A0C2GMZ3,A0A1I8AAQ9,E3N9Z7,A0A0C2CU15,A0A2P4W1B0,A0A016U896,A0A2P4W1B3,A0A0B1TTC8,A0A016U8H3,H3F3P7,Q23435,A0A2K6W7A5,A0A2H2IN74,A0A0R3S4C4
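To make the role of the Members field concrete, here is a minimal sketch that pulls the member accessions out of a single header. The header below is a shortened, illustrative copy of the one above (only three members are kept), and the variable names are hypothetical. It assumes that Members= is the last field of the header, as in the example.

```shell
# Shortened, illustrative header (assumption: Members= is the last field)
header='uc30-1808-83688326|Representative=A0A0D6LSX8 n=28 Descriptions=[Uncharacterized protein|Ion channel] Members=A0A2G5TZA1,A0A2A6BWN3,E3N9Z5'

# Drop everything up to and including "Members=", then print one accession per line
members=$(printf '%s\n' "$header" | sed 's/.*Members=//' | tr ',' '\n')
printf '%s\n' "$members"
```

Each line of the output is one UniProt accession, and these are the values that the ID list from UniProt is matched against.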
For instance:
# From https://www.uniprot.org/taxonomy/2759 we know that the "Taxon identifier" is 2759 for Eukaryota
# Here we take the first 10 as an example
curl -s "https://www.uniprot.org/uniprot/?query=taxonomy:2759&format=tab&columns=id" \
| grep -v '^Entry' \
| head \
> eukaryota_head10.txt
# Get the whole list of IDs from uniclust30
seqkit fx2tab --name uniclust30_2018_08/uniclust30_2018_08_consensus.fasta \
> uniclust30_2018_08_consensus_IDs.txt
# Search for the exact match of the desired IDs (here the IDs from Eukaryota) and extract the matches
grep -w -f eukaryota_head10.txt uniclust30_2018_08_consensus_IDs.txt \
| cut -d" " -f1 \
| sort -u \
> uniclust30_2018_08_consensus_IDs_eukaryota_head10.txt
# Subset uniclust30 using the list
seqkit grep --delete-matched -f uniclust30_2018_08_consensus_IDs_eukaryota_head10.txt uniclust30_2018_08/uniclust30_2018_08_consensus.fasta \
> uniclust30_2018_08_consensus_eukaryota_head10.fasta
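If it is unclear what the grep -w -f step does, the small sketch below runs the same idea on toy data. The accessions and headers are made up for illustration and are not real uniclust30 entries. The -F flag is an optional addition of mine (it treats the IDs as fixed strings rather than regular expressions) and is not part of the commands above.

```shell
# Toy data (hypothetical accessions and headers, for illustration only)
tmpdir=$(mktemp -d)
printf 'P12345\n' > "$tmpdir/ids.txt"
cat > "$tmpdir/headers.txt" <<'EOF'
uc30-0001|Representative=P12345 n=2 Members=P12345,Q99999
uc30-0002|Representative=A0A0P12345 n=1 Members=A0A0P12345
uc30-0003|Representative=Q11111 n=2 Members=Q11111,P12345
EOF

# -F: fixed strings, -w: whole-word matches only, -f: read the patterns from a file
# cut keeps the part of the header before the first space, which is the sequence ID
matched=$(grep -F -w -f "$tmpdir/ids.txt" "$tmpdir/headers.txt" | cut -d" " -f1 | sort -u)
printf '%s\n' "$matched"

rm -r "$tmpdir"
```

The second toy header is skipped because P12345 only occurs there as part of a longer accession, which is what -w protects against. The third one is kept even though its representative is a different entry, so a cluster is selected as soon as any one of its members is in the ID list.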
Hope it helps.
Thanks for the helpful information! The uniclust download that I am using does not contain the uniclust30_2018_08_consensus.fasta file. I downloaded uniclust30_2018_08_hhsuite.tar.gz instead, because I eventually want to use the database with HHsuite.
Here are the files that I have when I extract the database
Do you know which of these files is the equivalent one, or which file I can use to subset?
Thanks!