Hello,
I'm a graduate student studying bioinformatics.
I have a question about using the CAZy database downloaded from the dbCAN website.
When I annotate my protein sequences with the CAZy database, some annotation results show multiple protein families, like 'GT4|GT97'. So I checked the database downloaded from the dbCAN website, and some amino acid sequences there are labelled with multiple families, like the one below:
>AIZ26250.1|GT4|GT97
MAVIIFVNGIRAVNGLVKSSINTANAFAEEGLDVHLINFVGNITGAEHLYPPFHLHPNVKTSSIIDLFNDIPENVSCRNTPFYSIHQQFFKAEYSAHYKHVLMKIESLLSAEDSIIFTHPLQLEMYRLANNDIKSKAKLIVQIHGNYMEEIHNYEILARNIDYVDYLQTVSDEMLEEMHSHFKIKKDKLVFIPNITYPISLEKKEADFFIKDNEDIDNAQKFKRISIVGSIQPRKNQLDAIKIINKIKNENYILQIYGKSINKDYFELIKKYIKDNKLQNRILFKGESSEQEIYENTDILIMTSESEGFPYIFMEGMVYDIPIVVYDFKYGANDYSNYNENGCVFKTGDISGMAKKIIELLNNPEKYKELVQYNHNRFLKEYAKDVVMAKYFTILPRSFNNVSLSSAFSRKELDEFQNITFSIEDSNDLAHIWNFELTNPAQNMNFFALVGKRKFPMDAHIQGTQCTIKIAHKKTGNLLSLLLKKRNQLNLSRGYTLIAEDNSYEKYIGAISNKGNFEIIANKKSSLVTINKSTLELHEIPHELHQNKLLIALPNMQTPLKITDDNLIPIQASIKLEKIGNTYYPCFLPSGIFNNICLDYGEESKIINFSKYSYKYIYDSIRHIEQHTDISDIIVCNVYSWELIRASVIESLMEFTGKWEKHFQTSPKIDYRFDHEGKRSMDDVFSEETFIMEFPRKNGIDKKTAAFQNIPNSIVMEYPQTNGYSMRSHSLKSNVVAAKHFLEKLNKIKVDIKFKKHDLANIKKMNRIIYEHLGININIEAFLKPRLEKFKREEKYFHDFFKRNNFKEVIFPSTYWNPGIICAAHKQGIKVSDIQYAAITPYHPAYFKSPKSHYVADKLFLWSEYWNHELLPNPTREIGSGAAYWYALDDVRFSEKLNYDYIFLSQSRISSRLLSFAIEFALKNPQLQLLFSKHPDENIDLKNRIIPDNLIISTESSIQGINESRVAVGVYSTSLFEALACGKQTFVVKYPGYEIMSNEIDSGLFFAVETPEEMLEKTSPNWVAVADIENQFFGQEK
The database also contains sequences labelled with only a single protein family, like:
>AGU84174.1|GT4
MRICLVLEGSYPYVHGGVSTWMHQYITEMKEHEFIIWVIGANEEKKGAFVYEFPENVVEVHEVFLDSLGSSKIIEKKSEELSREEYDALKQLVFCAKPDWSLIFDLLQEGKIQRDDFLVSEAFFQMIQDLCEEKYAAQPMSDVFHTIRSILFPLLMLLTSEIPIADAYHAICTGYGGILATLASYRMGKPLLLTEHGIYTREREEEILRADWILPSMRKQWIDFFYMLSDAIYSKADCITSLFSKARETQIEIGCEPNKCRVISNGIDYESFSKIPFEKDDDSWINIGAAVRMAPIKDIKTMIYAFYEVSAQIPNVRLYIMGGVDDKAYAEECYALARKLKLENLIFTGRVDIKEYLRKMDFMILTSISEGQPLSILESMAAGKPCVTTDVGCCKELLEGREDDELGVAGYCVPPTDLMSLAHAMIVMARSEEKRLKMGQIAKKRSEQFYQYHQMIEQYRQLYKEYVR
So, here is my question:
When I do further analysis combining the annotation results with an abundance table, do I have to merge the abundances, or should I treat each annotation result (e.g. 'GT4' and 'GT4|GT97') as a separate category?
Thank you for reading my question...
Gyudae LEE
Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. If your code has long lines with a single command, break those lines into multiple lines with proper escape sequences so they're easier to read and still run when copy-pasted. I've done it for you this time.
Hi, have you solved this problem? I ran into the same issue after finishing the BLAST search and merging the results. When I wanted to calculate gene abundance in the subsequent analysis, I wasn't sure how to classify or summarise a protein sequence that belongs to multiple enzyme families. Could you help me with it, please?
Thanks so much.
One approach is to count the abundance of domains, so that a protein annotated with several families contributes to each of them. Suppose there are 10 CAZymes classified as GT4, and 10 classified as GT4+GT97. Then the abundance of GT4 is 20, and the abundance of GT97 is 10.
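A minimal Python sketch of this domain-counting idea, in case it helps. It assumes the annotation labels look like `GT4|GT97` (families separated by `|`) and that each annotated protein comes with an abundance value. The function name `family_abundance` and the input layout are hypothetical, not part of dbCAN.

```python
from collections import defaultdict

def family_abundance(annotations):
    """Sum abundance per CAZy family (hypothetical helper, not a dbCAN function).

    annotations: iterable of (label, abundance) pairs, where label is a
    '|'-separated list of families such as 'GT4|GT97'.
    Assumption: every family listed in a label receives the protein's full abundance.
    """
    totals = defaultdict(int)
    for label, abundance in annotations:
        for family in label.split("|"):
            totals[family] += abundance
    return dict(totals)

# The example from the answer: 10 proteins labelled GT4 and 10 labelled GT4|GT97,
# each with an abundance of 1.
example = [("GT4", 1)] * 10 + [("GT4|GT97", 1)] * 10
counts = family_abundance(example)
print(counts)
```

With real data you would replace `example` with the labels and abundances from your own table. Note that a label listing the same family twice would be counted twice here; whether that is what you want depends on whether you are counting domains or proteins.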
Thanks so much for your kind help and time! I will give it a try.