I have a proteomes .faa file with all the protein sequences encoded by LUCA. I want to create a file with the proteomes of only a few specific eukaryotes, a few specific archaea, and all bacteria. How do I do this?
I have tried downloading the taxdmp.zip file from the NCBI database, but the nodes.dmp and names.dmp files do not make sense to me. I am an undergrad and would appreciate any help.
Suppose this is the first entry in my proteomes file.
1000565.METUNv1_00006
MFSYVSLEQRVPKDHPLRSLRALVDGILANMSALFDERYSHTG
Here 1000565 is the tax id of the organism that has the gene METUNv1_00006, and the line below it is the amino acid sequence of the encoded protein.
In the nodes.dmp file, the first column is the tax id of my organism and the fifth column is the division id,
but there is no archaea entry in division.dmp.
As I understand it, the names.dmp file is only needed to look up the tax ids of the specific eukaryotes and archaea I need.
I don't understand how to use this information to sample the proteomes of LUCA so that I end up with only the subset of proteomes I actually need. Do I simply write Python code that will do it for me? Or is there another tool for taxonomically sampling a huge proteomes file?
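Because there is no separate archaea entry among the divisions, one option is to ignore the division column and follow the parent column (second column) of nodes.dmp upward until one of the three domain tax ids is reached: 2 for Bacteria, 2157 for Archaea, 2759 for Eukaryota. The following is only a minimal sketch of that idea; the function names `load_parents` and `domain_of` are made up for illustration, and it assumes the usual `|`-separated layout of nodes.dmp.

```python
# Tax ids of the three domains in the NCBI taxonomy.
DOMAINS = {2: "Bacteria", 2157: "Archaea", 2759: "Eukaryota"}


def load_parents(lines):
    """Map each tax id to its parent tax id from nodes.dmp-style lines.

    Assumes fields are separated by "|" with tabs around them, with the
    tax id in column 1 and the parent tax id in column 2.
    """
    parents = {}
    for line in lines:
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 2 or not fields[0]:
            continue  # skip blank or malformed lines
        parents[int(fields[0])] = int(fields[1])
    return parents


def domain_of(tax_id, parents):
    """Walk up the tree from tax_id; return the domain name or None."""
    seen = set()
    while tax_id not in seen:
        if tax_id in DOMAINS:
            return DOMAINS[tax_id]
        seen.add(tax_id)
        if tax_id not in parents:
            return None  # tax id is missing from the table
        tax_id = parents[tax_id]  # the root (1) points to itself
    return None  # reached the root without meeting a domain
```

With the real file this would be used roughly as `parents = load_parents(open("nodes.dmp"))`, after which `domain_of(int(tax_id), parents) == "Bacteria"` tells you whether a given tax id from a header belongs to a bacterium.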
What do you mean by not making sense?
What sort of protein ids/accession do you have?
I have added details to my question. I hope it makes more sense now. Sorry for the late response; I did not have internet access.
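Given identifiers of the form `1000565.METUNv1_00006`, where the part before the first dot is the tax id, plain Python may be enough for the filtering step itself. Below is a minimal sketch, not a tested pipeline. It assumes the file is standard FASTA with each header line starting with `>`, and that you already have the set of tax ids to keep (as strings). The function names `filter_fasta` and `write_subset` are made up for illustration.

```python
def filter_fasta(lines, keep_tax_ids):
    """Yield the lines of every record whose tax id is in keep_tax_ids.

    Assumes each record starts with a ">" header whose identifier looks
    like "<tax id>.<gene id>"; keep_tax_ids is a set of strings.
    """
    keep = False
    for line in lines:
        if line.startswith(">"):
            # The tax id is everything between ">" and the first dot.
            tax_id = line[1:].split(".", 1)[0].strip()
            keep = tax_id in keep_tax_ids
        if keep:
            yield line


def write_subset(in_path, out_path, keep_tax_ids):
    """Stream the input file once, so a huge file never sits in memory."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(filter_fasta(src, keep_tax_ids))
```

A call would look like `write_subset("proteomes.faa", "subset.faa", wanted)`, where both file names are placeholders and `wanted` is your set of tax ids. Reading line by line keeps memory use low even for a very large proteomes file.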