Dear all,
I've used the OMA program to find a set of orthologous proteins in a bacterial taxon. Unfortunately, this very nice program does not give a set of unique proteins.
I know that the OrthoMCL and OrthoDB tools do. I was not very successful in finding these proteins with them. Are there any other tools to find the unique genes in a bacterial taxon?
I need all proteins that do not have any orthologs in this taxon, that is, only unique proteins or singletons. Would you be so kind as to give me some hints?
Thank you very much!
Natasha
How did you run OrthoMCL? What do you mean by "I was not very successful"? Did you get any results, or did you not even manage to get it running? What kind of data do you have: predicted peptides from genomes, proteins downloaded from nr, etc.?
My experience with OrthoMCL is that it will output all clusters it finds, including clusters with only one gene and one taxon.
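To make the point about single-gene clusters concrete, here is a minimal Python sketch that picks out clusters with exactly one member. It assumes a groups file with one cluster per line in the form `group_id: taxon|protein taxon|protein ...`; the function name and the example IDs are hypothetical. Proteins that never entered any cluster may not appear in such a file at all, so this covers only part of the picture.

```python
def single_member_groups(lines):
    """Return {group_id: member} for clusters that contain exactly one protein.

    Assumed line format (an assumption, check your own output):
        group_id: taxon|protein taxon|protein ...
    """
    singles = {}
    for line in lines:
        line = line.strip()
        if not line or ":" not in line:
            continue  # skip blank or malformed lines
        group_id, _, rest = line.partition(":")
        members = rest.split()
        if len(members) == 1:
            singles[group_id.strip()] = members[0]
    return singles


# Hypothetical example data, not real OrthoMCL output.
example = [
    "grp1000: ecoA|P1 ecoB|P7 ecoC|P3",
    "grp1001: ecoA|P2",
    "grp1002: ecoB|P9 ecoB|P10",
]
print(single_member_groups(example))
```

With a real file, the same function can be fed an open file handle instead of the example list.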
Sorry, I was not quite correct. I've got a list of singletons from OrthoMCL. I would like to make sure this list of proteins is correct and complete, and that other programs give the same set. I took all proteins for this bacterial taxon from NCBI, using the .faa datasets for each bacterium in this taxon. I discarded only proteins shorter than 50 aa without annotation. I kept all the proteins longer than 100 aa without looking at their annotation.
If your datasets are the complete translated gene sets, then the OrthoMCL singletons are a good place to start. You could blast the singletons against the other genomes to see if they are there and were just not found.
If your data include translated transcriptomes, I do not know, because transcriptomes often lack genes that are not expressed in a particular tissue or developmental stage, or that were not sampled due to low expression.
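The BLAST check suggested above can be scripted once the hits are in tabular form. The following is a minimal sketch, assuming BLAST+ tabular output (`-outfmt 6`, with the 12 default columns) from searching the singletons against the other proteomes. The function name and the thresholds (E-value 1e-5, 50% identity) are illustrative assumptions, not values recommended in this thread.

```python
def singletons_without_hits(singleton_ids, blast_lines,
                            max_evalue=1e-5, min_identity=50.0):
    """Return singleton IDs that have no convincing hit in the BLAST table.

    Expects tab-separated lines with the default outfmt 6 columns:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send
    evalue bitscore
    """
    with_hits = set()
    for line in blast_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip comments or malformed lines
        query, subject = fields[0], fields[1]
        if query == subject:
            continue  # ignore self-hits
        identity = float(fields[2])
        evalue = float(fields[10])
        if evalue <= max_evalue and identity >= min_identity:
            with_hits.add(query)
    return sorted(set(singleton_ids) - with_hits)


# Hypothetical example rows.
rows = [
    "P2\tQ8\t92.1\t210\t16\t0\t1\t210\t5\t214\t1e-120\t410",
    "P5\tQ3\t31.0\t80\t50\t3\t10\t89\t2\t80\t0.5\t28.1",
]
print(singletons_without_hits(["P2", "P5", "P6"], rows))
```

Singletons that drop out of the returned list have a plausible homolog elsewhere and may deserve a second look.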
My dataset includes all the proteins for a particular bacterial taxon, but only those that could be found in NCBI within a short period of time.
They are the complete translated gene sets, so I will have to state exactly when I ran the program (OrthoMCL) and that I considered NCBI data only. Other databases may well be more complete and have more sequenced and translated genomes, but that is not my concern. Blasting the singletons against genomes from other taxa is not required; I need the information about this particular taxon only. I hope the option in OrthoMCL functions properly. I would like to check it with an independent program, but I don't know of one.
My data do not include translated transcriptomes, so I don't worry about these difficulties.
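If a second tool does produce its own singleton list, the cross-check described above comes down to comparing two sets of protein IDs. A minimal sketch, assuming both lists use the same identifiers; the function name and the example IDs are hypothetical:

```python
def compare_singleton_sets(ids_a, ids_b):
    """Split two singleton ID lists into shared and tool-specific IDs."""
    a, b = set(ids_a), set(ids_b)
    return {
        "both": sorted(a & b),     # reported by both tools
        "only_a": sorted(a - b),   # reported by the first tool only
        "only_b": sorted(b - a),   # reported by the second tool only
    }


# Hypothetical example IDs.
result = compare_singleton_sets(["P2", "P5", "P6"], ["P5", "P6", "P9"])
print(result)
```

IDs that appear in only one list are the ones worth inspecting by hand.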
It turns out the question has already been discussed: "Tool for finding unique sets of proteins".
Some tools are even mentioned there. I have to study this - it may help.
I'd recommend using Usearch:
Dear Steven, you probably meant this program: http://drive5.com/usearch/manual/
In this example,
usearch --cluster_fast proteins.fasta --id 0.70 --centroids proteins_centroids.fasta
what are the input and output FASTA files? I have not found the unique-proteins option on their site yet, sorry. I've found that the program may help to get rid of singletons.
USEARCH command for discarding singletons
I'm afraid it won't help me. They provide hundreds of options, and I have to study them. It's a great program! http://drive5.com/usearch/manual/all_opts.html
Hi Natasha, sorry I wasn't very specific in my previous comment. 0.70 is the recommended id for proteins when clustering. Here is the man page for id: http://drive5.com/usearch/manual/opt_id.html
I think if the id is set to 1.0 (sequences must be 100% matching), the sequences can be clustered into a set of unique sequences, but I have not tested this myself. In general, clustering is used to eliminate redundant sequences, and I have used usearch with great success for this. Therefore I think that, with an id of 1.0, it might be possible to use usearch to perform the unique clustering you want.
For more information on clustering take a look at the wiki page: https://en.wikipedia.org/wiki/Sequence_clustering
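To illustrate what "unique at 100% identity" means, here is a small Python sketch that groups proteins with exactly identical sequences and reports those that occur only once. This is plain exact string matching, not USEARCH's algorithm, and the function names and sequences are made up for the example. A protein that is unique in this sense can still have a diverged ortholog, which is why lower identity thresholds are used for orthology-like clustering.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (id, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            # Keep only the first word of the header as the ID.
            header, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_identical(records):
    """Map each distinct sequence to the list of IDs that carry it."""
    groups = defaultdict(list)
    for seq_id, seq in records:
        groups[seq.upper()].append(seq_id)
    return groups


# Hypothetical example sequences.
fasta = """>p1 first copy
MKTAYIAK
>p2 second copy
MKTAYIAK
>p3 no duplicate
MSLLNQ
""".splitlines()

groups = group_identical(read_fasta(fasta))
unique_ids = sorted(ids[0] for ids in groups.values() if len(ids) == 1)
print(unique_ids)
```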
Dear Steven,
Thank you very much!
But it seems to me that this approach implies that I already know my set of unique proteins and compare them with the database of all the proteins I have in this taxon.
But let's suppose I don't have any known proteins at all. How would it be better to start? The program definitely knows how to search for the unique proteins, since there is a tool to get rid of them.
It's a mystery...
Hi Natasha, the nice thing about usearch is that it compares each sequence to the other sequences in the file, so you don't need a reference file with unique proteins. So maybe try the cluster_fast command above, where input.fasta is the database of proteins you have from the taxon, and output.fasta is the file where any unique sequences will be output after clustering. After running usearch, you can use
grep -c ">" input.fasta
and then
grep -c ">" output.fasta
to see if the total number of sequences decreased. Again, I haven't tested usearch to find unique sequences but it might be worth a try!
Dear Steven, it is definitely worth a try; moreover, I don't see any nice alternative way.
Many thanks! I've just learnt about this program, who knows?
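A note on the count comparison suggested above: the centroids file holds one representative per cluster, so a drop in the count shows that some sequences were merged, but it does not say which clusters have a single member. One possible way to get at that, sketched here under assumptions, is to have usearch also write a cluster table (its `-uc` option) and keep the clusters of size 1. The sketch assumes the tab-separated .uc layout in which `C` records carry the cluster size in the third column and the centroid label in the ninth; the function names and example rows are hypothetical, so please check them against the USEARCH manual.

```python
def count_fasta_records(lines):
    """Count FASTA records, similar in spirit to grep -c ">"."""
    return sum(1 for line in lines if line.startswith(">"))


def singleton_centroids(uc_lines):
    """Return labels of clusters that contain exactly one sequence.

    Assumes .uc rows where the record type is in column 1, the cluster
    size (for C records) in column 3 and the centroid label in column 9.
    """
    singles = []
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[0] != "C":
            continue  # only cluster summary records are of interest
        if int(fields[2]) == 1:
            singles.append(fields[8])
    return singles


# Hypothetical example rows, not real usearch output.
uc_rows = [
    "S\t0\t210\t*\t*\t*\t*\t*\tprotA\t*",
    "H\t0\t208\t98.0\t+\t0\t0\t208M\tprotB\tprotA",
    "S\t1\t150\t*\t*\t*\t*\t*\tprotC\t*",
    "C\t0\t2\t*\t*\t*\t*\t*\tprotA\t*",
    "C\t1\t1\t*\t*\t*\t*\t*\tprotC\t*",
]
print(singleton_centroids(uc_rows))
print(count_fasta_records([">protA", "MKT", ">protC", "MSL"]))
```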