Question

Tools to find the unique proteins (without orthologs) in a bacterial taxon

9.4 years ago

natasha.sernova ★ 4.0k

Dear ALL,

I've used the OMA program to find the set of orthologous proteins in a bacterial taxon. Unfortunately, this very nice program does not output a set of unique proteins.

I know that OrthoMCL and the OrthoDB tools can do it, but I was not very successful in finding these proteins with them. Are there any other tools to find the unique genes in a bacterial taxon?

I need all proteins that do not have any orthologs in this taxon, that is, only the unique proteins (singletons). Would you be so kind as to give me some hints?

Thank you very much!

Natasha

orthologs protein taxon OMA bacteria • 4.2k views

updated 23 months ago by Ram 44k • written 9.4 years ago by natasha.sernova ★ 4.0k

How did you run OrthoMCL? What do you mean by "I was not very successful"? Did you get any results, or did you not even manage to get it running? What kind of data do you have: predicted peptides from genomes, proteins downloaded from nr, etc.?

In my experience, OrthoMCL outputs all the clusters it finds, including clusters with only one gene from one taxon.

updated 23 months ago by Ram 44k • written 9.4 years ago by h.mon 35k

Sorry, I was not quite accurate. I did get a list of singletons from OrthoMCL. I would like to make sure this list of proteins is correct and complete, and that other programs give the same set. I took all the proteins for this bacterial taxon from NCBI, i.e. the .faa dataset for each bacterium in the taxon. I discarded only the proteins shorter than 50 aa that had no annotation. I kept all the proteins longer than 100 aa regardless of their annotation.

updated 23 months ago by Ram 44k • written 9.4 years ago by natasha.sernova ★ 4.0k

If your datasets are the complete translated gene sets, then the OrthoMCL singletons are a good place to start. You could BLAST the singletons against the other genomes to check whether they are actually present there and were simply missed.

If your data include translated transcriptomes, I am less sure, because transcriptomes often lack genes that are not expressed in a particular tissue or developmental stage, or that were not sampled because of low expression.
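
The BLAST check suggested above could be followed up with a small script. This is only a sketch, assuming the singletons were searched against the other proteomes or genomes (blastp or tblastn) and the hits were saved in BLAST tabular format (`-outfmt 6`). The protein IDs, the example rows, and the e-value cutoff of 1e-5 are hypothetical and chosen purely for illustration.

```python
import csv
import io

def singletons_without_hits(singleton_ids, blast_tab, max_evalue=1e-5):
    """Return singleton IDs that have no hit at or below max_evalue.

    blast_tab is an open file (or any iterable of lines) in BLAST tabular
    format (-outfmt 6): qseqid, sseqid, pident, length, mismatch, gapopen,
    qstart, qend, sstart, send, evalue, bitscore.
    """
    with_hits = set()
    for row in csv.reader(blast_tab, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        query, subject, evalue = row[0], row[1], float(row[10])
        # Ignore self-hits in case the query proteome is part of the database.
        if query != subject and evalue <= max_evalue:
            with_hits.add(query)
    return sorted(set(singleton_ids) - with_hits)

# Hypothetical hits: protA has a strong hit elsewhere, protB only a weak one,
# and protC has no hit at all.
example = io.StringIO(
    "protA\tgenome2_p17\t85.0\t200\t30\t0\t1\t200\t1\t200\t1e-80\t300\n"
    "protB\tgenome3_p02\t30.0\t40\t28\t0\t5\t44\t9\t48\t2.5\t25\n"
)
print(singletons_without_hits(["protA", "protB", "protC"], example))
```

Singletons that remain in the returned list are the ones with no significant match elsewhere under the chosen cutoff; the cutoff itself is a judgment call and would need tuning for real data.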

updated 23 months ago by Ram 44k • written 9.4 years ago by h.mon 35k

My dataset includes all the proteins for a particular bacterial taxon, but only the proteins that were available in NCBI during a certain short period of time.

They are the complete translated gene sets, so I will have to state exactly when I ran the program (OrthoMCL) and that I considered NCBI data only. Other databases may well be more complete and contain more sequenced and translated genomes, but that is outside my scope. BLASTing the singletons against genomes from other taxa is not required; I need information about this particular taxon only. I hope the option in OrthoMCL works properly. I would like to check it with an independent program, but I don't know of one.

My data do not include translated transcriptomes, so I don't need to worry about those difficulties.

updated 23 months ago by Ram 44k • written 9.4 years ago by natasha.sernova ★ 4.0k

It turns out this question has already been discussed: "Tool for finding unique sets of proteins".

Some tools are even mentioned there. I have to study this; it may help.

updated 23 months ago by Ram 44k • written 9.4 years ago by natasha.sernova ★ 4.0k

I'd recommend using Usearch:

usearch --cluster_fast proteins.fasta --id 0.70 --centroids proteins_centroids.fasta

updated 23 months ago by Ram 44k • written 9.4 years ago by steven ▴ 70

Dear Steven, you probably mean this program: http://drive5.com/usearch/manual/

In this example, usearch --cluster_fast proteins.fasta --id 0.70 --centroids proteins_centroids.fasta, what are the input and output FASTA files? I have not found an option for unique proteins on their site yet, sorry.

I've found that the program may help to get rid of singletons.

USEARCH command for discarding singletons

usearch -sortbysize derep.fasta -output derep2.fasta -minsize 2

I'm afraid that won't help me. They provide hundreds of options, so I will have to study them. It's a great program! http://drive5.com/usearch/manual/all_opts.html

updated 23 months ago by Ram 44k • written 9.4 years ago by natasha.sernova ★ 4.0k

Hi Natasha, sorry I wasn't very specific in my previous comment. 0.70 is the recommended id for proteins when clustering. Here is the man page for id: http://drive5.com/usearch/manual/opt_id.html

I think that if the id is set to 1.0 (sequences must match 100%), the sequences can be clustered into a set of unique sequences, but I have not tested this myself. In general, clustering is used to eliminate redundant sequences, and I have used usearch with great success for this. So I think that, with an id of 1.0, it might be possible to use usearch for the unique-clustering function you want.

For more information on clustering take a look at the wiki page: https://en.wikipedia.org/wiki/Sequence_clustering
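
One possible way to get singletons out of a clustering run, rather than only the centroids, is to look at cluster sizes. The sketch below assumes the run also wrote a cluster table in usearch's tab-separated UC format (the `-uc` output option, which is not part of the command above), where `C` records carry the cluster size in the third field and the centroid label in the ninth. The records shown are hand-written, hypothetical examples, not real usearch output.

```python
def singleton_centroids(uc_lines):
    """Return the labels of clusters that contain exactly one sequence.

    uc_lines is any iterable of lines in UC format. Only 'C' (cluster
    summary) records are inspected: field 3 is the cluster size and
    field 9 is the centroid label.
    """
    singles = []
    for line in uc_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[0] != "C":
            continue  # skip S/H/N records and malformed lines
        if int(fields[2]) == 1:
            singles.append(fields[8])
    return singles

# Hypothetical records: protA and protB cluster together, protC is alone.
uc_example = [
    "S\t0\t310\t*\t*\t*\t*\t*\tprotA\t*\n",
    "H\t0\t305\t92.1\t.\t0\t0\t305M\tprotB\tprotA\n",
    "S\t1\t150\t*\t*\t*\t*\t*\tprotC\t*\n",
    "C\t0\t2\t*\t*\t*\t*\t*\tprotA\t*\n",
    "C\t1\t1\t*\t*\t*\t*\t*\tprotC\t*\n",
]
print(singleton_centroids(uc_example))
```

A cluster of size one only means that no other sequence in the file passed the chosen identity threshold, so this is a rough proxy for "no ortholog" and not a substitute for a dedicated orthology tool.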

updated 23 months ago by Ram 44k • written 9.4 years ago by steven ▴ 70

Dear Steven,

Thank you very much!

But it seems to me that this approach implies that I already know my set of unique proteins and compare them with the database of all the proteins I have for this taxon.

But suppose I don't have any known proteins at all. What would be the best way to start? The program definitely knows how to find the unique proteins, since there is an option to get rid of them.

It's a mystery...

updated 23 months ago by Ram 44k • written 9.4 years ago by natasha.sernova ★ 4.0k

Hi Natasha, the nice thing about usearch is that it compares each sequence to the other sequences in the file - so you don't need a reference file with unique proteins. So maybe try:

usearch --cluster_fast input.fasta --id 1.0 --centroids output.fasta

where input.fasta is the database of proteins you have from the taxon, and output.fasta is the file where any unique sequences will be output after clustering. After running usearch, you can use grep -c ">" input.fasta and then grep -c ">" output.fasta to see if the total number of sequences decreased. Again, I haven't tested usearch to find unique sequences but it might be worth a try!
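
For intuition about what the before/after count measures, here is a pure-Python sketch that collapses exactly identical sequences and compares the record counts, much like the two grep -c calls. It is not how usearch works internally: it uses exact string matching only, so its counts need not agree with an alignment-based run at id 1.0 (for example, a fragment and its full-length protein may be treated differently). The sequences and names are made up.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def collapse_identical(records):
    """Keep the first record for each distinct sequence (case-insensitive)."""
    seen = {}
    for header, seq in records:
        seen.setdefault(seq.upper(), header)
    return [(header, seq) for seq, header in seen.items()]

# Hypothetical input: p1 and p2 are the same protein, wrapped differently.
fasta_example = io.StringIO(
    ">p1\nMKTAYIAK\nQRQISFVK\n>p2\nMKTAYIAKQRQISFVK\n>p3\nMSLLTEVE\n"
)
records = list(read_fasta(fasta_example))
unique = collapse_identical(records)
print(len(records), len(unique))
```

If the two counts are equal, the input had no exact duplicates. Either way, the surviving records are a non-redundant set, which is not the same thing as the set of proteins without orthologs.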

updated 23 months ago by Ram 44k • written 9.4 years ago by steven ▴ 70

Dear Steven, it is definitely worth a try; moreover, I don't see any good alternative.

Many thanks! I've just learnt about this program, who knows?

updated 23 months ago by Ram 44k • written 9.4 years ago by natasha.sernova ★ 4.0k

Ram · Accepted Answer · 2015-06-30

9.4 years ago

h.mon 35k

In addition to OrthoMCL, Proteinortho has a command-line option to output "singles" clusters. It also has an option to take synteny into account when predicting clusters, which may interest you.

Regarding steven's suggestion of using usearch, I do remember seeing (either in the uclust / usearch manual or in the cd-hit manual) a quick-and-dirty ortholog prediction method based on progressively clustering at less stringent similarity thresholds, but I can't find it again. Finally, there is an alternative to usearch, vsearch, which aims to be a faster, open-source drop-in replacement for usearch.

updated 23 months ago by Ram 44k • written 9.4 years ago by h.mon 35k

Unfortunately, the link for Proteinortho is not reliable. I failed to find a better link to it.

http://www.bioinf.uni-leipzig.de/Software/proteinortho/

The site above seems to be valid.

Dear colleagues, THANK YOU VERY MUCH FOR YOUR HELP!

updated 23 months ago by Ram 44k • written 9.4 years ago by natasha.sernova ★ 4.0k