Prokka annotation with too many "hypothetical proteins", and genome comparison
1 day ago
avinci1 ▴ 20

Dear all,

I'm not an expert. I'm a PhD student in entomology who has been asked to do some bioinformatic analysis just because I know the very basics of BLAST, so I need simple and easy explanations. Sorry for that. I'm having problems comparing two bacterial genomes. I have the Prokka files from the assembly of both, but there are too many "hypothetical proteins" and I need to know what these proteins are. How can I manage that? There are too many to do by hand with BLAST. I'm trying to use some of the tools available on Galaxy Europe, but I have no idea which suits best.

Once I have identified these proteins, I have to compare the two genomes, and I don't know which tool to use.

Can somebody help me please?

comparison genome galaxy prokka • 444 views
ADD COMMENT

What exactly do you need to compare? Do you want to compare at the gene/protein function level, or just the absence/presence of genes?

You can (crude approach) run a BLAST search with multi-FASTA input (i.e. all your proteins in a single file) and take, for instance, the first/best hit as a proxy for the protein function.
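
As a rough sketch of the "best hit as a proxy" idea: assuming the search was run with BLAST+ tabular output (`-outfmt 6`, whose default columns are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore), the best hit per protein can be picked out with a few lines of Python. The function name `best_hits` and the protein/reference IDs below are made up for illustration.

```python
import io

def best_hits(lines):
    """Map each query ID to its highest-bitscore hit: (subject, % identity, bitscore)."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        identity, bitscore = float(fields[2]), float(fields[11])
        # Keep the hit only if it beats the best bitscore seen so far for this query.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, identity, bitscore)
    return best

# Made-up hits for two hypothetical proteins (IDs and values are illustrative only).
example = io.StringIO(
    "prot_0001\tref_X\t91.2\t300\t26\t0\t1\t300\t1\t300\t1e-150\t520\n"
    "prot_0001\tref_Y\t45.0\t280\t150\t4\t5\t284\t10\t289\t1e-40\t150\n"
    "prot_0002\tref_Z\t62.5\t200\t75\t0\t1\t200\t3\t202\t1e-70\t250\n"
)
hits = best_hits(example)
for query, (subject, identity, bitscore) in hits.items():
    print(query, subject, identity, bitscore)
```

With a real results file you would pass `open("hits.tsv")` instead of the in-memory example. A best hit is only a hint at function, so low-identity hits deserve a second look.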

ADD REPLY

I have two bacterial genomes from isolates of mine, and I have to know whether they are the same and, if not, how different they are. The problem is that in the Prokka files almost all proteins are called "hypothetical protein". The comparison tells me that they are the same, but I cannot be sure: the tool compares by name, so the sequences may still differ. I need to know the gene/protein function and whether the proteins are the same in the two bacteria.

I thought of doing the multi-FASTA search, but I thought it was a bit crude. I can give it a try though.

Thank you

ADD REPLY

What state are these genomes in? A single contig or multiple pieces? Were they assembled from short- or long-read data? Where did these bacteria come from, and do you expect them to be related?

ADD REPLY

Basically I sent the DNA of both bacteria to a sequencing service and they gave us the annotation files, with Excel files, Prokka annotations, BUSCO results, FASTQ files and so on.

So I have all these files for both genomes, but in the annotation there are too many hypothetical proteins. If I BLAST them by hand I can find out what they are, but I cannot do that for all of them. There are too many.

I need the comparison because I have to see whether the two bacteria are different strains of the same species, or whether I isolated the same bacterium from different samples.

ADD REPLY

"So I have all these files for both genomes"

But you left out an important answer. How many pieces (contigs) are there per genome? Is it a single fasta sequence or more than one (if so, how many) per genome?

Was this a reputable sequencing service, one that you know would have done the best possible job with the assembly? Running data through an automated pipeline without any manual oversight will give results, but they need not be accurate.

If you simply need an identity answer then @Mensur has recommended a couple of programs.

ADD REPLY

There are 64 contigs for one genome and 114 for the other. The sequencing service should be reputable. I obtained the same output with a generic Prokka run, so I guess they made the minimum effort, but I cannot tell.

I will try all the suggestions you gave me. Thank you so much.

ADD REPLY

What do you mean by "too many"? It is normal to encounter a lot of hypotheticals unless it's a very well characterised organism.

If it is, the pipeline should handle identifying orthologues automatically quite well, but you can also provide it with a database of existing proteins/annotations (in Prokka the option is --proteins, which takes a FASTA or GenBank file of trusted proteins).

If you want to know whether the 2 sequences are of the same strain/species, comparing annotations is probably not the best approach. You'd be best off just getting a quick k-mer distance or something similar, I think.

Alternatively, you can take the FASTQ files and run them through a tool like Kraken or Centrifuge, which will identify the organism for you.

ADD REPLY

We tried to select the specific database for the bacterium, but the result was the same: all hypothetical proteins. At least half of the output was hypothetical. Let's say I use the annotation with all the hypotheticals. How can I compare the two genome annotations to understand whether they are different or not? At the moment, the output I get is that they are the same except for three proteins that have different names. But I'm not sure whether the hypothetical proteins match because they share the same name or because their sequences are the same. So I need to use the sequences for the comparison, not the annotation. But I don't know how to do that, other than running a multi-sequence BLAST with all the sequences I have.

ADD REPLY

What is the organism? 50% hypothetical isn't that crazy to me if it's something a bit unusual. I spent my PhD working on a gene cluster where only about 3 out of ~17 proteins had an annotation that was anything other than "hypothetical protein".

You might have better luck not specifying the database in that case (it may be overly restrictive). You can also set lower sequence identity thresholds to try to get more annotations. I would consider using a tool like HH-suite on your predicted CDS amino acid sequences to get more 'distant' annotations.

If all you care about is "are these 2 sequencing results from the same species/strain" then the annotations are not really the important thing (unless you are specifically looking for some particular conserved gene etc). What you actually need is a measure of 'sameness'. For this, a k-mer distance or a simple BLAST identity would probably be sufficient (but Kraken etc. are the more robust approaches for this).
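
To make "k-mer distance" concrete, here is a toy sketch, assuming the simplest possible definition: the Jaccard distance between the sets of k-mers found in two DNA sequences, with each k-mer and its reverse complement counted as the same thing. The function names and the choice of k are illustrative, and for real genomes a dedicated tool will be much faster than this.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(kmer):
    """Return the lexicographically smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)

def kmer_set(seq, k):
    """Collect the set of canonical k-mers in a DNA sequence."""
    seq = seq.upper()
    return {canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)}

def jaccard_distance(seq_a, seq_b, k=21):
    """1 - |shared k-mers| / |all k-mers|: 0.0 means identical k-mer content."""
    a, b = kmer_set(seq_a, k), kmer_set(seq_b, k)
    if not a and not b:
        return 0.0  # nothing to compare, treat as identical
    return 1.0 - len(a & b) / len(a | b)

# Tiny made-up sequences; real genomes would be read from the assembly FASTA files.
print(jaccard_distance("AAACCCGGGT", "ACCCGGGTTT", k=4))  # second is the reverse complement of the first
print(jaccard_distance("AAAAAAAA", "CCCCCCCC", k=4))
```

A distance near 0 means the two genomes share almost all of their k-mers, whatever the annotation says.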

If you specifically want a gene-by-gene comparison, then what you're probably looking at doing is exactly what your intuition is telling you - BLAST. More specifically though, you want what is called a reciprocal best hit (RBH) BLAST (https://www.protocols.io/view/reciprocal-best-hit-blast-x54v9rezv3eq/v2).

Basically, try to pair up the sequences of all your predicted genes between the two genomes (AA or DNA sequences, it doesn't really matter). You can then compute statistics on the BLAST hits to get a sense of similarity (e.g. how many genes are >99% identical).
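
A minimal sketch of the pairing step, assuming you have already run BLAST in both directions (genome A proteins against genome B, and B against A) with tabular `-outfmt 6` output. The helper names (`top_hits`, `reciprocal_best_hits`, `row`) and all IDs and numbers are made up for illustration.

```python
def top_hits(lines):
    """Map each query to its highest-bitscore hit: (subject, % identity, bitscore)."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.rstrip("\n").split("\t")
        query, subject, identity, bits = f[0], f[1], float(f[2]), float(f[11])
        if query not in best or bits > best[query][2]:
            best[query] = (subject, identity, bits)
    return best

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return (a_id, b_id, % identity) where each protein is the other's best hit."""
    forward, reverse = top_hits(a_vs_b), top_hits(b_vs_a)
    pairs = []
    for a_id, (b_id, identity, _) in forward.items():
        if b_id in reverse and reverse[b_id][0] == a_id:
            pairs.append((a_id, b_id, identity))
    return pairs

def row(query, subject, identity, bits):
    """Build one made-up 12-column tabular line (only 4 columns matter here)."""
    return "\t".join([query, subject, str(identity), "300", "0", "0",
                      "1", "300", "1", "300", "1e-100", str(bits)])

a_vs_b = [row("A_1", "B_1", 100.0, 600), row("A_2", "B_2", 97.0, 550),
          row("A_3", "B_2", 60.0, 200)]
b_vs_a = [row("B_1", "A_1", 100.0, 600), row("B_2", "A_2", 97.0, 550),
          row("B_3", "A_1", 40.0, 90)]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
near_identical = sum(1 for _, _, identity in pairs if identity >= 99.0)
print(pairs)
print(f"{near_identical} of {len(pairs)} paired proteins are >=99% identical")
```

The annotation text never enters into it, so "hypothetical protein" entries are paired like any others.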

ADD REPLY

Thank you so much for your help. I don't know how usual it is to have 50% hypothetical proteins. To give you a number, there are around 2000 proteins, and doing them one by one on BLAST seemed like too much work to me. I can't give the details since this is not for my PhD: I'm helping another PhD student with his project, and my actual PhD work is something else. That's why it looked like a lot of work to me, if I have to do it by hand. ahahah

"If all you care about is "are these 2 sequencing results from the same species/strain" then the annotations are not really the important thing (unless you are specifically looking for some particular conserved gene etc)" --> I'm looking for some active proteins and I want to see whether they differ from the ones in other strains, which is why I was using the annotation. But if all I have is hypothetical proteins, I don't know what they are, whether they are what I'm looking for, and so on. I expect to find the same conserved genes, but for active proteins or compounds I expect some differences; that's why the hypothetical proteins are a problem.

Anyway I will try your suggestions.

Another question. Now I have to do this with a bacterium, but later I will have to do the same thing with a fungus. Since it's a eukaryotic organism, can I use the same tools you mentioned, or do they not work with eukaryotes, since they have introns and so on?

ADD REPLY

Are you working with nucleotide sequences or amino acid sequences? It's best to perform all these kinds of sequence comparisons with the latter in your case. This way, also, irrespective of the kind of organisms under consideration, sequence comparisons should remain tractable (i.e., you will not have to worry about introns and company unless you're specifically interested in examining such features).

Also, if what you're mostly interested in is trying to find out whether a pair of genomes has counterparts for genes/proteins found in one or the other (of these genomes), you should probably look into reciprocal best hits (e.g., with MMseqs2 rbh; see https://mmseqs.com/latest/userguide.pdf, section "Reciprocal best hit using mmseqs rbh"). This paper (https://doi.org/10.1186/s12864-020-07132-6) may also be of interest to you as it lists some other relevant tools. Reciprocal best hits will find homologs (i.e., evolutionary counterparts) in the sequence sets under consideration and will do this irrespective of whether the sequences are annotated or not. So even if you have a bunch of sequences annotated as "hypothetical", it won't matter. Once you find the sequence counterparts, you can then look at things like percentage sequence identity to examine just how similar the two sequences are (all this information will be made available by tools like MMseqs2 rbh as part of their output).
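
Once you have a list of reciprocal-best-hit pairs from whichever tool you use, the proteins that differ between the genomes are simply the ones left unpaired. A small sketch, assuming the pairs have already been loaded as (ID in genome A, ID in genome B) tuples; the function name and the IDs are hypothetical.

```python
def without_counterpart(ids_a, ids_b, rbh_pairs):
    """Return the protein IDs in each genome that have no reciprocal best hit in the other."""
    matched_a = {a_id for a_id, _ in rbh_pairs}
    matched_b = {b_id for _, b_id in rbh_pairs}
    only_a = sorted(set(ids_a) - matched_a)
    only_b = sorted(set(ids_b) - matched_b)
    return only_a, only_b

# Made-up protein IDs; in practice these come from the FASTA headers of each proteome.
ids_a = ["A_1", "A_2", "A_3"]
ids_b = ["B_1", "B_2", "B_3", "B_4"]
rbh_pairs = [("A_1", "B_1"), ("A_2", "B_2")]

only_a, only_b = without_counterpart(ids_a, ids_b, rbh_pairs)
print("Only in genome A:", only_a)
print("Only in genome B:", only_b)
```

The unpaired proteins are the short list worth inspecting by hand, which is far less work than checking every protein.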

ADD REPLY

I'm working with everything I have, to be honest. I will work with amino acid sequences and try MMseqs2.

Thank you so much for the help!

ADD REPLY

You are welcome. I wish you good luck!!

ADD REPLY

As mentioned, 50% wouldn't be that crazy for something unusual.

You don't need to do them one by one - basically all modern bioinformatics tools support multifasta formats so all of your sequences can be searched in a single go.
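
For example, a short script can pull every "hypothetical protein" entry out of a protein multi-FASTA so that they can all be searched in one go. This is only a sketch: it assumes the product name appears in the FASTA header line (as in the protein FASTA that Prokka writes), and the function names and locus tags are made up.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) tuples from a multi-FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def hypothetical_only(records):
    """Keep only records whose header mentions 'hypothetical protein'."""
    return [(h, s) for h, s in records if "hypothetical protein" in h.lower()]

# Made-up records standing in for a real protein FASTA file.
example = io.StringIO(
    ">TAG_00001 hypothetical protein\nMKTLLV\nAAGH\n"
    ">TAG_00002 DNA gyrase subunit A\nMSDLAR\n"
    ">TAG_00003 hypothetical protein\nMNNQ\n"
)
selected = hypothetical_only(read_fasta(example))
for header, seq in selected:
    print(f">{header}\n{seq}")
```

The printed output is itself a multi-FASTA, so it can be saved and passed straight to a search tool.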

If as you say you care about a specific subset of genes, I would simply treat one set of sequences as a reference (or download a 3rd genome from a public repository that is well curated to serve as your 'reference' for both of your new ones). If you know the genes of interest, I assume you have at least some of the sequences for these known? If you have no reference sequence and no clear annotation, there will be no easy way to 'extract' these from the sequencing you have.

Of the tools I mentioned, BLAST will work with everything. Prokka and Kraken are microbe-specific (and, I think, prokaryote-specific to be exact). Centrifuge works with eukaryotic sequences as far as I know.

It depends a bit on what the input data is (if you have the protein sequences, for example, you don't need to worry about introns; however, some tools will not work for other reasons, such as the databases they query).

ADD REPLY
1 day ago
Mensur Dlakic ★ 29k

FastANI will calculate average nucleotide identity (ANI) for whole genomic sequences. FastAAI will calculate the protein-level equivalent, average amino acid identity (AAI), for whole proteomes. The accuracy of protein annotations doesn't matter in either case.
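
If it helps, here is a small sketch for reading a FastANI result. It assumes the tab-delimited output described in the FastANI README (query genome, reference genome, ANI value, count of bidirectional fragment mappings, total query fragments). The class and function names, file names and numbers are made up.

```python
from typing import NamedTuple

class AniResult(NamedTuple):
    query: str
    reference: str
    ani: float
    mapped_fragments: int
    total_fragments: int

    @property
    def aligned_fraction(self):
        """Share of query fragments that mapped to the reference."""
        if self.total_fragments == 0:
            return 0.0
        return self.mapped_fragments / self.total_fragments

def parse_fastani_line(line):
    """Parse one tab-delimited FastANI output row into an AniResult."""
    query, reference, ani, mapped, total = line.rstrip("\n").split("\t")
    return AniResult(query, reference, float(ani), int(mapped), int(total))

# Made-up output row; a real one comes from the file written by FastANI.
result = parse_fastani_line("genomeA.fasta\tgenomeB.fasta\t99.87\t1500\t1600\n")
print(f"ANI = {result.ani:.2f}%, aligned fraction = {result.aligned_fraction:.2%}")

# Assumption: ~95% ANI is a commonly cited rule of thumb for "same species",
# not a hard rule, and there is no agreed cutoff for "same strain".
print("Above ~95% ANI:", result.ani >= 95.0)
```

Looking at the aligned fraction alongside the ANI value is a design choice here: a high ANI computed over only a small part of the genome says less than the same value over nearly all of it.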

ADD COMMENT

Thank you! I will try that!

ADD REPLY
