Hi there,
I'm running a pangenome analysis on 20 bacterial strains using Roary version 3.13 on a server running CentOS 7. I want to get the genes that are present in only one of those 20 strains but, as I've already asked here, I cannot get them.
When running:
query_pan_genome -a difference --input_set_one 1.gff --input_set_two 2.gff 3.gff 4.gff .... -g clustered_proteins
I get a CSV file with some clusters that are supposed to be unique to strain 1, but they are not! If I retrieve the sequence using the sqlite3 db suggested here and BLAST it, I find a perfect match in one of the other 19 strains (the reference one, by the way). Moreover, these genes are properly annotated functionally in the reference (e.g. short-chain dehydrogenase), while in the CSV they are "hypothetical protein" (but that is probably a Prokka annotation failure). I also tried to select the only-one-strain column from the clustered_proteins file as suggested here, but still got wrong ones. After reading this other issue, I tried the option -s, but I just got fewer "unique" clusters, and still wrong ones. What's the problem? Is Roary really supposed to behave like this or not?
Thanks, Silvia
Hi Sissi,
Just to understand: since you mentioned Prokka, did you use it to re-annotate the genomes? If so, that could be the problem. You basically have two different versions of each genome annotation: the one from Prokka and the one from the NCBI Prokaryotic Genome Annotation Pipeline.
Hi Andres, Thank you for your reply.
I first downloaded the FASTA file from NCBI Genome (this is my reference, as an example) and then ran Prokka:
This is because I'm also using a newly sequenced strain that has not been deposited yet, so I wanted to start from the same annotation for all genomes. Then I ran Roary in two ways:
And this brought me to the problem above.
This is just my opinion. If the reference genomes have already been annotated, there is no need to run the annotation again, unless you can demonstrate that your annotation pipeline is far better than the ones used for the reference genomes. You should use the GFF files from the NCBI database; that is your reference. Keep in mind that different annotation pipelines will give you different results. Therefore, if you are mainly interested in clusters occurring in only one strain, use tblastn to double check that your genes of interest are actually missing from the other strains.
Following your suggestion, I tried to use the NCBI GFF files together with the Prokka GFF file for the newly sequenced strain, but first got:
and then stopped:
So I tried removing the only GFF that came from Prokka, and still Roary doesn't like the GFF files from NCBI:
And there are no output files.
Edit. Btw, I got the same problems with unique genes also with other samples.
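For anyone who wants to automate the tblastn double check suggested earlier in the thread, here is a minimal sketch. It assumes tblastn was run with the default tabular output (`-outfmt 6`, 12 columns) using the candidate "unique" proteins as queries against the other genomes. The function name, the thresholds, and the example rows are hypothetical and only for illustration; coverage is approximated as alignment length divided by query length.

```python
import csv
from typing import Dict, Iterable, Set


def queries_with_hits(
    rows: Iterable[str],
    query_lengths: Dict[str, int],
    min_identity: float = 90.0,   # assumed threshold, tune for your data
    min_coverage: float = 0.8,    # assumed threshold, tune for your data
) -> Set[str]:
    """Return query IDs that have at least one strong hit.

    `rows` are lines of BLAST tabular output (-outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    `query_lengths` maps each query ID to its protein length (aa).
    """
    found: Set[str] = set()
    for fields in csv.reader(rows, delimiter="\t"):
        # Skip blank lines and comment lines (e.g. from -outfmt 7).
        if not fields or fields[0].startswith("#"):
            continue
        qseqid = fields[0]
        pident = float(fields[2])
        aln_length = int(fields[3])
        qlen = query_lengths.get(qseqid)
        if qlen is None:
            continue  # unknown query, cannot compute coverage
        coverage = aln_length / qlen
        if pident >= min_identity and coverage >= min_coverage:
            found.add(qseqid)
    return found


# Hypothetical example rows, not real results.
example_rows = [
    "geneA\tcontig1\t100.0\t250\t0\t0\t1\t250\t300\t1049\t1e-150\t500",
    "geneB\tcontig2\t45.0\t60\t30\t2\t1\t60\t10\t189\t1e-5\t50",
]
example_lengths = {"geneA": 250, "geneB": 300}
not_really_unique = queries_with_hits(example_rows, example_lengths)
```

Any query returned by this function has a close match in another genome, so it is probably not strain-specific regardless of what the clustering reports.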
Before using the NCBI GFF files, check this: https://github.com/sanger-pathogens/Roary/issues/120
Ok so,
NCBI+Prokka of course is not working.
(according to blastn).
This is amazing, really.
(Ps. I'm probably working where you got your PhD ;) )
Finding my email should not be a problem then. If you contact me we can definitely solve this problem :).
This is the best option. The problem is that the GFF files from NCBI do not contain the nucleotide sequence at the end of the file; hence, you need to find a tool that converts a gbk file into a gff format compatible with Roary.
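To illustrate what "the nucleotide sequence at the end of the file" means, here is a rough sketch that appends a `##FASTA` section (the standard GFF3 directive for embedded sequences) to an annotation and checks that the sequence IDs agree. The function names are hypothetical, and this only addresses the missing sequence: it is an assumption, not something confirmed in this thread, that the result would be accepted by Roary, since the issue linked above discusses other differences in NCBI GFF files.

```python
from typing import Set


def gff_seqids(gff_text: str) -> Set[str]:
    """Collect sequence IDs from column 1 of the GFF feature lines."""
    ids: Set[str] = set()
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        ids.add(line.split("\t", 1)[0])
    return ids


def fasta_seqids(fasta_text: str) -> Set[str]:
    """Collect sequence IDs (first token after '>') from FASTA headers."""
    ids: Set[str] = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def append_fasta_to_gff(gff_text: str, fasta_text: str) -> str:
    """Return the GFF text with the FASTA appended after '##FASTA'.

    Raises ValueError if the GFF already embeds sequences or if a
    feature refers to a sequence that is absent from the FASTA.
    """
    if any(l.strip() == "##FASTA" for l in gff_text.splitlines()):
        raise ValueError("GFF already contains a ##FASTA section")
    missing = gff_seqids(gff_text) - fasta_seqids(fasta_text)
    if missing:
        raise ValueError(f"sequences missing from FASTA: {sorted(missing)}")
    return f"{gff_text.rstrip()}\n##FASTA\n{fasta_text.strip()}\n"


# Hypothetical toy input, not a real genome.
toy_gff = "##gff-version 3\nseq1\tsrc\tCDS\t1\t9\t.\t+\t0\tID=cds1\n"
toy_fasta = ">seq1 toy sequence\nATGAAATAA\n"
combined = append_fasta_to_gff(toy_gff, toy_fasta)
```

The ID check matters because a mismatch between the annotation's sequence names and the FASTA headers is an easy mistake to make when the two files are downloaded separately.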