Hello -
I have run Prokka to generate GFF3 files, which were then passed to Roary to create a gene presence-absence data frame (gene_presence_absence.csv; see below for an example). I have now been asked to associate the gene names from gene_presence_absence.csv with locus tags from a particular genome assembly (https://www.ncbi.nlm.nih.gov/genbank/genome_locustag/), for example the 'lpt_' locus tags associated with the Legionella pneumophila Toronto 2005 strain (https://www.ncbi.nlm.nih.gov/nuccore/CP012019.1/).
Why is this difficult to perform? Not all genes listed in gene_presence_absence.csv appear in the genome assembly under the same shorthand gene name; many are identifiable only via the annotation column (e.g. gyrA is listed as DNA gyrase subunit A with locus tag lpt_06745 in GenBank: CP012019.1).
There must be a straightforward way to do this, but I'm having trouble finding it. Any help is much appreciated!
Much-reduced example of the gene_presence_absence.csv file
Gene | Annotation | #isolates | #sequences |
---|---|---|---|
macA | Macrolide export protein MacA | 241 | 241 |
gyrA | DNA gyrase subunit A | 241 | 241 |
dnaA | Chromosomal replication initiator protein DnaA | 241 | 241 |
Using Python3, here is what I have so far:
Output:
In this output, only 122 entries have a gene name, because only that many are annotated with one in the assembly (CP012019.1), whereas the Roary output identifies ~2500 gene names. How would I go about increasing the number of gene names associated with locus tags?
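One possible way to get past the short gene names, sketched here under assumptions rather than offered as a tested pipeline: match Roary's Annotation column against the product description of each CDS in the reference record. The sketch assumes the (locus_tag, product) pairs have already been pulled out of the GenBank file, that the CSV columns are called "Gene" and "Annotation" as in the table above, and the helper names `normalize` and `map_by_annotation` are made up for illustration. Only the gyrA entry comes from this post.

```python
import csv
import io
import re


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so product strings compare loosely."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def map_by_annotation(roary_csv_text, reference_features):
    """Return {Gene: locus_tag or None} by matching Annotation to product.

    reference_features is a list of (locus_tag, product) pairs, assumed to have
    been extracted from the CDS features of the GenBank record beforehand.
    """
    by_product = {}
    for locus_tag, product in reference_features:
        by_product.setdefault(normalize(product), []).append(locus_tag)

    mapping = {}
    for row in csv.DictReader(io.StringIO(roary_csv_text)):
        hits = by_product.get(normalize(row["Annotation"]), [])
        # Accept only unambiguous matches, since paralogs can share a product name.
        mapping[row["Gene"]] = hits[0] if len(hits) == 1 else None
    return mapping


# Cut-down stand-in for gene_presence_absence.csv (only two columns kept).
roary_csv_text = '''"Gene","Annotation"
"macA","Macrolide export protein MacA"
"gyrA","DNA gyrase subunit A"
'''
# The gyrA pair is the one quoted in the post; a real list would come from the GenBank file.
reference_features = [("lpt_06745", "DNA gyrase subunit A")]

mapping = map_by_annotation(roary_csv_text, reference_features)
print(mapping)
```

Genes whose product text differs between Prokka and the NCBI record, or whose product is shared by several loci, come back as None here and would still need another route.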
If the annotations do not exist in the NCBI record then you can't really do that.
Yea, that is what I was thinking, too, but maybe that is not the most accurate way to pose the question. Need to think a bit more...
I did not check every gene but most (all) seem to have the `lpt_*` tags in the GenBank record you linked above. So perhaps it is a matter of mapping your `roary` results correctly to these?

It's been a while since I did this - but I have a vague recollection that you may be able to force `roary` to use the locus tags with an extra flag?

More broadly, I think trying to match up a database of gene names/locus tags/annotations is absolutely fraught with potential for error. If there is some way you can link up the sequences instead, the best other alternative would be to do sequence lookups/100% BLAST match finding and pair them off that way.
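To make the sequence-pairing idea a little more concrete, here is a minimal sketch of the exact-match part only. It is plain string identity, not BLAST, so it would miss near-identical alleles. It assumes you have two protein FASTA files, one with the reference CDS translations keyed by locus tag and one with a representative sequence per Roary gene; the function names and the toy sequences and IDs below are invented for illustration.

```python
def read_fasta(text):
    """Parse FASTA text into {first word of header: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def pair_by_identity(reference, query):
    """Return {query_id: locus_tag or None} for 100% identical sequences."""
    by_sequence = {}
    for locus_tag, seq in reference.items():
        # Drop a trailing stop-codon '*' so both sides compare equally.
        by_sequence.setdefault(seq.rstrip("*"), []).append(locus_tag)

    pairs = {}
    for query_id, seq in query.items():
        hits = by_sequence.get(seq.rstrip("*"), [])
        # Identical duplicates in the reference are ambiguous, so leave them unpaired.
        pairs[query_id] = hits[0] if len(hits) == 1 else None
    return pairs


# Toy sequences and IDs, invented purely for illustration.
reference_fasta = """>ref_tag_1 toy protein
MKTAYIAK
QR
>ref_tag_2 another toy protein
MSLNE
"""
query_fasta = """>geneX
MKTAYIAKQR
>geneY
MAAAA
"""

pairs = pair_by_identity(read_fasta(reference_fasta), read_fasta(query_fasta))
print(pairs)
```

Anything left as None after this pass is a candidate for a proper BLAST search with an identity and coverage cutoff of your choosing.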