Associate Roary output with locus_tag from specific genome
5 months ago

Hello -

I ran Prokka to generate GFF3 files, which were then passed to Roary to create a gene presence/absence table (gene_presence_absence.csv; see the example below). I have now been asked to associate the gene names in gene_presence_absence.csv with locus tags from a particular genome assembly (https://www.ncbi.nlm.nih.gov/genbank/genome_locustag/). For example, I would use the 'lpt_' locus tags of the Legionella pneumophila Toronto 2005 strain (https://www.ncbi.nlm.nih.gov/nuccore/CP012019.1/).

Why is this difficult? Not all genes listed in gene_presence_absence.csv appear in the genome assembly under the same short gene name; many are identifiable only through the annotation column. For example, in GenBank CP012019.1, gyrA is listed only as "DNA gyrase subunit A", with locus tag lpt_06745.

There must be a straightforward way to do this, but I'm having trouble finding it. Any help is much appreciated!

Much-reduced example of the gene_presence_absence.csv file:

Gene    Annotation                                        #isolates    #sequences
macA    Macrolide export protein MacA                     241          241
gyrA    DNA gyrase subunit A                              241          241
dnaA    Chromosomal replication initiator protein DnaA    241          241
bacteria locus_tags genomics • 728 views
ADD COMMENT

Using Python3, here is what I have so far:

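from Bio import SeqIO

# Read the single GenBank record downloaded for CP012019.1
record = SeqIO.read('sequenceCP012019.1.gb', 'genbank')

# Print every 'gene' feature together with its qualifiers (gene name, locus_tag)
for feature in record.features:
    if feature.type == 'gene':
        print(feature)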

Output:

type: gene
location: [751:2110](+)
qualifiers:
    Key: gene, Value: ['dnaA']
    Key: locus_tag, Value: ['lpt_00005']

type: gene
location: [2123:3227](+)
qualifiers:
    Key: locus_tag, Value: ['lpt_00010']

In this output, only 122 features carry a gene name, because only that many genes are named in the assembly (CP012019.1), whereas the Roary output identifies ~2500 gene names. How could I increase the number of gene names associated with locus tags?
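One way to get past the missing gene names might be to fall back on the product description of each CDS feature, since gyrA, for instance, appears in the record only as "DNA gyrase subunit A". The sketch below is a rough illustration, not a tested pipeline: the function names are made up for this example, the demo rows are typed in by hand, and it assumes the Roary Annotation text matches the GenBank product text apart from case and spacing, which will often not hold.

```python
import csv
import io
from collections import defaultdict


def cds_records(genbank_path):
    """Yield (locus_tag, gene, product) for each CDS in a single-record GenBank file."""
    from Bio import SeqIO  # lazy import: the mapping logic below runs without Biopython

    record = SeqIO.read(genbank_path, "genbank")
    for feature in record.features:
        if feature.type != "CDS":
            continue
        q = feature.qualifiers
        yield (
            q.get("locus_tag", [""])[0],
            q.get("gene", [""])[0],
            q.get("product", [""])[0],
        )


def normalise(text):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def map_roary_to_locus_tags(roary_rows, cds):
    """Map each Roary gene to locus tags: by gene name first, then by product text."""
    by_gene = defaultdict(list)
    by_product = defaultdict(list)
    for locus_tag, gene, product in cds:
        if gene:
            by_gene[gene.lower()].append(locus_tag)
        if product:
            by_product[normalise(product)].append(locus_tag)

    mapping = {}
    for row in roary_rows:
        gene = row["Gene"]
        tags = by_gene.get(gene.lower()) or by_product.get(normalise(row["Annotation"]), [])
        mapping[gene] = list(tags)  # a list, because several CDS can share one product
    return mapping


# Hand-typed demo data based on the examples in this thread (not read from the real files).
demo_cds = [
    ("lpt_00005", "dnaA", ""),
    ("lpt_06745", "", "DNA gyrase subunit A"),
]
demo_roary = list(csv.DictReader(io.StringIO(
    "Gene,Annotation\n"
    "macA,Macrolide export protein MacA\n"
    "gyrA,DNA gyrase subunit A\n"
    "dnaA,Chromosomal replication initiator protein DnaA\n"
)))
mapping = map_roary_to_locus_tags(demo_roary, demo_cds)
print(mapping)
```

With the real files, `cds_records('sequenceCP012019.1.gb')` would replace `demo_cds`. Generic products such as "hypothetical protein" will map to many locus tags, so any gene with more than one hit needs manual review.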

ADD REPLY

How would I go about increasing the number of associated gene names with locus tags?

If the annotations do not exist in the NCBI record, then you can't really do that.

ADD REPLY

Yeah, that is what I was thinking, too, but maybe I did not pose the question accurately. I need to think a bit more...

ADD REPLY

I did not check every gene, but most (perhaps all) seem to have lpt_* tags in the GenBank record you linked above. So perhaps it is a matter of mapping your Roary results correctly to these?

ADD REPLY

It's been a while since I did this, but I have a vague recollection that you may be able to force Roary to use the locus tags with an extra flag?

More broadly, I think trying to match up a database of gene names, locus tags, and annotations is fraught with potential for error. If there is some way to link up the sequences instead, the better alternative would be to do sequence lookups (finding 100% BLAST matches) and pair the genes off that way.
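As a rough illustration of pairing by sequence rather than by name, here is a minimal sketch that matches proteins by exact string identity. This is a simpler stand-in for a 100% BLAST match, not BLAST itself, and it misses hits that differ by even one residue. The function names, the ISO1_ identifiers, and the toy sequences are all invented for the example.

```python
def parse_fasta(text):
    """Return {first word of header: sequence} from FASTA-formatted text."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    # Join wrapped lines, upper-case, and drop a trailing stop-codon asterisk.
    return {n: "".join(parts).upper().rstrip("*") for n, parts in records.items()}


def pair_by_identical_sequence(reference, query):
    """Map each query ID to the reference IDs whose sequence is exactly identical."""
    by_seq = {}
    for ref_id, seq in reference.items():
        by_seq.setdefault(seq, []).append(ref_id)
    return {query_id: by_seq.get(seq, []) for query_id, seq in query.items()}


# Toy sequences, invented for illustration only.
reference_faa = """>lpt_00005 hypothetical header text
MKLTAVQ
ELLK
>lpt_06745
MSDNQRT
"""
query_faa = """>ISO1_00001
MKLTAVQELLK
>ISO1_00002
MSDNQRA
"""
pairs = pair_by_identical_sequence(parse_fasta(reference_faa), parse_fasta(query_faa))
print(pairs)
```

For real data, the reference would be the protein FASTA of the reference assembly and the query would be the protein FASTA of one isolate. An actual BLAST search with an identity and coverage cut-off would tolerate the small differences that exact matching rejects.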

ADD REPLY
3 months ago

Okay, I have partially answered my own question.

From NCBI, I downloaded the coding sequences file for the reference genome of interest. That file lists each gene together with its locus tag, so it contains the locus tags I want to associate with the genes in all of my genomes. I then ran Prokka on all my data (i.e., the genomes of my bacterial isolates), passing this file with the --proteins flag. Prokka generates unique locus tags for each genome, so I then used standard command-line tools (awk and sed) to extract, for each genome, only the lpt locus tag and the Prokka-generated locus tag associated with it.

The next step of my request turned out not to be needed, so I still don't know how to convert the reference genome's locus tags to Roary gene names, but c'est la vie :)
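For anyone who prefers Python to awk and sed, the extraction step could look roughly like the sketch below. It rests on an assumption I have not verified against Prokka's documentation: that the reference locus tag (lpt_NNNNN) appears somewhere in the attributes column of the CDS line, for example inside the inference field. The function name and the demo GFF lines are invented for illustration.

```python
import re

REF_TAG = re.compile(r"\blpt_\d+\b")               # reference locus tag, e.g. lpt_00005
LOCUS_TAG = re.compile(r"(?:^|;)locus_tag=([^;]+)")  # Prokka's per-genome locus tag


def prokka_to_reference_tags(gff_text):
    """Return {Prokka locus_tag: reference lpt tag} for CDS lines that mention both."""
    mapping = {}
    for line in gff_text.splitlines():
        if line.startswith("##FASTA"):
            break  # the sequence section follows; no more feature lines
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        attrs = cols[8]
        locus = LOCUS_TAG.search(attrs)
        ref = REF_TAG.search(attrs)  # ASSUMPTION: the lpt tag is present in the attributes
        if locus and ref:
            mapping[locus.group(1)] = ref.group(0)
    return mapping


# Invented demo lines; the real attribute layout may differ.
demo_gff = "\n".join([
    "##gff-version 3",
    "contig_1\tProdigal\tCDS\t752\t2110\t.\t+\t0\t"
    "ID=ISO1_00001;inference=similar to AA sequence:ref.faa:lpt_00005;locus_tag=ISO1_00001",
    "contig_1\tProdigal\tCDS\t2124\t3227\t.\t+\t0\t"
    "ID=ISO1_00002;locus_tag=ISO1_00002;product=hypothetical protein",
    "##FASTA",
    ">contig_1",
    "ACGT",
])
tag_pairs = prokka_to_reference_tags(demo_gff)
print(tag_pairs)
```

CDS lines without a reference hit are simply left out of the result, which makes it easy to count how many genes per genome received an lpt tag.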
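In case it helps someone later, one possible route would be to go through the per-isolate columns of gene_presence_absence.csv, which hold the Prokka locus tags of the sequences in each gene cluster. The sketch below assumes a lookup from Prokka locus tag to lpt tag already exists for one isolate (a hypothetical dict here), and that several IDs in one cell are separated by whitespace. The column name ISO1 and all identifiers are invented.

```python
import csv
import io


def roary_gene_to_reference(roary_csv_text, isolate_column, prokka_to_ref):
    """Return {Roary gene name: [reference locus tags]} using one isolate's column."""
    result = {}
    for row in csv.DictReader(io.StringIO(roary_csv_text)):
        cell = row.get(isolate_column) or ""
        prokka_ids = cell.split()  # ASSUMPTION: multiple IDs are whitespace-separated
        refs = [prokka_to_ref[p] for p in prokka_ids if p in prokka_to_ref]
        if refs:
            result[row["Gene"]] = refs
    return result


# Invented demo data: a cut-down CSV with a single isolate column named ISO1.
demo_csv = (
    '"Gene","Annotation","No. isolates","No. sequences","ISO1"\n'
    '"macA","Macrolide export protein MacA","241","241","ISO1_00420"\n'
    '"gyrA","DNA gyrase subunit A","241","241","ISO1_01350"\n'
    '"dnaA","Chromosomal replication initiator protein DnaA","241","241","ISO1_00001"\n'
)
demo_lookup = {"ISO1_00001": "lpt_00005", "ISO1_01350": "lpt_06745"}  # hypothetical
gene_to_lpt = roary_gene_to_reference(demo_csv, "ISO1", demo_lookup)
print(gene_to_lpt)
```

Genes whose Prokka ID has no reference hit (macA in this toy example) are omitted. Repeating the lookup over several isolates would show whether they agree on the lpt tag assigned to a given Roary gene.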

ADD COMMENT
