Question

How to compare the gene profile of two organisms (what genes are in common) using genBank?

0

Entering edit mode

4 days ago

marongiu.luigi ▴ 730

I have two bacterial strains (alpha and beta) that have, respectively 4480 and 4424 genes, or 4369 and 4313 CDS. I counted these by using a

grep '(gene|CDS)*.\.\. {file}.gb | wc -l

(meaning I first looked at genes and then CDS) function but then I noticed that the GenBank file already contains these fields (e.g. CDSs (total) :: 4,480), which essentially confirmed my counts.

Now, alpha and beta differ for 56 genes/CDS and I would like to know what genes in particular are missing in each strain with respect to the other strain. I grepped the

/gene=

instance for each strain's gb file, sent the output to a spreadsheet and work on this to remove duplicates and check what genes where present in both strain using the vlookup function of the spreadsheet.

In this way, I have a count of 46 genes. There are 10 genes missing.

My questions are:

What did I get wrong? Am I missing another term for 'gene' recorded in the GenBank format?
Is there a more canonical (and precise) way of comparing the gene profiles of two strains? Is there a real bioinformatic tool for this job?

Thank you

genes genbank genome comparison • 279 views

ADD COMMENT • link 2 days ago by marongiu.luigi ▴ 730

score 0 · Answer 1 · 2024-12-18

There should be multiple ways of doing this. I will suggest one of them.

First, it is clunky and likely not very precise to do this from GenBank files, as only the sequence comparison will properly answer your question. I assume you have predicted protein sequences for your strains in FASTA format. I also assume that your strains are very similar, and that sequence identity is 95+% for all shared proteins.

We start by concatenating the two proteomes, and then cluster all the sequences. There is a program for it called cd-hit:

https://github.com/weizhongli/cdhit

A command would be something like this:

cd-hit -i combined_proteome.fas -o combined_proteome_95.fas -c 0.95

All protein sequences they share at 95+% sequence identity would be in file combined_proteome_95.fas. But you want sequences they don't share, so there is an auxiliary file named combined_proteome_95.fas.clstr - basically the same name as given with the -o switch + .clstr. That file lists all the clusters, and you would have to go through them and find those clusters that have a single member.