How to compare the gene profile of two organisms (what genes are in common) using genBank?
1
0
Entering edit mode
4 days ago

I have two bacterial strains (alpha and beta) that have, respectively 4480 and 4424 genes, or 4369 and 4313 CDS. I counted these by using a

grep '(gene|CDS)*.\.\. {file}.gb | wc -l

(meaning I first looked at genes and then CDS) function but then I noticed that the GenBank file already contains these fields (e.g. CDSs (total) :: 4,480), which essentially confirmed my counts.

Now, alpha and beta differ for 56 genes/CDS and I would like to know what genes in particular are missing in each strain with respect to the other strain. I grepped the

/gene=

instance for each strain's gb file, sent the output to a spreadsheet and work on this to remove duplicates and check what genes where present in both strain using the vlookup function of the spreadsheet.

In this way, I have a count of 46 genes. There are 10 genes missing.

My questions are:

  1. What did I get wrong? Am I missing another term for 'gene' recorded in the GenBank format?
  2. Is there a more canonical (and precise) way of comparing the gene profiles of two strains? Is there a real bioinformatic tool for this job?

Thank you

genes genbank genome comparison • 279 views
ADD COMMENT
0
Entering edit mode
4 days ago
Mensur Dlakic ★ 28k

There should be multiple ways of doing this. I will suggest one of them.

First, it is clunky and likely not very precise to do this from GenBank files, as only the sequence comparison will properly answer your question. I assume you have predicted protein sequences for your strains in FASTA format. I also assume that your strains are very similar, and that sequence identity is 95+% for all shared proteins.

We start by concatenating the two proteomes, and then cluster all the sequences. There is a program for it called cd-hit:

https://github.com/weizhongli/cdhit

A command would be something like this:

cd-hit -i combined_proteome.fas -o combined_proteome_95.fas -c 0.95

All protein sequences they share at 95+% sequence identity would be in file combined_proteome_95.fas. But you want sequences they don't share, so there is an auxiliary file named combined_proteome_95.fas.clstr - basically the same name as given with the -o switch + .clstr. That file lists all the clusters, and you would have to go through them and find those clusters that have a single member.

ADD COMMENT
0
Entering edit mode

Thank you, the similarity is indeed over 99% but how do I concatenate the proteome?

ADD REPLY

Login before adding your answer.

Traffic: 1871 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6