I have two bacterial strains (alpha and beta) that have, respectively, 4480 and 4424 genes, or 4369 and 4313 CDS. I counted these by using a
grep -E '(gene|CDS).*\.\.' {file}.gb | wc -l
command (run first for genes and then for CDS). I then noticed that the GenBank file already contains these fields (e.g. CDSs (total) :: 4,480), which essentially confirmed my counts.
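The same count can be reproduced without grep by tallying feature keys directly. This is only a minimal sketch, assuming the standard GenBank flat-file layout (feature keys indented by exactly five spaces, qualifiers by twenty-one); `count_features` and `SAMPLE` are hypothetical names used for illustration, and the sample record is made up.

```python
import re
from collections import Counter

# A feature line is five spaces, the feature key, then the location.
# Qualifier lines are indented further, so they never match this pattern.
FEATURE_LINE = re.compile(r"^ {5}(\S+)\s+\S")

def count_features(genbank_text):
    """Return a Counter of feature keys (gene, CDS, ...) found in the FEATURES tables."""
    counts = Counter()
    in_features = False
    for line in genbank_text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if line.startswith("ORIGIN") or line.startswith("//"):
            in_features = False
            continue
        if in_features:
            match = FEATURE_LINE.match(line)
            if match:
                counts[match.group(1)] += 1
    return counts

# Made-up miniature record, for illustration only.
SAMPLE = "\n".join([
    "LOCUS       TEST",
    "FEATURES             Location/Qualifiers",
    "     source          1..100",
    "     gene            1..50",
    '                     /gene="abcA"',
    "     CDS             1..50",
    '                     /gene="abcA"',
    "     gene            complement(60..90)",
    '                     /locus_tag="X_0002"',
    "ORIGIN",
    "        1 acgt",
    "//",
])

counts = count_features(SAMPLE)
print(counts["gene"], counts["CDS"])
```

On a real file the text would be read with something like `open("alpha.gb").read()` (the file name is an assumption). Counting by feature key rather than by a loose pattern avoids accidentally matching the words "gene" or "CDS" inside qualifier text.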
Now, alpha and beta differ by 56 genes/CDS, and I would like to know which genes in particular are missing in each strain with respect to the other. I grepped the
/gene=
qualifier in each strain's GenBank file, pasted the output into a spreadsheet, removed duplicates, and used the spreadsheet's VLOOKUP function to check which genes were present in both strains.
This way, I can account for only 46 genes, so 10 of the expected 56 are still missing.
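The spreadsheet comparison described above can also be sketched as a script, which makes it easier to see what the /gene= approach does and does not count. This is a rough sketch, not a definitive method: it assumes the standard GenBank flat-file indentation, looks only at `gene` features, keeps duplicate names instead of removing them, and separately tallies gene features that carry no /gene= qualifier at all. The names `gene_name_counts`, `ALPHA`, and `BETA` are hypothetical, and the two miniature records are made up.

```python
import re
from collections import Counter

FEATURE_KEY = re.compile(r"^ {5}(\S+)\s+\S")      # e.g. "     gene            1..50"
GENE_NAME = re.compile(r'^\s+/gene="([^"]+)"')    # e.g. '      /gene="abcA"'

def gene_name_counts(genbank_text):
    """Return (Counter of /gene= names on gene features, number of unnamed gene features)."""
    names = Counter()
    unnamed = 0
    in_gene = False   # currently inside a 'gene' feature block
    named = False     # has the current gene feature shown a /gene= qualifier?
    # The trailing "//" guarantees that the last feature block is closed.
    for line in genbank_text.splitlines() + ["//"]:
        key = FEATURE_KEY.match(line)
        if key or not line.startswith(" "):
            # A new feature or a new section ends the previous gene block.
            if in_gene and not named:
                unnamed += 1
            in_gene = bool(key) and key.group(1) == "gene"
            named = False
            continue
        name = GENE_NAME.match(line)
        if in_gene and name and not named:
            names[name.group(1)] += 1
            named = True
    return names, unnamed

# Made-up miniature records, for illustration only.
ALPHA = "\n".join([
    "FEATURES             Location/Qualifiers",
    "     gene            1..50",
    '                     /gene="abcA"',
    "     CDS             1..50",
    '                     /gene="abcA"',
    "     gene            60..90",
    '                     /gene="abcA"',
    "     gene            100..150",
    '                     /locus_tag="A_0003"',
    "     gene            200..250",
    '                     /gene="xyzB"',
    "ORIGIN",
    "//",
])
BETA = "\n".join([
    "FEATURES             Location/Qualifiers",
    "     gene            1..50",
    '                     /gene="abcA"',
    "     gene            200..250",
    '                     /gene="xyzB"',
    "     gene            300..350",
    '                     /gene="defC"',
    "//",
])

names_a, unnamed_a = gene_name_counts(ALPHA)
names_b, unnamed_b = gene_name_counts(BETA)

# Counter subtraction keeps copy-number differences that a de-duplicated list would hide.
only_in_alpha = names_a - names_b
only_in_beta = names_b - names_a
print(dict(only_in_alpha), dict(only_in_beta), unnamed_a, unnamed_b)
```

In this toy example the second copy of `abcA` and the gene that has only a /locus_tag= would both be invisible to a de-duplicated /gene= list. Whether either effect explains the gap in the real files is not something the sketch can tell; it only shows where such differences could be looked for.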
My questions are:
- What did I get wrong? Am I missing another term for 'gene' recorded in the GenBank format?
- Is there a more canonical (and precise) way of comparing the gene profiles of two strains? Is there a dedicated bioinformatics tool for this job?
Thank you
Thank you, the similarity is indeed over 99%, but how do I concatenate the proteome?