Question

comparison between a selected genes of a new strain and the same genes of the reference genome

1

Entering edit mode

10 weeks ago

avinci1 ▴ 10

Dear all,

I would like to ask how I can see how much difference (in percentage) there is between a reference genome and a new strain of the same species.

In detail, I have sequenced a new strain of a certain organism and I already have the genome annotation (provided by the sequencing centre) of this new strain. I would like to perform a comparison between a selected genes of my new strain and the same genes of the reference genome. There are specific tools or pipelines that I can use to do this comparison or do I have to do it manually with BLAST or something similar?

Thank you all.

strain genome pipeline • 1.0k views

ADD COMMENT • link 9 weeks ago by avinci1 ▴ 10

0

Entering edit mode

The genes of interest are spliced or not? Also: any chance of gene duplications?

ADD REPLY • link 10 weeks ago by Darked89 4.7k

0

Entering edit mode

all I have is the annotated genes from the genomic DNA extraction and there are gene duplications

ADD REPLY • link 10 weeks ago by avinci1 ▴ 10

0

Entering edit mode

I don't see any reason not to do this with BLAST and keep it simple.

If you want to look at the similarity more generally e.g. in terms of core genomes etc, then a pipeline like roary would do the job.

ADD REPLY • link 10 weeks ago by Joe 21k

0

Entering edit mode

BLAST is ok, it's just that I have 300 genes to check, that's why I asked for a pipeline or something faster. I will check roary, thank you!

ADD REPLY • link 10 weeks ago by avinci1 ▴ 10

0

Entering edit mode

You could just do a multi-BLAST and filter out the ones you want by ID after the fact. Just because it's BLAST doesn't mean you need to do them by hand by any means.

ADD REPLY • link 10 weeks ago by Joe 21k

0

Entering edit mode

I will try, thank you!

ADD REPLY • link 9 weeks ago by avinci1 ▴ 10

score 2 · Answer 1 · 2024-09-17

2

Entering edit mode

10 weeks ago

b.contreras.moreira ▴ 310

Hi, I would suggest our own https://github.com/eead-csic-compbio/get_homologues. You can feed it Genbank files, as you would usually with bacteria, or FASTA files of encoded protein sequences. If you leave only the ~300 genes of interest in the input file of the new strain you can compute the average sequence identity with option -A. By default it will use BLASTP and protein sequences, but you can ask it to use nucleotide sequences instead with -a 'CDS'. There's a step by step tutorial and a manual.

ADD COMMENT • link 10 weeks ago by b.contreras.moreira ▴ 310

0

Entering edit mode

I will try, thank you!

ADD REPLY • link 9 weeks ago by avinci1 ▴ 10

score 0 · Answer 2 · 2024-09-18

To be on the safe side you may consider Best Reciprocal Hits for your genes/proteins of interest. See i.e. this article: Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2

You will need (if we stay at the protein level):

fasta file proteome strain_A
fasta file proteome_strain_B
fasta file with 300 proteins of interest (strain_B)

The above article compares various sequence search programs, blastp and last being the most commonly used. Create say lastdb databases from #1 use #3 as a query. Get the top results for each of the proteins of interest, (strain_A proteins), save as fasta, search against #2.

You will need a command line executables, i.e. on Linux or WSL. Some parsing / scripting required.

score 0 · Answer 3 · 2024-09-18

0

Entering edit mode

9 weeks ago

Brian Bushnell 20k

For a general estimate of genome similarity, you can use BBTools like this:

comparesketch.sh in=strain.fasta ref=reference.fasta

That will give you a good overview. It's an alignment-free method, so it is very fast. If you want specifics about how and where they differ, you need to use alignment, but for statistics like ANI and genome completeness, it works very well.

If you're dealing with proks, you can even add the "translate" flag to do gene-calling and compare them in amino acid space, which is a bit more sensitive.