comparison between a selected genes of a new strain and the same genes of the reference genome
3
1
Entering edit mode
10 weeks ago
avinci1 ▴ 10

Dear all,

I would like to ask how I can see how much difference (in percentage) there is between a reference genome and a new strain of the same species.

In detail, I have sequenced a new strain of a certain organism and I already have the genome annotation (provided by the sequencing centre) of this new strain. I would like to perform a comparison between a selected genes of my new strain and the same genes of the reference genome. There are specific tools or pipelines that I can use to do this comparison or do I have to do it manually with BLAST or something similar?

Thank you all.

strain genome pipeline • 1.0k views
ADD COMMENT
0
Entering edit mode

The genes of interest are spliced or not? Also: any chance of gene duplications?

ADD REPLY
0
Entering edit mode

all I have is the annotated genes from the genomic DNA extraction and there are gene duplications

ADD REPLY
0
Entering edit mode

I don't see any reason not to do this with BLAST and keep it simple.

If you want to look at the similarity more generally e.g. in terms of core genomes etc, then a pipeline like roary would do the job.

ADD REPLY
0
Entering edit mode

BLAST is ok, it's just that I have 300 genes to check, that's why I asked for a pipeline or something faster. I will check roary, thank you!

ADD REPLY
0
Entering edit mode

You could just do a multi-BLAST and filter out the ones you want by ID after the fact. Just because it's BLAST doesn't mean you need to do them by hand by any means.

ADD REPLY
0
Entering edit mode

I will try, thank you!

ADD REPLY
2
Entering edit mode
10 weeks ago

Hi, I would suggest our own https://github.com/eead-csic-compbio/get_homologues. You can feed it Genbank files, as you would usually with bacteria, or FASTA files of encoded protein sequences. If you leave only the ~300 genes of interest in the input file of the new strain you can compute the average sequence identity with option -A. By default it will use BLASTP and protein sequences, but you can ask it to use nucleotide sequences instead with -a 'CDS'. There's a step by step tutorial and a manual.

ADD COMMENT
0
Entering edit mode

I will try, thank you!

ADD REPLY
0
Entering edit mode
9 weeks ago
Darked89 4.7k

To be on the safe side you may consider Best Reciprocal Hits for your genes/proteins of interest. See i.e. this article: Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2

You will need (if we stay at the protein level):

  1. fasta file proteome strain_A
  2. fasta file proteome_strain_B
  3. fasta file with 300 proteins of interest (strain_B)

The above article compares various sequence search programs, blastp and last being the most commonly used. Create say lastdb databases from #1 use #3 as a query. Get the top results for each of the proteins of interest, (strain_A proteins), save as fasta, search against #2.

You will need a command line executables, i.e. on Linux or WSL. Some parsing / scripting required.

ADD COMMENT
0
Entering edit mode

I will try! Thank you for the article, I will read it as soon as I can!

ADD REPLY
0
Entering edit mode
9 weeks ago

For a general estimate of genome similarity, you can use BBTools like this:

comparesketch.sh in=strain.fasta ref=reference.fasta

That will give you a good overview. It's an alignment-free method, so it is very fast. If you want specifics about how and where they differ, you need to use alignment, but for statistics like ANI and genome completeness, it works very well.

If you're dealing with proks, you can even add the "translate" flag to do gene-calling and compare them in amino acid space, which is a bit more sensitive.

ADD COMMENT
0
Entering edit mode

Thank you!!

ADD REPLY

Login before adding your answer.

Traffic: 2516 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6