Question

Opinion about a genome comparison pipeline.

1

4.7 years ago

K.Gee ▴ 40

Hello everybody.

As I started studying bioinformatics only recently, I need an expert to validate whether my "pipeline" is correct.

Let's say we have two genomes, A and B, and from each genome a group of encoded genes: Ax = 20 and Bx = 15, respectively.

To find the non-shared genes between those two genomes:

I created a DB of the genes of genome A (the Ax database) and blasted Bx against Ax. I used these commands to count the number of shared genes.

for the genome A

awk '{print $1}' blast_result.txt | sort -u | wc -l --> 10 genes

and for the genome B

awk '{print $2}' blast_blast.txt | sort -u | wc -l --> 8 genes

After that, I ran tblastn of the Ax genes against genome B, which gave --> 1 hit

and of the Bx genes against genome A, which gave --> 2 hits

So the actual number of non-shared genes between genomes A and B is:

For genome A: 20 - (10 + 1) = 9 genes not shared with genome B

For genome B: 15 - (8 + 2) = 5 genes not shared with genome A
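The counting and the arithmetic above can be wrapped in two small shell functions. This is only a minimal sketch: it assumes the BLAST results are in tabular format (`-outfmt 6`), where column 1 is the query ID and column 2 the subject ID, so which genome each column belongs to depends on which gene set was used as the query. The function names `count_unique` and `non_shared` are made up for illustration.

```shell
#!/bin/sh
# Count the distinct IDs found in one column of a BLAST tabular file.
# Assumption: -outfmt 6, i.e. column 1 = query ID, column 2 = subject ID.
count_unique() {
    # $1 = column number, $2 = BLAST tabular file
    awk -v col="$1" '{ print $col }' "$2" | sort -u | wc -l | tr -d ' '
}

# Non-shared genes = total genes - (shared by blastp + extra tblastn hits)
non_shared() {
    # $1 = total genes, $2 = genes shared by blastp, $3 = additional tblastn hits
    echo $(( $1 - ($2 + $3) ))
}

non_shared 20 10 1   # genome A, prints 9
non_shared 15 8 2    # genome B, prints 5
```

With a real result file, `count_unique 1 blast_result.txt` would give the number of distinct query genes with at least one hit, and `count_unique 2 blast_result.txt` the number of distinct subject genes.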

Is this pipeline correct? Thanks in advance!!!

genome genes comparisson • 1.3k views

ADD COMMENT • link updated 4.7 years ago by lieven.sterck 15k • written 4.7 years ago by K.Gee ▴ 40

0

That looks reasonable. You're essentially describing reciprocal best BLAST hits. You might want to read up on this as a technique and see if you can improve on your approach at all.
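For illustration, reciprocal best hits could be extracted roughly like this. It is a sketch under assumptions, not a reference implementation: it assumes two tab-separated BLAST files in `-outfmt 6` (one A-vs-B, one B-vs-A), takes the "best" hit per query to be the one with the highest bit score (column 12), and keeps the first hit on ties. The function names are hypothetical.

```shell
#!/bin/sh
# Print "query<TAB>best_subject" for every query in a BLAST tabular file.
# Assumption: -outfmt 6, bit score in column 12; the first hit wins on ties.
best_hits() {
    awk -F'\t' '!($1 in best) || $12 + 0 > score[$1] { best[$1] = $2; score[$1] = $12 + 0 }
         END { for (q in best) print q "\t" best[q] }' "$1" | sort
}

# Print "geneA<TAB>geneB" pairs that are each other's best hit.
reciprocal_best_hits() {
    # $1 = A-vs-B BLAST file, $2 = B-vs-A BLAST file
    rbh_ab=$(mktemp); rbh_ba=$(mktemp)
    best_hits "$1" > "$rbh_ab"
    # Swap the columns of the B-vs-A best hits so both lists read geneA, geneB
    best_hits "$2" | awk -F'\t' '{ print $2 "\t" $1 }' | sort > "$rbh_ba"
    comm -12 "$rbh_ab" "$rbh_ba"   # keep only the pairs present in both lists
    rm -f "$rbh_ab" "$rbh_ba"
}

# Usage (file names are placeholders):
# reciprocal_best_hits A_vs_B.tsv B_vs_A.tsv
```

Genes of either genome that never show up in such a pair are candidates for the non-shared set, although real analyses usually add e-value or coverage cut-offs first.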

ADD REPLY • link 4.7 years ago by Joe 22k

0

Thanks a lot for the response and for the protocol :-)!!!

ADD REPLY • link 4.7 years ago by K.Gee ▴ 40

0

How far apart are these genomes in evolutionary terms? If they are very similar/close, using CD-HIT to identify a redundant set of genes should leave the differences behind, especially if you have protein sequences.

ADD REPLY • link 4.7 years ago by GenoMax 148k

0

The other question here is: why exactly create this pipeline? If you're looking for accessory genes, lots of pangenome tools already exist that can give you this information.

ADD REPLY • link 4.7 years ago by Joe 22k

0

I will check to see whether I get the same numbers as with my approach!! No, I am not looking for anything special. I'm just trying to understand the "bioinformatician way of thinking": combining a few command-line tools, a few Perl scripts (even some basic ones), looking at some software, etc. It's a new, chaotic world for me :P. Thank you so much for your suggestions.

ADD REPLY • link 4.7 years ago by K.Gee ▴ 40

0

Cool. I will check with your suggested tool :D. Right now I'm testing some of the knowledge that I have learned. I'm doing random blasts just to understand how the materials work. The theory is good, but it is completely different when you apply what you have learnt!!! Thanks a lot :D

ADD REPLY • link 4.7 years ago by K.Gee ▴ 40

score 2 · Accepted Answer · 2020-04-30

Yep, that seems OK to me. Nice to see, btw, that you considered doing a tblastn search on top of the blastp (:thumbs_up:)

The only thing you might need to check is whether the tblastn hit is already present in the result of your blastp. Or do you only tblastn the proteins not present in the blastp output? If so, then it's OK.
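One way to do that check is a set difference on the query IDs, sketched below. It assumes both result files are tab-separated BLAST tabular output (`-outfmt 6`) with the gene/protein ID of the same gene set in column 1; the function name `tblastn_only` is made up for illustration.

```shell
#!/bin/sh
# List the query IDs that have a tblastn hit but no blastp hit.
# Assumption: both files are -outfmt 6 with the query ID in column 1.
tblastn_only() {
    # $1 = blastp tabular file, $2 = tblastn tabular file
    ids_bp=$(mktemp); ids_tn=$(mktemp)
    cut -f1 "$1" | sort -u > "$ids_bp"
    cut -f1 "$2" | sort -u > "$ids_tn"
    comm -13 "$ids_bp" "$ids_tn"   # lines that occur only in the tblastn list
    rm -f "$ids_bp" "$ids_tn"
}

# Usage (file names are placeholders):
# tblastn_only blastp_result.txt tblastn_result.txt | wc -l
```

Only the IDs printed here should be added to the blastp count; anything found by both searches would otherwise be counted twice.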

Though correct, this is a very basic approach. As a more advanced approach, you could consider running a protein clustering (gene families) tool on the reciprocal blastp output, something like InParanoid, OrthoFinder, ... Parsing those outputs will also show you the non-shared, or species-specific, genes. The downside here is that including the genes missed in the annotations (i.e. the ones you picked up with tblastn) will be a bit harder.