Question

How to obtain similarity score between proteomes?

0

Entering edit mode

10 months ago

Manuel ▴ 10

Let's say I have n fasta files with the coding sequences of bacteria of same species. I want to obtain a similarity score to understand how similar they are.

My current code, combines all the fasta files into one fasta.

make_diamond_cmd = f'./diamond makedb --threads 2 --in combined.fasta -d database.dmnd'
print("Creating Diamond database...")
_ = subprocess.check_call(make_diamond_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
diamond_db = f'database.dmnd'

 # running alignment
diamond_cmd = f'./diamond blastp --threads 2 --sensitive -d {diamond_db} -q combined.fasta -o results.tsv '
print("Running Diamond...")
_ = subprocess.check_call(diamond_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
# Read the dataframe from results.tsv
df = pd.read_csv('results.tsv', sep='\t')

# Remove the entries where column 0 is equal to column 1
df = df[df.iloc[:, 0] != df.iloc[:, 1]]

df.iloc[:, 2].plot.hist()

This code also accounts for the similarity within proteins of one genome, I want to compare across the genomes and get a similarity measure.

Best Regards,

Manuel

Proteome Blast Blastp • 418 views

ADD COMMENT • link updated 10 months ago by JC 13k • written 10 months ago by Manuel ▴ 10

0

Entering edit mode

It is not so easy, you need to filter secondary hits, and then set some thresholds to consider if 2 proteins are similar or not which is tricky as many protein families can be highly variable. Then you need to consider duplications and deletions.

ADD REPLY • link 10 months ago by JC 13k