How to obtain similarity score between proteomes?
0
0
Entering edit mode
10 months ago
Manuel ▴ 10

Let's say I have n fasta files with the coding sequences of bacteria of same species. I want to obtain a similarity score to understand how similar they are.

My current code, combines all the fasta files into one fasta.

make_diamond_cmd = f'./diamond makedb --threads 2 --in combined.fasta -d database.dmnd'
print("Creating Diamond database...")
_ = subprocess.check_call(make_diamond_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
diamond_db = f'database.dmnd'

 # running alignment
diamond_cmd = f'./diamond blastp --threads 2 --sensitive -d {diamond_db} -q combined.fasta -o results.tsv '
print("Running Diamond...")
_ = subprocess.check_call(diamond_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
# Read the dataframe from results.tsv
df = pd.read_csv('results.tsv', sep='\t')

# Remove the entries where column 0 is equal to column 1
df = df[df.iloc[:, 0] != df.iloc[:, 1]]

df.iloc[:, 2].plot.hist()

This code also accounts for the similarity within proteins of one genome, I want to compare across the genomes and get a similarity measure.

Best Regards,

Manuel

Proteome Blast Blastp • 416 views
ADD COMMENT
0
Entering edit mode

It is not so easy, you need to filter secondary hits, and then set some thresholds to consider if 2 proteins are similar or not which is tricky as many protein families can be highly variable. Then you need to consider duplications and deletions.

ADD REPLY

Login before adding your answer.

Traffic: 2461 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6