Entering edit mode
10 months ago
Manuel
▴
10
Let's say I have n fasta files with the coding sequences of bacteria of same species. I want to obtain a similarity score to understand how similar they are.
My current code, combines all the fasta files into one fasta.
make_diamond_cmd = f'./diamond makedb --threads 2 --in combined.fasta -d database.dmnd'
print("Creating Diamond database...")
_ = subprocess.check_call(make_diamond_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
diamond_db = f'database.dmnd'
# running alignment
diamond_cmd = f'./diamond blastp --threads 2 --sensitive -d {diamond_db} -q combined.fasta -o results.tsv '
print("Running Diamond...")
_ = subprocess.check_call(diamond_cmd, shell=True, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
# Read the dataframe from results.tsv
df = pd.read_csv('results.tsv', sep='\t')
# Remove the entries where column 0 is equal to column 1
df = df[df.iloc[:, 0] != df.iloc[:, 1]]
df.iloc[:, 2].plot.hist()
This code also accounts for the similarity within proteins of one genome, I want to compare across the genomes and get a similarity measure.
Best Regards,
Manuel
It is not so easy, you need to filter secondary hits, and then set some thresholds to consider if 2 proteins are similar or not which is tricky as many protein families can be highly variable. Then you need to consider duplications and deletions.