Question

Find a human mRNA transcript with specific dinucleotide frequencies close to user input values

0

Entering edit mode

3.4 years ago

xiaoleiusc ▴ 140

Dear All,

Please forgive me for using layman language to describe my problem, I don't have enough computational or statistics background.

I have a list of all human mRNA transcripts, each with transcript name ID and dinucleotide frequencies (16 values containing frequencies of AA,AC,AG,AT,CA,CC,CG,CT,GA,GC,GG,GT,TA,TC,TG,TT) as the example here https://photos.app.goo.gl/BfiXgqHvWKjnA6KE9. I would like to find in the list which transcript contains the dinucleotide frequencies of user-given values. For example, If I have a set of di-nucleotide frequencies values (0.041247485 0.054325956 0.054325956 0.06639839 0.079476861 0.068410463 0.00804829 0.099597586 0.048289738 0.042253521 0.047283702 0.063380282 0.046277666 0.09054326 0.092555332 0.097585513), and I would like to find the one transcript in my list which contains the closest values (can be exactly the same values) to my input values. Is there a Python or R script for this purpose? If not, any suggestions of some background readings for solving this problem are greatly appreciated.

Thanks,

Xiao

Python RNA-seq R • 1.8k views

ADD COMMENT • link updated 2.2 years ago by Ram 45k • written 3.4 years ago by xiaoleiusc ▴ 140

0

Entering edit mode

I just realized that the script below may not work for you as written. If a file has multiple sequences, it will read them all and calculate 2N frequencies for the group. This won't be a problem for fasta files with single sequences, but for multiple sequences the code needs to be changed around if w == 1: so it reads one sequence at a time and compares it to desired 2N frequencies.

ADD REPLY • link 3.4 years ago by Mensur Dlakic ★ 29k

0

Entering edit mode

Thank you Mensur, you are right that this script is not for my purpose. In my case, no fasta sequences are involved. I already have a list of di-nucleotide frequencies of all the human cell mRNAs (I don't need to calculate frequencies). I just need to compare the distance of each "vector" in the list to a input vector with desired values, and find the vector in the list which has shortest "distance" to the input vector. I feel this is more of a "distance" comparison problem instead of di-nucleotide frequencies calculation problem.

ADD REPLY • link 3.4 years ago by xiaoleiusc ▴ 140

score 1 · Answer 1 · 2021-11-18

I already had python code for dinucleotide frequencies, so I simply added couple of lines to calculate the similarity between two vector numbers. When this code runs against GCA_016930515.1_ASM1693051v1_genomic.fna, these are dinucleotide frequencies it outputs:

AA,AC,AG,AT,CA,CC,CG,CT,GA,GC,GG,GT,TA,TC,TG,TT
0.1283484,0.0424221,0.0512048,0.1035982,0.0526092,0.0356933,0.0361471,0.0506783,0.0653872,0.0319605,0.0355826,0.0425519,0.0792283,0.0650529,0.0525437,0.1269918

And this is what compseq from EMBOSS calculates for the same file:

AA      513353          0.1283484       0.0625000       2.0535742
AC      169675          0.0424221       0.0625000       0.6787536
AG      204803          0.0512048       0.0625000       0.8192767
AT      414360          0.1035982       0.0625000       1.6575709
CA      210420          0.0526092       0.0625000       0.8417465
CC      142762          0.0356933       0.0625000       0.5710931
CG      144577          0.0361471       0.0625000       0.5783537
CT      202697          0.0506783       0.0625000       0.8108521
GA      261528          0.0653872       0.0625000       1.0461946
GC      127832          0.0319605       0.0625000       0.5113684
GG      142319          0.0355826       0.0625000       0.5693210
GT      170194          0.0425519       0.0625000       0.6808298
TA      316888          0.0792283       0.0625000       1.2676521
TC      260191          0.0650529       0.0625000       1.0408462
TG      210158          0.0525437       0.0625000       0.8406984
TT      507927          0.1269918       0.0625000       2.0318685

Now I added two distance measures (Euclidean & Manhattan) which equal zero when the two vectors are identical. There is also a cosine similarity which will be 1 when the two vectors are identical.

For the vector you listed above vs. the nucleotide file I used, this is what you get:

 Euclidean distance:    0.138805
 Manhattan distance:    0.463708
 Cosine similarity:     0.871000

If you comment out your frequency vector and substitute with actual frequencies for this nucleotide file, this is what you get:

 Euclidean distance:    0.000000
 Manhattan distance:    0.000000
 Cosine similarity:     1.000000

Save the code below as dinuc_composition_compare.py and run:

python dinuc_composition_compare.py GCA_016930515.1_ASM1693051v1_genomic.fna

It will save a file with dinucleotide frequencies which you may not need, so feel free to comment out that part (the line that starts with df_dic.to_csv).

#!/usr/bin/python
import os
from math import sqrt
from decimal import Decimal
from sklearn.metrics.pairwise import manhattan_distances, euclidean_distances
import argparse
import pandas as pd
import numpy as np
from Bio import SeqIO

#############################################################################################
# https://dataaspirant.com/five-most-popular-similarity-measures-implementation-in-python/
#############################################################################################
def square_rooted(x):
    return round(sqrt(sum([a*a for a in x])),3)

def cosine_similarity(x,y):
    numerator = sum(a*b for a,b in zip(x,y))
    denominator = square_rooted(x)*square_rooted(y)
    return round(numerator/float(denominator),3)

#############################################################################################
# https://www.geeksforgeeks.org/python-count-overlapping-substring-in-a-given-string/
#############################################################################################
def CountOccurrences(string, substring):

    '''
    ########################################################################
    str.count fails to count overlapping occurencies of the same word
    This function fixes that.
    ########################################################################
    '''
    # Initialize count and start to 0
    count = 0
    start = 0

    # Search through the string till
    # we reach the end of it
    while start < len(string):

        # Check if a substring is present from
        # 'start' position till the end
        pos = string.find(substring, start)

        if pos != -1:
            # If a substring is present, move 'start' to
            # the next position from start of the substring
            start = pos + 1

            # Increment the count
            count += 1
        else:
            # If no further substring is present
            break
    # return the value of count
    return count

NucLetter=['A','C','G','T']

#############################################################################################
def CalculateDinucComposition(NucleicSequence):
    '''
    ########################################################################
    Calculate the composition of dinucleotides for a given nucleic sequence.
    Usage:
    result=CalculateDinucComposition(nucleic)
    Input: nucleic is a pure nucleic sequence.
    Output: result is a dict form containing the composition of
    16 dinucleotides.
    ########################################################################
    '''

    len(NucleicSequence)
    Result={}
    for i in NucLetter:
        for j in NucLetter:
            Dinuc=i+j
            Result[Dinuc]=CountOccurrences(NucleicSequence,Dinuc)
    return Result

#############################################################################################

parser = argparse.ArgumentParser(
    description=
    '\n Calculate dinucleotide composition of nucleic sequences.\n',
    epilog='\n \n',
    formatter_class=argparse.RawDescriptionHelpFormatter)
parser.add_argument(
    'input_file', help='path to the input .fna file')

args = parser.parse_args()

if os.access(args.input_file, os.R_OK):
    FastaFile = open(args.input_file, 'rU')
else:
    parser.error(
        '\n !!! Input file "%s" does not exist in this directory !!!\n' %
        args.input_file)

id_code = args.input_file.rsplit('.', 1)[0]
csv_di_name = args.input_file.rsplit('.', 1)[0] + '_2N.csv'

w = 0
for rec in SeqIO.parse(FastaFile, 'fasta'):
    w = w + 1
    name = rec.id
    seq = str(rec.seq)
    DiC=CalculateDinucComposition(seq)
    if w == 1:
        df_dic = pd.DataFrame.from_dict(DiC,orient='index',dtype=np.int,columns=['count'])
    else:
        df_dicc = pd.DataFrame.from_dict(DiC,orient='index',dtype=np.int,columns=['count'])
        df_dic['count'] = df_dic['count'] + df_dicc['count']

aa_dic_sum = df_dic['count'].sum()
df_dic[id_code] = df_dic['count'] / aa_dic_sum

df_dic.drop(['count'], axis=1, inplace=True)
df_dic.sort_index(ascending=True, inplace=True)
dinuc_freqs = df_dic.values
df_dic = df_dic.transpose(copy=True)
df_dic.to_csv(csv_di_name, index=False, float_format='%.7f')
FastaFile.close()

# Compare two number vectors
target = np.array([0.041247485, 0.054325956, 0.054325956, 0.06639839, 0.079476861, 0.068410463, 0.00804829, 0.099597586, 0.048289738, 0.042253521, 0.047283702, 0.063380282, 0.046277666, 0.09054326, 0.092555332, 0.097585513])
#target = np.array([0.1283484,0.0424221,0.0512048,0.1035982,0.0526092,0.0356933,0.0361471,0.0506783,0.0653872,0.0319605,0.0355826,0.0425519,0.0792283,0.0650529,0.0525437,0.1269918])

print('\n Euclidean distance:\t%.6f' % euclidean_distances(dinuc_freqs.reshape(1, -1), target.reshape(1, -1)))
print(' Manhattan distance:\t%.6f' % manhattan_distances(dinuc_freqs.reshape(1, -1), target.reshape(1, -1)))
print(' Cosine similarity:\t%.6f\n' % cosine_similarity(dinuc_freqs, target))