Need some help to develop some python code to analyze kmer data
1
0
Entering edit mode
11 months ago
schlogl ▴ 160

I am working in a code for doing some statistical analysis in kmers. It is kind of over-underrepresented kmers in dna sequences. The statistical implementation takes in consideration all sub-words that are component inside the kmer/n-grams/n-tuples in analysis, like dimers (two letters words), trimers (three letters words), tetramers... I will called simply words1 and words2

I need help to implement a function or functions to help me out to:

1- receive as input two dna sequences (string1 and string2) and one alphabet (ACGT).

2- Compare the characters present in string2 that are present in string1 and get the index of each character present. I have this function for that:

  def get_shared_idxs(word1, word2):
    """
    > kmer = "ACGCAT"
    > get_shared_idxs(kmer, "AA")
    [0, 4]
    """
    shared = []
    for i, c in enumerate(word1):
        if c in word2:
            shared.append(i)
    return shared

2- Get a combinations of sub-words, for example: list_sub_words = ['AC','AG','AC','AA','AT','CG','CC','CA','CT', 'GC','GA','GT','CA','CT','AT']. I have this function to get all the subwords(words2) I need

def get_kmer_combo(kmer, alphabet, repeats=1):
    """
    > alphabet = "ACGT"
    > kmer = "ACGCAT"
    >get_kmer_combo(kmer, alphabet, repeats=1)
    ['AC', 'AG', 'AC', 'AA', 'AT', 'CG', 'CC', 'CA', 
    'CT', 'GC', 'GA', 'GT', 'CA', 'CT', 'AT']
    """
    return ["".join(c) for c in combinations(kmer, r=repeats)]

3- Then I compare both words (words1 and words2) and find in words2 the gap positions represented by index missing in words2 that are not in word1. For example, word2 = "AA", indices that are in word1 [0,4], so to AA x ACGCAT -> A CGC A It miss indices [1,2,3]. I have this function:

    def get_missing_idxs(word1, present_idxs):
        kmer_idxs = [i for i in range(len(word1))]
        return set(kmer_idxs).difference(set(present_idxs))
# for "AA"
    missing_index = [i for i in get_missing_idxs(kmer, [0,4]) if i < max([0,4])]
[1, 2, 3]

3- Then with the missing indexes I can generate generate combinations with the alphabet letters (according to the length of the missing positions in the case of "AA" = [1, 2, 3] ) and use it to fulfill the gaps or the missing positions in string2 (AA to get AnnnA = [AAAAA, AAACA, AAAGA, AAATA,AACAA...] until return the 4^3 combinations.

4- If for example the indices are AC = [0,1] I check for this consecutive indices

def is_consecutive(indices):
    """
    > is_consecutive([0,2,3]), is_consecutive([2,3,4])
    (False, True)
    """
    return sorted(indices) == list(range(min(indices), max(indices) + 1))

If the indices are consecutives I return the word2 as it is or I do not need fill it.

I could hard code some functions, but I want to generalize the code.Because I need all duplets, triplets, quadruplets and pentamers, from the word1 as show in the examples below. Take a look in some examples:

# number of combinations is the number of sub-words needed

**string1** = ACGCAT 
**indices** = [0,1,2,3,4,5]
**alphabet** = "ACGT"

**string2**    **indices**    `number of combinations`
  AC            [0,1]          1
  AG            [0,2]          4 # AAG ACG AGG ATG
  AC            [0,3]         16 # AAAC AACC AAGC AATC ...
  AA            [0,4]         64
  AT            [0,5]        256
  CA            [1,4]         16
  CT            [1,5]         64
  CG            [1,2]          1
  CC            [1,3]          4
  GC            [2,3]          1
  GA            [2,4]          4
  GT            [2,5]         16
  CA            [1,4]         16
  CT            [1,5]         64
  AC            [0,3]         16
  ACG           [0,1,2]        1
  ACC           [0,1,3]        4
  ACA           [0,1,4]       16
  ACT           [0,1,5]       64
  AGC           [0,2,3]        4
  AGA           [0,2,4]       16
  AGT           [0,2,5]       64
  ACA           [0,1,4]       16
  ACT           [0,1,5]       64
  AAT           [0,4,5]       64
  CGC           [1,2,3]        1
  CGA           [1,2,4]        4
  CGT           [1,2,5]       16
  CnA           [1,3,4]        4
  CCT           [1,3,5]       16
  CAT           [1,4,5]       16
  GCA           [2,3,4]        1
  GCT           [2,3,5]        4
  GAT           [2,4,5]        4
  CAT           [3,4,5]        1
  ACGC          [0,1,2,3]      1
  ACGA          [0,1,2,4]      4
  ACGT          [0,1,2,5]     16
  ACCA          [0,1,3,4]      4 
  ACCT          [0,1,3,5]     16
  ACAT          [0,1,4,5]     16
  AGCA          [0,2,3,4]      4
  AGCT          [0,2,3,5]     16
  AGAT          [0,2,4,5]     16
  ACAT          [0,1,4,5]     16
  CGCA          [1,2,3,4]      1
  CGCT          [1,2,3,5]      4
  CGAT          [1,2,4,5]      4
  CCAT          [1,3,4,5]      4
  GCAT          [2,3,4,5]      1
  ACGCA         [0,1,2,3,4]    1
  ACGCT         [0,1,2,3,5]    4
  ACGAT         [0,1,2,4,5]    4
  ACCAT         [0,1,3,4,5]    4
  AGCAT         [0,2,3,4,5]    4
  CGCAT         [1,2,3,4,5]    1

If you guys have some tips and want to share it, all help will be very well accepted.

Thank you for your time and help.

Observation: The "ns" were inserted in the string2 are just illustrative and they are not in the real sequences. They are the missing positions or gaps I need to fulfill with the letters of the alphabet.

string2    indices    number of combinations(4^n)
  AC        [0,1]          1 # consecutive indices return AC
  AnG       [0,2]          4 # AAG,ACG,AGG,ATG
  AnnC      [0,3]         16 #AAAC AACC AAGC AATC ACAC ACCC...
  AnnnA     [0,4]         64 # AAAAA AAACA AAAGA AAATA ...
  AnnnnT    [0,5]        256 # AAAAAT AAAACT AAAAGT AAAATT...
  CnnA      [1,4]         16
  CnnnT     [1,5]         64
  CG        [1,2]          1
  CnC       [1,3]          4
  GC        [2,3]          1
  GnA       [2,4]          4
  GnnT      [2,5]         16
  CnnA      [1,4]         16
  CnnnT     [1,5]         64
  AnnC      [0,3]         16
  ACG       [0,1,2]        1
  ACnC      [0,1,3]        4
  ACnnA     [0,1,4]       16
  ACnnnT    [0,1,5]       64
  AnGC      [0,2,3]        4
  AnGnA     [0,2,4]       16
  AnGnnT    [0,2,5]       64
  ACnnA     [0,1,4]       16
  ACnnnT    [0,1,5]       64
  AnnnAT    [0,4,5]       64
  CGC       [1,2,3]        1
  CGnA      [1,2,4]        4
  CGnnT     [1,2,5]       16
  CnCA      [1,3,4]        4
  CnCnT     [1,3,5]       16
  CnnAT     [1,4,5]       16
  GCA       [2,3,4]        1
  GCnT      [2,3,5]        4
  GnAT      [2,4,5]        4
  CAT       [3,4,5]        1
  ACGC      [0,1,2,3]      1
  ACGnA     [0,1,2,4]      4
  ACGnnT    [0,1,2,5]     16
  ACnCA     [0,1,3,4]      4 
  ACnCnT    [0,1,3,5]     16
  ACnnAT    [0,1,4,5]     16
  AnGCA     [0,2,3,4]      4
  AnGCnT    [0,2,3,5]     16
  AnGnAT    [0,2,4,5]    16
  ACnnAT    [0,1,4,5]    16
  CGCA      [1,2,3,4]     1
  CGCnT     [1,2,3,5]     4
  CGnAT     [1,2,4,5]     4
  CnCAT     [1,3,4,5]     4
  GCAT      [2,3,4,5]     1
  ACGCA     [0,1,2,3,4]   1
  ACGCnT    [0,1,2,3,5]   4
  ACGnAT    [0,1,2,4,5]   4
  ACnCAT    [0,1,3,4,5]   4
  AnGCAT    [0,2,3,4,5]   4
  CGCAT     [1,2,3,4,5]   1
Python R • 515 views
ADD COMMENT
0
Entering edit mode
11 months ago

I think you can look at a known algorithm such as knuth-morris-pratt for some inspiration. Aho-corasick might be worth a look as well.

eg https://github.com/TheAlgorithms/Python/blob/master/strings/knuth_morris_pratt.py

You can look at over and under-representation of kmers using a chi-squared stats test.

ADD COMMENT
0
Entering edit mode

colindaven Hi there. Thanks for your replay. The statistics I am trying to emulate are the ones karlin and co-workers karlin, Elhai Elhai used a while ago and also Pevzners et al., where they take in account all sub-words in consideration.

ADD REPLY

Login before adding your answer.

Traffic: 3099 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6