Question

Trouble locating a known motif on (-) strand: any hints for a beginner?

0

Entering edit mode

6.5 years ago

Christian • 0

Hello everyone.

This question may be rather dumb, but at this point I got really confused and a little stuck. I am a master's student in molecular biology and relatively inexperienced when it comes to computational problems. However, I wanted to give it a try to increase my range of skill.

My problem is as follows: I have a set of PSSMs and used the tool PoSSuMSearch to find matches for my genome of interest. I am interested in the neighborhood of the matches I found, so I wanted to use the positional coordinates given from the PSSM search to extract my region of interest from a genome file. This works fine if the motif I found is located on the forward strand. Troubles arises somewhere in my handling of the reverse strand: According to the manual file from PoSSumSearch (see figure on page 17), the position for each motif is located on 'left-most position of the motif'. It should look something like this:

#      | motif_on_plus
#    - - - - - - - - - - ->  (+) strand
#    0 1 2 3 4 5 6 7 8 9     position in genome 
#   <- - - - - - - - - -     (-) strand
#          | motif_on_minus

In that case my motif on the (+) strand would range from position 1 to 1+length(motif). I expect my motif on the (-) strand to range from 3 to 3+length(motif). I use some Python to check if that is valid:

from Bio import SeqIO

genome = SeqIO.read(genome_file, "genbank")

# Examplary lines of my data file after some sorting and filtering steps
example_lines = ["GCGACGGC;8036;0.002681536;CGTGCCG;8058;0.003927515;+",
                 "CGGCAGCC;28770;0.002681536;GAAGGGC;28749;0.003927515;-"]

for line in example_lines:
    seq, pos, _, _, _, _, strand = line.split(sep=";")
    pos = int(pos)

    print(strand)
    print(seq)

    if "+" in strand: # Prints corresponding region on (+) strand
        print(genome.seq[pos:pos+len(seq)])
    elif "-" in strand: # Prints corresponding region on (-) strand
        print(genome.seq[pos:pos+len(seq)].reverse_complement())

   print()

My genome file is downloaded from public data bases (chromosome of Sinorhizobium meliloti). I expected to get matching sequences for both strands (or some sequences with slightly shifted frames due to different indexing), but from what I saw so far is that the sequences for the (-) strand are just going wild. The output for this specific case is as follows:

+
GCGACGGC
GCGACGGC

-
CGGCAGCC
CCGACGGC

I have looked into whether the problem is with the reverse complement, e.g. that my motif match is given in forward orientation. But this also did not yield in the solution of my problem. Additionally, I checked a broader neighbor region for my motif on the (-) strand to see whether my motif was at least somewhere near. This also was not the case (except on apparently random occasions with short motifs rich in GC since my genome has GC content around 70%).

My question to you is: Did I make an obvious mistake? Is there something I completely overlooked? I am grateful for any hints you might have for me!

pssm genome python motif search • 1.3k views

ADD COMMENT • link updated 6.5 years ago by Bastien Hervé 5.9k • written 6.5 years ago by Christian • 0

0

Entering edit mode

Try to investigate but you just have the reverse not the complement.

The reverse of CGGCAGCC is actually CCGACGGC

ADD REPLY • link 6.5 years ago by Bastien Hervé 5.9k

score 0 · Answer 1 · 2018-06-15

I think I got something :

I downloaded your genbank file and I look at position 28770 of the sequence, the sequence is actually gccgtcgg not CGGCAGCC like in your example_lines variable.

Instead of print(seq) write print(genome.seq[pos:pos+len(seq)]) and you will see the problem

The minus strand line in your data file are already reverse complement compared to the reference sequence in your genbank file