Hello everyone.
This question may be rather dumb, but at this point I got really confused and a little stuck. I am a master's student in molecular biology and relatively inexperienced when it comes to computational problems. However, I wanted to give it a try to increase my range of skill.
My problem is as follows: I have a set of PSSMs and used the tool PoSSuMSearch to find matches for my genome of interest. I am interested in the neighborhood of the matches I found, so I wanted to use the positional coordinates given from the PSSM search to extract my region of interest from a genome file. This works fine if the motif I found is located on the forward strand. Troubles arises somewhere in my handling of the reverse strand: According to the manual file from PoSSumSearch (see figure on page 17), the position for each motif is located on 'left-most position of the motif'. It should look something like this:
# | motif_on_plus
# - - - - - - - - - - -> (+) strand
# 0 1 2 3 4 5 6 7 8 9 position in genome
# <- - - - - - - - - - (-) strand
# | motif_on_minus
In that case my motif on the (+) strand would range from position 1 to 1+length(motif). I expect my motif on the (-) strand to range from 3 to 3+length(motif). I use some Python to check if that is valid:
from Bio import SeqIO
genome = SeqIO.read(genome_file, "genbank")
# Examplary lines of my data file after some sorting and filtering steps
example_lines = ["GCGACGGC;8036;0.002681536;CGTGCCG;8058;0.003927515;+",
"CGGCAGCC;28770;0.002681536;GAAGGGC;28749;0.003927515;-"]
for line in example_lines:
seq, pos, _, _, _, _, strand = line.split(sep=";")
pos = int(pos)
print(strand)
print(seq)
if "+" in strand: # Prints corresponding region on (+) strand
print(genome.seq[pos:pos+len(seq)])
elif "-" in strand: # Prints corresponding region on (-) strand
print(genome.seq[pos:pos+len(seq)].reverse_complement())
print()
My genome file is downloaded from public data bases (chromosome of Sinorhizobium meliloti). I expected to get matching sequences for both strands (or some sequences with slightly shifted frames due to different indexing), but from what I saw so far is that the sequences for the (-) strand are just going wild. The output for this specific case is as follows:
+
GCGACGGC
GCGACGGC
-
CGGCAGCC
CCGACGGC
I have looked into whether the problem is with the reverse complement, e.g. that my motif match is given in forward orientation. But this also did not yield in the solution of my problem. Additionally, I checked a broader neighbor region for my motif on the (-) strand to see whether my motif was at least somewhere near. This also was not the case (except on apparently random occasions with short motifs rich in GC since my genome has GC content around 70%).
My question to you is: Did I make an obvious mistake? Is there something I completely overlooked? I am grateful for any hints you might have for me!
Try to investigate but you just have the reverse not the complement.
The reverse of
CGGCAGCC
is actuallyCCGACGGC