Hi guys,
I made a script that works very well: it reads IDs from one file, compares them against a genome sequence file, and prints every matching sequence to another file.
I have run this script on several different ID files without problems, until now! My regular expression does not seem to match one specific ID, and I don't know why!
# This is the regex
$key =~ m/^>([A-Z]+[0-9]+[A-Z]+(\-[A-Z])*).+$/o;
my $header_sub = $1;
And the ID that doesn't match is:
>YER062C GPP2 SGDID:S000000864, Chr V from 280682-279930, Genome Release 64-2-1, reverse complement, Verified ORF, "DL-glycerol-3-phosphate phosphatase involved in glycerol biosynthesis; also known as glycerol-1-phosphatase; induced in response to hyperosmotic or oxidative stress, and during diauxic shift; GPP2 has a paralog, GPP1, that arose from the whole genome duplication"
ATGGGATTGACTACTAAACCTCTATCTTTGAAAGTTAACGCCGCTTTGTTCGACGTCGACGGTACCATTATCATCTCTCAACCAGCCATTGCTGCATTCTGGAGGGATTTCGGTAAGGACAAACCTTATTTCGATGCTGAACACGTTATCCAAGTCTCGCATGGTTGGAGAACGTTTGATGCCATTGCTAAGTTCGCTCCAGACTTTGCCAATGAAGAGTATGTTAACAAATTAGAAGCTGAAATTCCGGTCAAGTACGGTGAAAAATCCATTGAAGTCCCAGGTGCAGTTAAGCTGTGCAACGCTTTGAACGCTCTACCAAAAGAGAAATGGGCTGTGGCAACTTCCGGTACCCGTGATATGGCACAAAAATGGTTCGAGCATCTGGGAATCAGGAGACCAAAGTACTTCATTACCGCTAATGATGTCAAACAGGGTAAGCCTCATCCAGAACCATATCTGAAGGGCAGGAATGGCTTAGGATATCCGATCAATGAGCAAGACCCTTCCAAATCTAAGGTAGTAGTATTTGAAGACGCTCCAGCAGGTATTGCCGCCGGAAAAGCCGCCGGTTGTAAGATCATTGGTATTGCCACTACTTTCGACTTGGACTTCCTAAAGGAAAAAGGCTGTGACATCATTGTCAAAAACCACGAATCCATCAGAGTTGGCGGCTACAATGCCGAAACAGACGAAGTTGAATTCATTTTTGACGACTACTTATATGCTAAGGACGATCTGTTGAAATGGTAA
I have tried several things, including deleting the ID and typing it in again. I checked every stage of my script, and it is here, at the match, that the ID "disappears"!
I would be very grateful for any help!
Cheers
The match seems to work just fine, but when I run the script it does not retrieve the sequence that I want! :(
Your regex works on a single line. For me, it works if I use it only on the header (without the sequence ATGGGATTGACTACTAA...). Is the sequence part of the ID? Or did something go wrong when splitting HEADER and SEQUENCE in your script, before the regex part?
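To illustrate what I mean, here is a minimal sketch. The extract_id helper and the shortened header are made up for this example and are not taken from your script. It applies your regex once to the header alone and once to the header and sequence joined in one string. Since . does not match a newline and $ (without the /m modifier) only matches at the end of the string, I would expect the second case to fail.

```perl
use strict;
use warnings;

# Hypothetical helper: return the captured ID, or undef if the regex fails.
sub extract_id {
    my ($key) = @_;
    if ($key =~ m/^>([A-Z]+[0-9]+[A-Z]+(\-[A-Z])*).+$/o) {
        return $1;
    }
    return undef;
}

# Shortened header, for illustration only.
my $header = '>YER062C GPP2 SGDID:S000000864, Chr V from 280682-279930';

# Case 1: header alone.
my $id_header_only = extract_id($header);

# Case 2: header and (truncated) sequence in the same string.
my $id_with_sequence = extract_id($header . "\nATGGGATTGACTACTAA\n");

print 'header only:   ', (defined $id_header_only   ? $id_header_only   : 'no match'), "\n";
print 'with sequence: ', (defined $id_with_sequence ? $id_with_sequence : 'no match'), "\n";
```

Checking whether the match succeeded before reading $1 also avoids silently reusing the capture from an earlier successful match.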
Yes, the regex works on a single line; it is applied only to the header, and the $header_sub variable only captures the ID, for example YER062C. My script works very well for other files that contain more than 300 IDs, searching a genome file with more than 6000 sequences. The script retrieves the sequences for all the IDs in the file that contains YER062C, except for this one ID!
Okay, but the problem is that I cannot reproduce the error. As I said, it depends on what is in $key. If I use only
>YER062C
or
>YER062C GPP2 SGDID:S000000864, Chr V from 280682-279930, Genome Release 64-2-1, reverse complement, Verified ORF, "DL-glycerol-3-phosphate phosphatase involved in glycerol biosynthesis; also known as glycerol-1-phosphatase; induced in response to hyperosmotic or oxidative stress, and during diauxic shift; GPP2 has a paralog, GPP1, that arose from the whole genome duplication"
your regex does work. So to be able to help, I need to know exactly what your script does, what $key contains, and how you read your files.
Can I send my script and my files to your email?
Put the files in a Dropbox folder and share them if you want. And specify exactly what you want to select.
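In the meantime, it may help to print exactly what $key holds right before the regex runs. Below is a rough sketch, assuming the value might carry characters you cannot see on screen (that is only a guess on my part). The show_hidden helper is hypothetical, and the example value stands in for whatever your script actually reads.

```perl
use strict;
use warnings;

# Hypothetical helper: make non-printable characters in a string visible.
sub show_hidden {
    my ($text) = @_;
    # Replace anything outside printable ASCII with a \x{..} escape.
    $text =~ s/([^\x20-\x7E])/sprintf('\\x{%02X}', ord($1))/ge;
    return $text;
}

# Example value only; in the real script $key comes from the genome file.
my $key = ">YER062C GPP2 SGDID:S000000864\r\n";

print 'length:  ', length($key), "\n";
print 'content: [', show_hidden($key), "]\n";
```

The brackets make leading or trailing whitespace easy to spot, and the length shows whether $key holds just the header or more than that.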
Sure, for mail see my profile