Hello,
I have millions of lines in the following format:
number sequence
The number refers to the frequency at which the sequence appears in my library and the sequence is as you would expect. A typical line might look like:
25565 AGTGCATTTTGGTTTAGGCATGA
Thus, this particular and fictitious sequence shows up 25565 times in my library.
I need to manipulate this data in the following way:
1) Confirm that the final 5 letters are correct (in this case, CATGA) and if not, remove the line.
and then
2) Remove the final 5 letters from all of the sequences on every line.
I have been trying to figure out how to load this information into either Python as a dictionary or directly into MATLAB.
It would be very helpful to know whether this task would be best approached with MATLAB, Python, or something else. Also, what is the best way to load the data from the text file into a dictionary in Python?
Thanks!
I guess he/she might be looking for a primer or TAG
If the last 5 letters don't match an expected string, then the line must be discarded. This means that the sequence was misread and should not be considered. Thank you for your advice.
If s is a string in Python, then the last five letters are just: s[-5:]
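Here is a minimal sketch of how that slice could be used for both steps from the question, assuming each line is "count sequence" separated by whitespace and that the expected ending is CATGA as in the example line. The names filter_and_trim and EXPECTED_TAIL are made up for illustration, and summing the counts of repeated sequences is my assumption about what you would want.

```python
EXPECTED_TAIL = "CATGA"  # assumed ending, taken from the example line in the question


def filter_and_trim(lines, tail=EXPECTED_TAIL):
    """Return {trimmed_sequence: count} for lines whose sequence ends with `tail`.

    `tail` is assumed to be a non-empty string.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        count, seq = parts
        if seq[-len(tail):] != tail:
            continue  # step 1: wrong ending, so the line is discarded
        trimmed = seq[:-len(tail)]  # step 2: drop the final letters
        # If a sequence appears on more than one line, add the counts together.
        counts[trimmed] = counts.get(trimmed, 0) + int(count)
    return counts


sample = [
    "25565 AGTGCATTTTGGTTTAGGCATGA",  # correct ending, kept
    "12 AGTGCATTTTGGTTTAGGCATGC",     # wrong ending, dropped
]
print(filter_and_trim(sample))
```

Because the function only loops over its argument, an open file object should work in place of the list (for example, with open("library.txt") as fh: counts = filter_and_trim(fh), where library.txt is a placeholder name). That reads one line at a time, so the whole file never has to sit in memory, although the resulting dictionary still does.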
So,