Question

Manipulation Of Large Numbers Of Sequences Using Python Or Matlab

0

11.5 years ago

zeropoint1 ▴ 10

Hello,

I have millions of lines in the following format:

number sequence

The number refers to the frequency at which the sequence appears in my library and the sequence is as you would expect. A typical line might look like:

25565 AGTGCATTTTGGTTTAGGCATGA

Thus, this particular and fictitious sequence shows up 25565 times in my library.

I need to manipulate this data in the following way:

1) Confirm that the final 5 letters are correct (in this case, CATGA) and if not, remove the line.

and then

2) Remove the final 5 letters from all of the sequences on every line.

I have been trying to figure out how to load this information either into Python as a dictionary or directly into MATLAB.

It would be very helpful to know whether this task is best approached with MATLAB, Python, or something else. Also, what is the best way to load the data from the text file into a dictionary in Python?

Thanks!

python matlab ngs • 4.3k views

ADD COMMENT • link updated 11.5 years ago by Song Qiang ▴ 40 • written 11.5 years ago by zeropoint1 ▴ 10

score 2 · Answer 1 · 2013-10-22

2

11.5 years ago

Damian Kao 16k

If you have millions of lines, it is better to stream through the file one line at a time than to read the entire file into a data structure (Python dictionary, array, ...). You can do this with something like:

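with open('inputFile.txt') as inFile:  # the file is closed automatically
    for line in inFile:
        data = line.strip().split()
        if len(data) < 2:
            continue  # skip blank or malformed lines
        count = int(data[0])
        sequence = data[1]
        # do something with your count and sequence variables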
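
If a dictionary is still wanted, as the question asks, the same loop can fill one. This is only a minimal sketch, assuming each line holds a count and a sequence separated by whitespace. `load_counts` is a hypothetical helper name, and summing the counts when a sequence appears on more than one line is an assumption about the desired behaviour.

```python
def load_counts(lines):
    """Build {sequence: count} from an iterable of 'count sequence' lines."""
    counts = {}
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        count, sequence = int(fields[0]), fields[1]
        # Assumption: if a sequence appears on several lines, add the counts up.
        counts[sequence] = counts.get(sequence, 0) + count
    return counts


# Usage with a real file (file name taken from the snippet above):
# with open('inputFile.txt') as in_file:
#     counts = load_counts(in_file)
```

With millions of distinct sequences the dictionary may need a lot of memory, which is why streaming is usually the safer default.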

I don't understand what you mean by "Confirm the final 5 letters are correct"? What do you mean by correct?

ADD COMMENT • link 11.5 years ago by Damian Kao 16k

0

I guess he/she might be looking for a primer or TAG

ADD REPLY • link 11.5 years ago by Biojl ★ 1.7k

0

If the last 5 letters don't match an expected string, then the line must be discarded. A mismatch means that the sequence was misread and should not be considered. Thank you for your advice.

ADD REPLY • link 11.5 years ago by zeropoint1 ▴ 10

1

If s is a string in Python, then its last five letters are just s[-5:]. So:

if sequence[-5:] == 'CATGA':
    trimmed = sequence[:-5]  # suffix matches: keep the line, minus its last 5 letters
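
Putting the suffix check together with line-by-line streaming, both steps from the question might be sketched as below. This assumes whitespace-separated `count sequence` lines and the expected suffix `CATGA` from the question. `filter_and_trim` is a hypothetical name, not something from the thread.

```python
SUFFIX = 'CATGA'  # expected final 5 letters, taken from the question


def filter_and_trim(lines, suffix=SUFFIX):
    """Yield (count, trimmed_sequence) for sequences that end with suffix."""
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        count, sequence = int(fields[0]), fields[1]
        if sequence.endswith(suffix):
            # Step 1 passed, so apply step 2: drop the suffix.
            yield count, sequence[:len(sequence) - len(suffix)]


# Usage, streaming from one file to another (file names are placeholders):
# with open('in.txt') as src, open('out.txt', 'w') as dst:
#     for count, seq in filter_and_trim(src):
#         dst.write(f'{count} {seq}\n')
```

Because it is a generator, only one line is held in memory at a time.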

ADD REPLY • link 11.5 years ago by KCC ★ 4.1k

score 2 · Answer 2 · 2013-10-22

2

11.5 years ago

Song Qiang ▴ 40

You could use a sed one-liner. If the input file is in.txt and the output file is out.txt, run:

sed -n '/CATGA$/ s/CATGA$//p' < in.txt > out.txt
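
In that command, `-n` suppresses sed's default printing, the address `/CATGA$/` selects only lines ending in CATGA, and `s/CATGA$//p` strips that ending and prints the result. For comparison, a rough Python equivalent that works on whole lines in the same way might look like this. It is only a sketch, and `sed_like` is a hypothetical name.

```python
import re

TAIL = re.compile(r'CATGA$')  # same pattern as in the sed command


def sed_like(lines):
    """Yield each line that ends in CATGA, with that ending removed."""
    for line in lines:
        line = line.rstrip('\n')
        trimmed, n_subs = TAIL.subn('', line)
        if n_subs:  # like sed's p flag: emit only when a substitution happened
            yield trimmed
```

Like the sed version, this leaves the count column untouched and silently drops lines that do not end in CATGA. Neither version matches lines with trailing spaces or Windows line endings.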

ADD COMMENT • link 11.5 years ago by Song Qiang ▴ 40

0

Very nice sed one-liner!

ADD REPLY • link 11.5 years ago by Ming Tommy Tang ★ 4.6k