Combine lines into one line after specific line identifier
8.6 years ago
IrK ▴ 100

Hi guys,

I have to read a file in Python that looks like the following:

sequence: 1
cght ghgjt
sequence: 2
dsjhgfk kjhfds;kuhur
sequence: 3
ccccccccccccchhhhh
gggggggggggtt
sequence: 4
ffghjhjj
sequence: 5
sequence: 6

I have to collect all the data for each sequence onto one line (see below). Basically, 'sequence:' is my identifier: the lines that follow it contain some information, or possibly none. Either way, I need to merge everything into one line. Entries that have nothing after 'sequence:' should be assigned a placeholder such as "None sequence". How would you do this in Python? Desired output:

sequence: 1  cght ghgjt
sequence: 2  dsjhgfk kjhfds;kuhur
sequence: 3  ccccccccccccchhhhh    gggggggggggtt
sequence: 4  ffghjhjj
sequence: 5 "None sequence"
sequence: 6  "None sequence"

thank you,

python read file • 17k views

What have you tried?


Good point, Pierre. I forgot to share what I have achieved.

So far I have managed this:

file1=open('test2.txt','r')
lines=file1.readlines()

for i in xrange(len(lines)):
    current_line= lines[i].split()[0]
    rest_of_current_line=lines[i].split()[1]
    if current_line=='sequence:':
        next_l=lines[i+1].strip()
        a=[current_line,rest_of_current_line, next_l]   # get ONE LINE INFO
        print a

gives:

['sequence:', '1', 'cght ghgjt']
['sequence:', '2', 'dsjhgfk kjhfds;kuhur']
['sequence:', '3', 'ccccccccccccchhhhh']

However, the info for sequence 4 is missing, and so are the empty lines!

8.6 years ago
Zhilong Jia ★ 2.2k

Using awk, for example:

awk 'BEGIN{RS="sequence:"}NR!=1{gsub("\n", " ", $0); print "sequence:" $0}' a.t


Thank you, Zhilong Jia, this works perfectly! Could you please also explain the logic of your command? I am not a Unix expert, so I can't apply this command to my real file, which has more columns. My understanding so far: awk does the pattern searching, and gsub catches the end of the line.

Thank you

P.S.: I would still like to get it working in Python as well ;) I would like to understand how to manipulate files for this type of task.


The idea is to treat sequence: as the record separator instead of the newline, and to replace each \n inside a record with a space. Extra columns do not affect it. BTW, I believe you can find a short awk manual to learn more.
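
Since a Python version was also asked for, the same idea can be sketched there: split the whole text on the marker and flatten each piece. This is only a minimal sketch, not a translation of the awk command; the function name merge_records and the sample string are made up for illustration. Unlike gsub, it also collapses runs of spaces inside a record.

```python
def merge_records(text, marker="sequence:"):
    """Split text on marker and flatten each record onto a single line."""
    merged = []
    # The chunk before the first marker is not a record, so skip it
    # (the counterpart of NR!=1 in the awk command).
    for chunk in text.split(marker)[1:]:
        # str.split() with no arguments splits on any run of whitespace,
        # newlines included, so joining with " " removes the line breaks.
        merged.append(" ".join([marker] + chunk.split()))
    return merged


# Hypothetical sample input, shortened from the file in the question.
sample = (
    "sequence: 1\ncght ghgjt\n"
    "sequence: 3\nccccccccccccchhhhh\ngggggggggggtt\n"
    "sequence: 5\n"
)
for line in merge_records(sample):
    print(line)
```

As with the awk command, any occurrence of the marker inside the data would also start a new record, and a sequence without data lines is printed as it is, without a placeholder.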


Thanks, it's clearer now. I will check the awk manual; hopefully, little by little, I will be able to write this sort of command line myself.

8.6 years ago
IrK ▴ 100
file1=open('test2.txt','r')
lines=file1.readlines()

for i in xrange(len(lines)):

    current_line= lines[i].split()[0]
    #print i, 'lines[i]', lines[i],'current--->',lines[i].split()[0]

    if current_line=='sequence:':

        rest_of_current_line=lines[i].split()[1]
        next_l=lines[i+1].split()[0]
        next_l2=lines[i+1].strip()
        if next_l=='sequence:':
            b=[current_line,rest_of_current_line, 'None']   # get ONE LINE INFO
            full.append(b)
            print b
        else:
            a=[current_line,rest_of_current_line, next_l2]
            print a
    else:
        pass

I managed to get the required output, but I still get an error:

    current_line= lines[i].split()[0]
IndexError: list index out of range

Print lines[i] before splitting it to see which line produces the error. My guess is that it is the last line, which is empty. You can avoid this with an if statement before you start splitting. Note that lines from readlines() keep their trailing newline, so strip the line before testing it:

if lines[i].strip():

Alternatively, if that is the problem, you could filter the empty lines out at the beginning with a list comprehension:

lines = [line for line in file1.readlines() if line.strip()]
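
A quick way to see why a blank line trips up the split, as a minimal sketch (the sample strings are made up for illustration):

```python
blank = "\n"                 # what readlines() returns for an empty line
print(blank == "")           # False: the newline is still there
print(blank.split())         # []: no fields, so [0] would raise IndexError
print(blank.strip() == "")   # True: strip() removes the newline first

# Keep only the lines that still contain something after stripping.
lines = ["sequence: 1\n", "cght ghgjt\n", "\n"]
kept = [line for line in lines if line.strip()]
print(kept)
```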

As for your script in general, I would do it differently and simply loop over the lines. Store the lines in a temporary list, and write the previous list away when you encounter the next 'sequence' line:

templist = []
for line in lines:
    line = line.strip()
    if line.startswith('sequence'):
        if templist:  # templist is still empty at the first 'sequence' line
            print(' '.join(templist))  # write the previous result away
        templist = [line]  # empty templist for the next round and store this line
    elif line:
        templist.append(line)  # append every non-empty line that does not start with 'sequence'
if templist:
    print(' '.join(templist))  # write the last result away

Let me know if you need additional pointers.

8.6 years ago
IrK ▴ 100

Sorry, I saw what I wanted to see two days ago ;( but my script actually doesn’t work properly. It reads the line right after a "sequence:" line, but if more than one line follows, it doesn’t read the second, third, fourth, etc.

Unfortunately, I haven't understood your (decosterwouter's) idea either.

However, I found a similar question, and that approach seems to work: http://stackoverflow.com/questions/4595197/how-to-grab-the-lines-after-a-matched-line-in-python

I would appreciate it if anyone knows how to improve the current script. Just give me an idea and I will try to implement it. Thanks.


This works for me:

import sys

with open(sys.argv[1]) as infile:
    # Strip each line and skip the blank ones
    lines = [item.strip() for item in infile if item.strip()]
    templist = []
    for line in lines:
        if line.startswith('sequence'):
            if templist:
                # A record is already stored: write it away first
                print(' '.join(templist))
            templist = [line]  # start a new record with this line
        else:
            templist.append(line)
    else:
        # The loop is finished: write the last record
        print(' '.join(templist))

Notice:

- the list comprehension to properly format the input
- the else clause on the for loop to also write the last line
- storing the objects in the templist and emptying this after writing away when the next 'sequence' is reached.
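
The script above prints a line such as 'sequence: 5' as it is, whereas the question asked for a placeholder on sequences without data. One way to add that is sketched below, assuming the placeholder text "None sequence" from the desired output; the function name merge_with_placeholder and its parameters are hypothetical.

```python
def merge_with_placeholder(lines, marker="sequence:", placeholder='"None sequence"'):
    """Join each marker line with the lines that follow it, or with a placeholder."""
    records = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue                    # ignore blank lines
        if line.startswith(marker):
            records.append([line])      # start a new record
        elif records:
            records[-1].append(line)    # continuation of the current record
        # lines that appear before the first marker are dropped
    # A record that holds only its marker line gets the placeholder appended.
    return [" ".join(rec if len(rec) > 1 else rec + [placeholder]) for rec in records]


# The sample file from the question, as a list of lines.
sample_lines = [
    "sequence: 1", "cght ghgjt",
    "sequence: 2", "dsjhgfk kjhfds;kuhur",
    "sequence: 3", "ccccccccccccchhhhh", "gggggggggggtt",
    "sequence: 4", "ffghjhjj",
    "sequence: 5",
    "sequence: 6",
    "",
]
for merged in merge_with_placeholder(sample_lines):
    print(merged)
```

Collecting the records in a list of lists, rather than printing inside the loop, means the last record needs no special handling after the loop ends.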


Thank you,

I have to familiarize myself with the sys module, but thanks for the help.
