Edit: I added -w to the grep so it doesn't confuse, e.g., Seq1, Seq11 and Seq111. The tab in the echo is a literal one (Ctrl+V, then Tab). If the file has hundreds of thousands of lines or more, it will take a long time to finish.
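If you want the same word-safe matching from Python instead of grep, the re module's \b word boundary behaves much like -w. A minimal sketch (the file name is made up):

import re

# \b anchors the match at word boundaries, like grep -w:
# 'Seq1' will not match inside 'Seq11' or 'Seq111'.
pattern = re.compile(r'\bSeq1\b')

with open('blast_output.txt') as fh:  # hypothetical file name
    for line in fh:
        if pattern.search(line):
            print(line, end='')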
I would go with a similar workflow, except that I would not store the result in a dictionary if the BLAST file is huge: just output the key and value whenever a new key is encountered.
I agree, but I was just assuming that something like Seq1 ... Seq2 ... Seq1 might happen. A one-pass stream can't move backward, unfortunately. Maybe running the file through a sort pipeline first and then applying your workflow would ensure both memory efficiency and accuracy (see the sketch below).
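A minimal sketch of that combination in Python, assuming comma-separated input with the query ID in the first column that has already been sorted on that column (e.g. by an external sort for very large files); the file name is made up:

import csv
from itertools import groupby
from operator import itemgetter

# Because the file is pre-sorted on column 1, every query ID forms one
# contiguous run of rows, so a single forward pass over the groups suffices.
with open('sorted_hits.csv', newline='') as fh:  # hypothetical file name
    reader = csv.reader(fh)
    for query, hits in groupby(reader, key=itemgetter(0)):
        print(query, ';'.join(row[1] for row in hits))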
@heikki5: that's super clever and comes in handy so often; I will definitely add that to my repertoire.
I am especially interested in Python-based solutions, as I am currently learning the language. I asked this question because I am currently working on a little pipeline for NGS data processing. However, I am only ending up with an empty output file.
import csv

SEP = ';'  # assumed separator for the joined hits; not defined in the original snippet

current = None
concat = []

with open(processed, newline='') as finalin, open(modified, 'w', newline='') as finalout:
    reader = csv.reader(finalin, delimiter=',')
    writer = csv.writer(finalout)
    for row in reader:
        if len(row) < 3:
            continue  # skip malformed rows
        if row[0] == current:
            concat.append(row[1])  # same query: keep collecting hits
        else:
            if current is not None:
                # query changed: write out the finished group
                writer.writerow([current, SEP.join(concat)])
            current = row[0]  # advance to the new query
            concat = [row[1]]
    if current is not None:
        writer.writerow([current, SEP.join(concat)])  # flush the last group
In the beginning, we set a variable s to 0. Then, for each line, we check whether s differs from the value of column 1 ($1). If they differ, we assign the column-1 value to s and print it on a new line. If column 1 and s have the same value, we print the value of the third column ($3) on the current output line. Once done with the entire file, we print a final newline to wrap things up.
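The awk one-liner itself isn't reproduced in this excerpt, but the described logic translates to Python roughly like this (a sketch assuming whitespace-separated columns on stdin, and reading "print $3" as applying to every row so the first hit of a group isn't lost):

import sys

s = 0  # like the awk variable s: the last query ID seen, 0 meaning "none yet"
for line in sys.stdin:
    cols = line.split()
    if cols[0] != s:                  # column 1 differs from s
        s = cols[0]
        sys.stdout.write('\n' + s)    # start a new output line with the ID
    sys.stdout.write('\t' + cols[2])  # append column 3 to the current line
sys.stdout.write('\n')                # final newline to wrap up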
I'd have used a default value of empty string for s, but 0 works too, I guess.
This is genius! It runs through the file multiple times, but the approach is cool!