using a loop to extract data from a text file and outputting it as a new file
2.3 years ago
matt81rd ▴ 10

Hi, I need to extract a certain portion of a file I have just created and write it to a new file. I need to loop through the file, as there are many data points to extract.

I need to extract the information in the name column, for example 916830_H20130029501-2. I know sed, awk or grep are probably the best tools for this, but I am unsure what the pattern would look like given the layout of the input file below:

H194880489
 id  |         name          | t0  | t5  | t10 | t25 | t50 | t100 | t250 
-----+-----------------------+-----+-----+-----+-----+-----+------+------
 745 | 882730_H19488048901-2 | 638 | 597 | 325 | 300 | 153 |   93 |   93
 715 | 850922_H19488048901-2 | 638 | 597 | 325 | 300 | 153 |   93 |   93
(2 rows)

H194660490
 id  |         name          | t0  | t5  | t10 | t25 | t50 | t100 | t250 
-----+-----------------------+-----+-----+-----+-----+-----+------+------
 709 | 842927_H19466049001-2 | 632 | 592 | 559 | 233 |   6 |    6 |    6
(1 row)

H194620465
 id  |         name          | t0  | t5  | t10 | t25 | t50 | t100 | t250 
-----+-----------------------+-----+-----+-----+-----+-----+------+------
 707 | 841499_H19462046501-1 | 630 | 590 | 557 | 486 | 378 |  186 |   68
(1 row)

H194420367
 id  |          name           | t0  | t5  | t10 | t25 | t50 | t100 | t250 
-----+-------------------------+-----+-----+-----+-----+-----+------+------
 703 | 833390_H19442036701-2   | 626 | 587 | 555 | 484 | 312 |   36 |   19
 739 | 882806_H19442036703-2   | 653 | 587 | 555 | 484 | 312 |   36 |   19
 756 | 882806_H19442036703_v-1 | 653 | 587 | 555 | 484 | 312 |   36 |   19
(3 rows)

As you can see, the number of rows varies: sometimes a data point has no information to extract, and sometimes it has two or even three entries under name. I only need the first one.

The format of the output would look something like this:

882730_H19488048901-2
842927_H19466049001-2
841499_H19462046501-1
833390_H19442036701-2

Any help will be greatly appreciated :)

sed grep awk • 785 views
$ awk 'FNR == 3' file.txt | sed 's/ //g' | cut -d '|' -f 2
  1. print only the 3rd line of the file (FNR == 3)
  2. remove all spaces so the fields are clean for cut
  3. cut out the 2nd |-delimited field, which is the name column.
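
Note that FNR == 3 selects a single line of the whole file, so the pipeline above handles one table at most. To get the first name under every data point, the separator line (-----+-----) can be used as a marker: the line right after it is either the first data row or the row-count footer. Below is a minimal Python sketch of that idea, not a tested drop-in solution. The function name first_names is made up for illustration, the sample is a trimmed copy of the data in the question (fewer columns), and the empty H000000000 block is hypothetical, added only to show the "no information" case.

```python
SAMPLE = """\
H194880489
 id  |         name          | t0
-----+-----------------------+-----
 745 | 882730_H19488048901-2 | 638
 715 | 850922_H19488048901-2 | 638
(2 rows)

H000000000
 id  | name | t0
-----+------+-----
(0 rows)

H194660490
 id  |         name          | t0
-----+-----------------------+-----
 709 | 842927_H19466049001-2 | 632
(1 row)
"""


def first_names(lines):
    """Yield the name from the first data row under each separator line."""
    expect_first_row = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("-") and "+" in stripped:
            # Separator line: the next line is the first data row,
            # or a "(0 rows)" footer if the table is empty.
            expect_first_row = True
            continue
        if expect_first_row:
            expect_first_row = False
            fields = [field.strip() for field in stripped.split("|")]
            if len(fields) > 1:  # a footer has no "|" and is skipped
                yield fields[1]


for name in first_names(SAMPLE.splitlines()):
    print(name)
# prints:
# 882730_H19488048901-2
# 842927_H19466049001-2
```

An open file can be passed instead of SAMPLE.splitlines(), since file objects iterate line by line, and the printed names can be written to a new file in the same loop.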

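Are the files tab-delimited, or are they really in this format with |, - and + characters?

You say the data comes "from a file i have just created".

Since you create the files yourself, you could just as easily write them in any other format, including exactly what you want: the first name that appears under each data point.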

2.3 years ago
Joe 21k

I'd suggest not using sed etc. for this task.

You could apply the approach here and robustly tabulate the whole file with pandas.read_fwf:

https://github.com/jrjhealey/bioinfo-tools/blob/master/tabulateHHpred.py#L41-L55
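
To illustrate the fixed-width idea without reproducing the linked script, here is a rough standard-library sketch. It reads the column boundaries off each separator line and slices every row at those positions. The function names column_spans and tabulate are made up for illustration, and the sample is a trimmed copy of two tables from the question (first three columns only). The (start, end) pairs it computes have the same half-open shape that the colspecs parameter of pandas.read_fwf accepts, so they could be handed to pandas per table if a DataFrame is wanted.

```python
import io

SAMPLE = """\
H194660490
 id  |         name          | t0
-----+-----------------------+-----
 709 | 842927_H19466049001-2 | 632
(1 row)

H194420367
 id  |          name           | t0
-----+-------------------------+-----
 703 | 833390_H19442036701-2   | 626
 739 | 882806_H19442036703-2   | 653
 756 | 882806_H19442036703_v-1 | 653
(3 rows)
"""


def column_spans(separator):
    """Return half-open (start, end) spans for a -----+----- line."""
    spans, start = [], 0
    for pos, char in enumerate(separator):
        if char == "+":
            spans.append((start, pos))
            start = pos + 1
    spans.append((start, len(separator)))
    return spans


def tabulate(lines):
    """Return one dict per data row, tagged with the ID line above its table."""
    records = []
    block = header = spans = None
    previous = ""
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("-") and "+" in line:
            # Column widths differ between tables, so recompute them here.
            spans = column_spans(line)
            header = [previous[a:b].strip() for a, b in spans]
        elif spans and "|" in line:
            values = [line[a:b].strip() for a, b in spans]
            records.append({"block": block, **dict(zip(header, values))})
        elif line.strip() and "|" not in line and not line.startswith("("):
            # A bare ID line starts a new table.
            block, spans = line.strip(), None
        previous = line
    return records


records = tabulate(io.StringIO(SAMPLE))

# Keep only the first name seen for each table.
first = {}
for record in records:
    first.setdefault(record["block"], record["name"])
print(list(first.values()))
# prints ['842927_H19466049001-2', '833390_H19442036701-2']
```

Tabulating everything first costs a few more lines than a one-off pattern, but the same records can then answer other questions about the file, not just "first name per table".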
