Question

extract columns from file based on header selected from 2nd file

1

Entering edit mode

9.4 years ago

joaofadista ▴ 50

I have one large text table, tab separated. The first row is a header. I then have a second text file that contains a subset of the headers in the first file. I want to extract all the columns of the first file whose header is contained in the list given in the second file. Here an example of the inputs and the desired output:

DATA.TXT

head0 head1 head2 head3 head4
1 25 1364 22 13
2 10 215 1 22

LIST.TXT

head0  
head4

Desired output:

head0 head4  
1 13  
2 22

text delimited • 13k views

ADD COMMENT • link updated 2.7 years ago by Ram 45k • written 9.4 years ago by joaofadista ▴ 50

0

Entering edit mode

Can we have a few actual examples of the header names? It's easy if they do actually end 0, 1, 2, 3 ... but I assume they don't?

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by george.ry ★ 1.2k

0

Entering edit mode

How can i do it in R? Thanks

ADD REPLY • link 4.9 years ago by rse ▴ 100

0

Entering edit mode

How can i do it in R? Thanks

ADD REPLY • link 4.9 years ago by rse ▴ 100

1

Entering edit mode

9.4 years ago

george.ry ★ 1.2k

Ok, regardless of my comment above ... two bash commands should see you done ...

paste <(seq 1 $(head -1 DATA.TXT | wc -w))  <(head -1 DATA.TXT | tr '\t' '\n') > TMP
cat DATA.TXT | cut -f$(grep -wFf LIST.TXT TMP | cut -f1 | sed ':a;N;$!ba;s/\n/,/g') > OUTPUT

It's the sort of thing that's easier in R or something, though!

// EDIT // On a Mac (useless BSD rather than GNU tools!) apparently you'd have to alter the sed command there to ...

sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/,/g'

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by george.ry ★ 1.2k

0

Entering edit mode

Thanks George, but it outputs this error message:

cut: option requires an argument -- 'f'

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by joaofadista ▴ 50

0

Entering edit mode

9.4 years ago

Antonio R. Franco ★ 5.2k

if you know the columns in advance, and depending upon the delimitator, you can simply use a cut command. If tabs are used, in this case you select the first and fifth column

cat DATA.TXT | cut -f 1,5 > output.file

This requires you to reveal the desired columns. If this is easy to check, see how easy the code is

If the delimiting is different to a tab, such as an empty space

cat DATA.TXT | cut -d " " -f 1,5 > output.file

I dare to put this here because of the simplicity of this code

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by Antonio R. Franco ★ 5.2k

Ram · Accepted Answer · 2015-11-19

4

Entering edit mode

9.4 years ago

iraun 6.2k

Just with a simple google search a hundred of questions/answers related to this topic can be easily found, specially, in stackoverflow: http://unix.stackexchange.com/questions/114829/display-column-that-has-a-specific-string

I'd suggest you to try to solve this kind of problems searching on internet and trying yourself before asking. In this case, you can get what you want with a little modification to the solution proposed on the link:

awk -F'\t' 'NR==FNR{arr[$1]++;next}{for(i=1; i<=NF; i++) if ($i in arr){a[i]++;}} { for (i in a) printf "%s\t", $i; printf "\n"}' file2 file1

ADD COMMENT • link updated 5.4 years ago by Ram 45k • written 9.4 years ago by iraun 6.2k

0

Entering edit mode

This works perfectly. However, if LIST.TXT contains headers which do not exist in DATA.TXT, say headX and headY, and their order is "head0, headX, head4, headY", can the output include null values for headX and headY columns and also print output in the same order as in LIST.TXT file?

ADD REPLY • link updated 5.4 years ago by Ram 45k • written 6.9 years ago by gollapudiv • 0