Question

awk specific line from another file

1

6.4 years ago

Hughie ▴ 30

Hello everyone,

I'm new to Linux and I have a problem.

I have a file file1 like:

and file2 which is tab-delimited:

chr1    3052600 3052800 1       E3  
chr1    3052800 3053000 2       E3  
chr1    3059400 3059600 3       E3  
chr1    3059600 3059800 4       E3  
chr1    3059800 3060000 5       E3  
chr1    3062600 3062800 6       E3  
chr1    3101000 3101200 7       E3  
chr1    3105000 3105200 8       E3  
chr1    3105200 3105400 9       E3  
chr1    3116800 3117000 10      E2  
chr1    3117000 3117200 11      E2  
chr1    3164800 3165000 12      E2

I want to extract the lines in file2 whose 4th column equals one of the numbers in file1, like below:

chr1 3059400 3059600 3 E3   
chr1 3062600 3062800 6 E3     
chr1 3101000 3101200 7 E3   
chr1 3105200 3105400 9 E3   
chr1 3164800 3165000 12 E2

I have spent several hours on this, including writing a very slow Python script, and I searched for a one-line solution, but I found nothing!

awk -v FS="\t" 'NR==FNR{rows[$1]++;next}(substr($NF,1,length($NF)-1) in rows)'  file1 file2

Thanks a lot for some suggestions!

awk formatting • 4.2k views

ADD COMMENT • link 6.4 years ago by Hughie ▴ 30

1

Hello Hughie,

Please use appropriate tags. Your question is about formatting and awk. That should have been a tag when you created the question. When you add appropriate tags, users that follow the tag (usually experts interested in helping others in that subject matter) get notified of your question, and this means you stand a better chance at getting a relevant, useful response faster.

ADD REPLY • link 6.4 years ago by WouterDeCoster 47k

0

Thank you Wouter!
I have revised the tag

ADD REPLY • link 6.4 years ago by Hughie ▴ 30

2

The appropriate tags that Wouter mentioned are the two separate tags "formatting" and "awk", not a single out-of-place "formatting and awk" tag.

ADD REPLY • link 6.4 years ago by Ram 44k

0

Thank you !
Revised again!

ADD REPLY • link 6.4 years ago by Hughie ▴ 30

5

6.4 years ago

finswimmer 16k

Hello Hughie,

"but there are some problem"

You should describe what these problems are.

Nevertheless, this should work:

$ awk 'NR==FNR {a[$1]; next} $4 in a {print $0}' file1 file2

fin swimmer

ADD COMMENT • link 6.4 years ago by finswimmer 16k

4

A further shortened awk solution:

$ awk 'NR==FNR {a[$1]++} a[$4]' file1 file2
chr1    3059400 3059600 3   E3
chr1    3062600 3062800 6   E3
chr1    3101000 3101200 7   E3
chr1    3105200 3105400 9   E3
chr1    3164800 3165000 12  E2

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

Nice!

Could you please explain why this works?

fin swimmer

ADD REPLY • link 6.4 years ago by finswimmer 16k

2

@finswimmer: {a[$1]++} builds an array indexed by the first column of the first file (selected by NR==FNR). Since the first file has a single column, only whether the stored count is zero or non-zero matters, and no other commands are needed. When awk moves on to the second file, it uses column 4 ($4) of each line as an index into that array. If a[$4] is non-zero, meaning the value was seen in the first file, the matching line from file 2 is printed.

It is actually your awk solution above, collapsed step by step:

awk 'NR==FNR {a[$1]; next} $4 in a {print $0}' file1 file2
awk 'NR==FNR {a[$1]++} $4 in a {print $0}' file1 file2
awk 'NR==FNR {a[$1]++} $4 in a' file1 file2
awk 'NR==FNR {a[$1]++} a[$4]' file1 file2
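To check that the four forms above really select the same lines, here is a minimal sketch. The sample data is made up for illustration (file1 holds 3, 6, 7, 9 and 12, and file2 is a generated 12-line tab-delimited file), and the temporary directory and variable names are my own, not from the thread.

```shell
# Hypothetical sample data standing in for file1 and file2 from the question.
tmp=$(mktemp -d)
printf '%s\n' 3 6 7 9 12 > "$tmp/file1"
awk 'BEGIN { OFS = "\t"; for (i = 1; i <= 12; i++) print "chr1", i * 100, i * 100 + 200, i, "E3" }' > "$tmp/file2"

# The four forms from the list above, from longest to shortest.
out1=$(awk 'NR==FNR {a[$1]; next} $4 in a {print $0}' "$tmp/file1" "$tmp/file2")
out2=$(awk 'NR==FNR {a[$1]++} $4 in a {print $0}' "$tmp/file1" "$tmp/file2")
out3=$(awk 'NR==FNR {a[$1]++} $4 in a' "$tmp/file1" "$tmp/file2")
out4=$(awk 'NR==FNR {a[$1]++} a[$4]' "$tmp/file1" "$tmp/file2")

# Report whether every variant selected the same lines.
if [ "$out1" = "$out2" ] && [ "$out2" = "$out3" ] && [ "$out3" = "$out4" ]; then
    echo "all four variants agree"
fi

# Show which values of column 4 were kept.
printf '%s\n' "$out4" | cut -f4
rm -r "$tmp"
```

With this sample data, column 4 of the kept lines should be exactly the numbers listed in file1.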

ADD REPLY • link 6.4 years ago by cpad0112 21k

1

Hello cpad0112,

Thanks a lot for this explanation. I didn't know that awk skips the part after {...} without using next.

fin swimmer

ADD REPLY • link 6.4 years ago by finswimmer 16k

1

No problem. You will learn all these tricks from the awk one-liners of Pierre, Kevin and other stars, the way I did.

ADD REPLY • link 6.4 years ago by cpad0112 21k

1

I investigated a little more how this works, and I think I've got it.

NR==FNR {a[$1]++} builds an array with the first column of the first file as index and assigns a positive number to it (incremented with ++).

The part after {...} isn't really skipped. It is the next condition! If it evaluates to true, the line currently being processed by awk is printed. As the first file doesn't have a 4th column, this will always be false there and nothing will be printed. Once we reach the second file, a[$4] evaluates to true if the value of the 4th column is an index of a and its stored value is not 0. This is why we need to increment beforehand (a[$1]=1 would work as well).
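A small sketch of that difference, counting the lines each variant would print. The two tiny input files and all variable names are made up for illustration.

```shell
# Hypothetical inputs: file1 lists 3 and 9; file2 has 1, 3, 5 and 9 in column 4.
tmp=$(mktemp -d)
printf '%s\n' 3 9 > "$tmp/file1"
printf 'chr1\t100\t300\t%s\tE3\n' 1 3 5 9 > "$tmp/file2"

# Index created but never given a value: a[$4] is empty, which counts as false.
count_unset=$(awk 'NR==FNR {a[$1]} a[$4] {n++} END {print n+0}' "$tmp/file1" "$tmp/file2")

# Index incremented to 1: a[$4] is true for values seen in file1.
count_incr=$(awk 'NR==FNR {a[$1]++} a[$4] {n++} END {print n+0}' "$tmp/file1" "$tmp/file2")

# Explicit assignment behaves like the increment.
count_one=$(awk 'NR==FNR {a[$1]=1} a[$4] {n++} END {print n+0}' "$tmp/file1" "$tmp/file2")

# The "in" test only checks that the index exists, so no value is needed.
count_in=$(awk 'NR==FNR {a[$1]; next} $4 in a {n++} END {print n+0}' "$tmp/file1" "$tmp/file2")

echo "$count_unset $count_incr $count_one $count_in"
rm -r "$tmp"
```

One design note: referencing a[$4] creates an (empty) element for every value it looks up, while $4 in a leaves the array untouched, so the "in" form is the more economical one on a large file2.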

fin swimmer

ADD REPLY • link 6.4 years ago by finswimmer 16k

0

Hi, this gives weird output with the following example. In f1.txt, I do not have '12' but still the output provides a row with '12'.

cat f1.txt
3
6
7
9

cat f2.txt

chr1    3052600 3052800 1       E3
chr1    3052800 3053000 2       E3
chr1    3059400 3059600 3       E3
chr1    3059600 3059800 4       E3
chr1    3059800 3060000 5       E3
chr1    3062600 3062800 6       E3
chr1    3101000 3101200 7       E3
chr1    3105000 3105200 8       E3
chr1    3105200 3105400 9       E3
chr1    3116800 3117000 10      E2
chr1    3117000 3117200 11      E2
chr1    3164800 3165000 12      E2

awk 'NR==FNR {a[$1]++} a[$4]' f1.txt f2.txt

chr1    3059400 3059600 3       E3
chr1    3062600 3062800 6       E3
chr1    3101000 3101200 7       E3
chr1    3105200 3105400 9       E3
chr1    3164800 3165000 12      E2

ADD REPLY • link 6.4 years ago by EagleEye 7.6k

0

In the OP, the first file has 12.

ADD REPLY • link 6.4 years ago by cpad0112 21k

0

Yes but with my example above it did not work. I did not have 12 in my 'f1.txt'.

ADD REPLY • link 6.4 years ago by EagleEye 7.6k

0

Oops, sorry, it works fine. I made a stupid mistake.

ADD REPLY • link 6.4 years ago by EagleEye 7.6k

0

Thank you cpad0112 for your nice answer! Question solved!

ADD REPLY • link 6.4 years ago by Hughie ▴ 30

0

Thank you fin swimmer for your reply!

ADD REPLY • link 6.4 years ago by Hughie ▴ 30

1

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.

ADD REPLY • link 6.4 years ago by WouterDeCoster 47k

score 5 · Accepted Answer · 2018-06-26

5

6.4 years ago

5heikki 11k

This is actually a job for join

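# join expects both inputs sorted lexically on the join field, so use plain sort (not sort -g);
# the final sort -k4,4n restores numeric order on column 4.
join -t $'\t' -1 1 -2 4 -o 2.1,2.2,2.3,2.4,2.5 \
    <(sort file1) \
    <(sort -t $'\t' -k4,4 file2) \
    | sort -t $'\t' -k4,4n > out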

ADD COMMENT • link 6.4 years ago by 5heikki 11k
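For anyone trying the join approach, here is a minimal sketch on made-up data, written with temporary files instead of process substitution so it also runs in a plain POSIX shell. The inputs, the variable names and the LC_ALL=C setting are my own assumptions. The point it illustrates is that join compares keys as strings, so both inputs are sorted lexically on the join field here.

```shell
# Hypothetical inputs: file1 lists 2 and 10; file2 has 3 and 10 in column 4.
tmp=$(mktemp -d)
tab=$(printf '\t')
printf '%s\n' 2 10 > "$tmp/file1"
printf 'chr1\t100\t300\t%s\tE3\n' 3 10 > "$tmp/file2"

# Sort both inputs lexically on the join field (file1 column 1, file2 column 4).
LC_ALL=C sort "$tmp/file1" > "$tmp/file1.sorted"
LC_ALL=C sort -t "$tab" -k4,4 "$tmp/file2" > "$tmp/file2.sorted"

# Join on file1 column 1 and file2 column 4, printing only the file2 columns.
joined=$(LC_ALL=C join -t "$tab" -1 1 -2 4 -o 2.1,2.2,2.3,2.4,2.5 \
    "$tmp/file1.sorted" "$tmp/file2.sorted")

echo "$joined"
rm -r "$tmp"
```

Only the line with 10 in column 4 is common to both inputs, so that is the only line expected in the result. The output comes back in lexical key order; piping it through a numeric sort on column 4 would restore the original order if that matters.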

0

Thank you 5heikki! This is a new solution

ADD REPLY • link 6.4 years ago by Hughie ▴ 30

5

6.4 years ago

shenwei356 8.7k

This is simple with csvtk (https://github.com/shenwei356/csvtk):

csvtk grep -H -t -f 4 -P file1 file2 > result

ADD COMMENT • link 6.4 years ago by shenwei356 8.7k

0

Thank you! I will try this

ADD REPLY • link 6.4 years ago by Hughie ▴ 30

5

6.4 years ago

EagleEye 7.6k

I assume your file2 is TAB-delimited.

sed -i 's/^/\t/' file1.txt
sed -i 's/$/\t/' file1.txt
grep -F -f file1.txt file2.txt > combined.txt
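A non-destructive variant of the same idea, as a rough sketch: the tab-wrapped patterns are written to a separate file (here called patterns.txt, a name I made up) so that file1.txt stays unchanged. The sample data is also hypothetical.

```shell
# Hypothetical inputs: file1.txt lists 3 and 9; file2.txt has 1, 3, 5, 9 and 13 in column 4.
tmp=$(mktemp -d)
printf '%s\n' 3 9 > "$tmp/file1.txt"
printf 'chr1\t100\t300\t%s\tE3\n' 1 3 5 9 13 > "$tmp/file2.txt"

# Wrap each number in tabs, writing to a new file instead of editing file1.txt in place.
awk '{print "\t" $1 "\t"}' "$tmp/file1.txt" > "$tmp/patterns.txt"

# -F treats every pattern as a fixed string; -f reads the patterns from the given file.
matched=$(grep -F -f "$tmp/patterns.txt" "$tmp/file2.txt")

printf '%s\n' "$matched" | cut -f4
rm -r "$tmp"
```

The surrounding tabs keep 3 from matching 13, but note that this approach matches the number in any middle column, not only the 4th, so it is only safe when the other columns cannot contain the same values.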

ADD COMMENT • link 6.4 years ago by EagleEye 7.6k