Question

removing lines that don't match by grep

0

7.1 years ago

vinayjrao ▴ 260

Hello, I have a file containing gene names of interest (24423 genes), and another file containing the lengths of all the genes (41306 genes). I want the lengths for only the genes of interest, but when I use grep -wf file1 file2 or even fgrep -Fwf file1 file2, I get some extra genes. My list may contain only the sense or only the antisense gene, whereas the reference file may contain both, and both show up in the output.

I want to know if there is a way to remove from the reference file (file2) all the lines that don't match?

Thank you.

P.S. The question is also on stackoverflow.com

edit -

file1

A1BG

A1BG-AS1

TSPAN6

MYB

MYB-AS1

file2

A1BG 2941

A1BG-AS1 560

TSPAN6 7923

MYB-AS1 362

MYB-AS2 713

MYB-AS3 396

desired_output

A1BG 2941

A1BG-AS1 560

TSPAN6 7923

MYB-AS1 362

But I always get MYB-AS2 and MYB-AS3 as well.

grep file handling • 2.4k views

ADD COMMENT • link updated 7.1 years ago by michael.ante ★ 4.0k • written 7.1 years ago by vinayjrao ▴ 260

0

and you'll soon get some negative votes on stackoverflow because you don't show any sample of your files.

ADD REPLY • link 7.1 years ago by Pierre Lindenbaum 166k

0

Hi, can you post an example of your file1, file2 and desired output?

ADD REPLY • link 7.1 years ago by Paul ★ 1.5k

3

7.1 years ago

michael.ante ★ 4.0k

Hi, a simple join would be sufficient:

join file1 file2

ADD COMMENT • link 7.1 years ago by michael.ante ★ 4.0k

0

I tried this solution too, but I did not get the desired result. It gave me fewer genes than the awk output. How exactly does it work?

ADD REPLY • link 7.1 years ago by vinayjrao ▴ 260

3

It compares the first column of both files. join expects both files to be sorted on that column. If they are not, you'll need to sort them: join <(sort file1) <(sort -k1,1 file2)

[EDIT] It works with your example data
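For anyone who wants to check this on the example data from the question, here is a minimal sketch. The scratch directory, the `.sorted` file names and the `joined` variable are only assumptions for illustration, and temporary files are used instead of `<(...)` so that it also runs in a plain POSIX shell. Setting `LC_ALL=C` is one way to make sure sort and join agree on the ordering.

```shell
# Recreate the example files from the question in a scratch directory
# (directory and variable names exist only for this sketch).
workdir=$(mktemp -d)
printf '%s\n' A1BG A1BG-AS1 TSPAN6 MYB MYB-AS1 > "$workdir/file1"
printf '%s\n' 'A1BG 2941' 'A1BG-AS1 560' 'TSPAN6 7923' \
              'MYB-AS1 362' 'MYB-AS2 713' 'MYB-AS3 396' > "$workdir/file2"

# join needs both inputs sorted on the join field with the same collation.
# LC_ALL=C forces plain byte order for both sort calls and for join.
export LC_ALL=C
sort "$workdir/file1" > "$workdir/file1.sorted"
sort -k1,1 "$workdir/file2" > "$workdir/file2.sorted"

# Keep only the lines of file2 whose first column also appears in file1.
joined=$(join "$workdir/file1.sorted" "$workdir/file2.sorted")
printf '%s\n' "$joined"
```

The result comes out in sorted order rather than in the original order of file2, so MYB-AS1 is printed before TSPAN6. MYB has no entry in file2 and is simply dropped.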

ADD REPLY • link 7.1 years ago by michael.ante ★ 4.0k

0

this is the correct answer.

ADD REPLY • link 7.1 years ago by Pierre Lindenbaum 166k

score 1 · Accepted Answer · 2018-02-21

1

7.1 years ago

Paul ★ 1.5k

Hi, what about an awk solution:

awk 'FNR==NR {a[$1]; next} $1 in a' file1 file2

Desired output:

A1BG    2941
A1BG-AS1    560
TSPAN6  7923
MYB-AS1 362
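
In case the one-liner is hard to read, here is the same logic spread over several lines with comments, run on the example data from the question. This is only a sketch: the scratch directory, the array name `wanted` and the variable `kept` are names chosen for illustration.

```shell
# Recreate the example files from the question in a scratch directory
# (directory and variable names exist only for this sketch).
workdir=$(mktemp -d)
printf '%s\n' A1BG A1BG-AS1 TSPAN6 MYB MYB-AS1 > "$workdir/file1"
printf '%s\n' 'A1BG 2941' 'A1BG-AS1 560' 'TSPAN6 7923' \
              'MYB-AS1 362' 'MYB-AS2 713' 'MYB-AS3 396' > "$workdir/file2"

kept=$(awk '
    # While reading the first file, FNR (line number within the file)
    # equals NR (line number overall). Store each gene name as an
    # array key, then skip to the next input line.
    FNR == NR { wanted[$1]; next }

    # For the second file, print the line only when its first column
    # is exactly one of the stored gene names.
    $1 in wanted
' "$workdir/file1" "$workdir/file2")
printf '%s\n' "$kept"
```

Because the lookup compares the whole first column, MYB in file1 does not pull in MYB-AS2 or MYB-AS3, and the lines keep the order they have in file2 without any sorting.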

ADD COMMENT • link 7.1 years ago by Paul ★ 1.5k