Find Matched records in Multiple Files based on first column
2
0
Entering edit mode
8.0 years ago

I have three files with multiple columns but the I want to find the common first column entries. The desired output is at the end of the post. The data entries in files are displayed below.

First file
"OG0000000"
"OG0000001"
"OG0000003"
"OG0000004"
"OG0000005"

second file
"OG0000000"
"OG0000003"
"OG0000004"
"OG0000005"
"OG0000006"


Third file
"OG0000000"
"OG0000004"
"OG0000005"
"OG0000006"
"OG0000007"

output file
"OG0000000"
"OG0000004"
"OG0000005"
awk • 1.5k views
ADD COMMENT
0
Entering edit mode

I want to find the common first column entries

That means there are more columns, not shown in your example?

ADD REPLY
0
Entering edit mode

Yes there are multiple columns but I want to take out the list of the first column in a separate file so that I can grep them from individual files later.

ADD REPLY
0
Entering edit mode

So it's not a problem to lose the other columns now? In that case, something like this: (not tested)

cat <(cut -f1 file1.txt) <(cut -f1 file2.txt) | sort | uniq -d > file12.common.txt
cat file12.common.txt <(cut -f1 file3.txt) | sort | uniq -d > file123.common.txt

But this assumes your file is tab delimited.

ADD REPLY
2
Entering edit mode
8.0 years ago

I'll give you a quick but 'dirty' solution:

cat file1 file2 fil3 | cut -f1 | sort |  uniq -c | grep "3 " |  cut -d "\"" -f2
ADD COMMENT
0
Entering edit mode

If you sort uniq first column from each file separetely first, then this is great idea, otherwise it assumes each file has only single entry for every element in the first columns. So if the first file was:

"OG0000000"
"OG0000001"
"OG0000003"
"OG0000003"
"OG0000005"

Instead. Your command line would output the result below, right?

"OG0000000"
"OG0000003"
"OG0000005"
ADD REPLY
1
Entering edit mode

Sure. That's why it's a quick n "dirty" solution. No bioinformatician will hard code grep "3 " in his/her program :)

ADD REPLY
0
Entering edit mode
7.6 years ago

One command using csvtk for tab-delimited files. Usage of csvtk inter.

$ csvtk inter --no-header-row --tabs file*
OG0000000
OG0000004
OG0000005
ADD COMMENT

Login before adding your answer.

Traffic: 1836 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6