Merging two linux files based on same rows
3
0
Entering edit mode
3.6 years ago
salman_96 ▴ 70

Hi, I have two files, one file (file1.txt) has snp coordinates and rs IDs. This file is obtained from Hg19 file.

1:10150:C:T     rs371194064

1:10165:A:AC    rs796884232

1:10177:A:AC    rs367896724

1:10177:A:C     rs201752861

1:10180:T:C     rs201694901

1:10199:A:T     rs905327004

1:10228:TA:T    rs143255646

The other file (file2.txt) that I have has coordinates only and looks like this

1:838916:A:T

1:839461:T:C

1:839495:G:T

1:839528:A:G

1:839529:T:G

1:840353:G:C

I want to merge both files based on identical coordinates between them. The first file is a large file where as the second file has a few million snps. i want to find the snp rsIDs for the second file by merging it with the first

I have used this command but it generates a file that has everything in it including the content of file 2 but I only want the matching rows between them.

paste -d " "  file1.txt file2.txt > Final_rsIDs.txt
rsID coordinates plinkfile Merge • 1.4k views
ADD COMMENT
0
Entering edit mode
$ join -1 1 -2 1 file1.txt file2.txt

join needs file to be sorted. If program throws an error on sorting, sort both the files and join.

ADD REPLY
0
Entering edit mode
3.6 years ago
Zhilong Jia ★ 2.2k

awk 'NR==FNR{data[$1]=$2; next}{print ($1, data[$1]) }' file1.txt file2.txt

ADD COMMENT
0
Entering edit mode

Common column is first column, not 2nd. Correct syntax would be awk 'FNR==NR {data[$1]=$1; next} $1 in data' file1.txt file2.txt. Since both the files do not share any common values, it would be blank.

$ awk 'NR==FNR{data[$1]=$2; next}{print ($1, data[$1]) }' file1.txt file2.txt

1:838916:A:T 
1:839461:T:C 
1:839495:G:T 
1:839528:A:G 
1:839529:T:G 
1:840353:G:C 

$ awk  'FNR==NR {data[$1]=$1; next} $1 in data' file1.txt file2.txt 

No output

If I add 1:10150:C:T from column 1 of file 1, to file2, output is as follows:

$ awk 'NR==FNR{data[$1]=$2; next}{print ($1, data[$1]) }' file1.txt file2.txt

1:838916:A:T 
1:839461:T:C 
1:839495:G:T 
1:839528:A:G 
1:839529:T:G 
1:840353:G:C 
1:10150:C:T rs371194064

$ awk  'FNR==NR {data[$1]=$1; next} $1 in data' file1.txt file2.txt           
1:10150:C:T

However user wants to print matching lines from file 1, it would be:

$ awk 'NR==FNR {a[$1]; next} $1 in a {print}' file2.txt file1.txt
ADD REPLY
0
Entering edit mode
3.6 years ago
Colaptes ▴ 100

Just to check that I understand, you are trying to extract the rows from file1 that match the coordinates in file2 (rather than trying to paste the file2 rows together with the file1 rows)?

It sounds like you could do this with grep. You can give file2 to grep and then it will look through file1 to find any rows that match what is in file2:

grep -f file2.txt file1.txt > Final_rsIDs.txt
ADD COMMENT
0
Entering edit mode
3.6 years ago
Carambakaracho ★ 3.3k

You can use grep with patterns from another file input

grep -f file2.txt file1.txt

You might want to play around with the patterns like adding a ^ in front to make it more explicit that a row must start with this pattern

sed 's/^/\^/' <file2.txt >file2.mod.txt
ADD COMMENT

Login before adding your answer.

Traffic: 1931 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6