Merge 2 files based on rs number but keep all info only from the first file
5
0
Entering edit mode
9.2 years ago
joshf ▴ 50

I would like to merge two files by based on one column (rs number). Not all rs numbers in file 1 are in file 2, and not all in file 2 are in file 1. However, I wanna keep ALL lines from file 1 in the output.

File 1 (example):

rs    chr    bp    p
rs1    chr1    729679    0.9456
rs2    chr1    731718    0.7125
rs3    chr1    734349    0.7582
rs4    chr1    736289    0.7807
rs5    chr1    748878    0.3864

File 2 (example):

RSID CODING UTR PROMOTER DHS INTRON
rs2 0 0 0 0 1
rs3 0 0 0 0 1
rs1 0 0 1 1 0
rs4 0 0 0 1 0
rs6 0 1 0 1 0

In the case that an rs number is not found in file 2, I want to give this rs number the annotation: 0 0 0 0 0. so, hypothetically... based on my examples, this should be my output

Output (example):

rs    chr    bp    p    CODING    UTR    PROMOTER    DHS    INTRON
rs1    chr1    729679    0.9456    0    0    1    1    0
rs2    chr1    731718    0.7125    0    0    0    0    1
rs3    chr1    734349    0.7582    0    0    0    0    1
rs4    chr1    736289    0.7807    0    0    0    1    0
rs5    chr1    748878    0.3864    0    0    0    0    0

Merging files is fairly simple with a function like "join" but I have no idea how to keep the lines in file 1 without a column match with file 2, and in such a case, add the 0 0 0 0 0 annotation.

I really hope my question is clear, and any help is appreciated.

SNP awk • 4.7k views
ADD COMMENT
4
Entering edit mode
9.2 years ago
Kamil ★ 2.3k

If you use Python, check out the pandas documentation for merging dataframes.

You might also consider writing a simple script:

ADD COMMENT
3
Entering edit mode
9.2 years ago

use join with option -a

join -t ' ' -1 1 -2 1 -a 1 <(sort -t ' ' -k1,1 input1.txt) <(sort -t ' ' -k1,1 input2.txt) | awk -F ' ' '{printf("%s",$0); if(NF==4) printf(" 0 0 0 0 0");printf("\n");}'

rs1 chr1 729679 0.9456 0 0 1 1 0
rs2 chr1 731718 0.7125 0 0 0 0 1
rs3 chr1 734349 0.7582 0 0 0 0 1
rs4 chr1 736289 0.7807 0 0 0 1 0
rs5 chr1 748878 0.3864 0 0 0 0 0
RSID chr bp p CODING UTR PROMOTER DHS INTRON
ADD COMMENT
3
Entering edit mode

I stumbled over a join command without the need of an extra awk call:

join -a 1 -t ' ' -e '0' -o '0,1.2,1.3,1.4,2.2,2.3,2.4,2.5' <(sort -t ' ' -k1,1 input1.txt) <(sort -t ' ' -k1,1 input2.txt)

It was said that the -e (Empty) option needs -o (output sequence). The 0 in -o is the joining field, followed by the file number and its column. Thus if file number 2 has a missing entry, the -e argument is put on each field defined in -o 2.X.

ADD REPLY
0
Entering edit mode

cool, I didn't know this option -o

ADD REPLY
1
Entering edit mode
9.2 years ago
Zhilong Jia ★ 2.2k

In R with the help of dplyr package,

# 1.t and 2.t are your two files

x1 <- read.table("./1.t", header=TRUE)
x2 <- read.table("./2.t", header=TRUE)

x3 <- dplyr::left_join(x1, x2, by=c("rs"="RSID"))

# Change NA to 0
x3[whichis.na(x3), arr.ind=TRUE)]=0

Show x3:

   rs  chr     bp      p CODING UTR PROMOTER DHS INTRON
1 rs1 chr1 729679 0.9456      0   0        1   1      0
2 rs2 chr1 731718 0.7125      0   0        0   0      1
3 rs3 chr1 734349 0.7582      0   0        0   0      1
4 rs4 chr1 736289 0.7807      0   0        0   1      0
5 rs5 chr1 748878 0.3864      0   0        0   0      0
ADD COMMENT
1
Entering edit mode
9.2 years ago
matteof ▴ 10

Really fast with Python:

D = {}
f = open(FIRST_FILE, "r")
for l in f: D[ l.split("\t")[0].strip() ] = l.strip()
f.close()

f = open(SECOND_FILE, "r")
for l in f:
    rsid = l.split("\t")[0].strip()
try:
    print l.strip() + "\t" + D[rsid]
f.close()
ADD COMMENT
1
Entering edit mode
9.2 years ago

and the awk alternative

awk '{
if (/^rs[0-9]/) { if (FNR==NR) f1[$1]=$0; else f2[$1]=$0 }
} END {
for (rs in f1) {
 if (f2[rs]) { print f1[rs],f2[rs] }
 else { print f1[rs],rs,0,0,0,0,0 }
}
}' file1.txt file2.txt \
| perl -pe 's/ +/\t/g' | cut -f-4,6- | sort -V
ADD COMMENT

Login before adding your answer.

Traffic: 1930 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6