Merge 2 files based on rs number but keep all info only from the first file

0

Entering edit mode

9.8 years ago

joshf ▴ 50

I would like to merge two files by based on one column (rs number). Not all rs numbers in file 1 are in file 2, and not all in file 2 are in file 1. However, I wanna keep ALL lines from file 1 in the output.

File 1 (example):

rs    chr    bp    p
rs1    chr1    729679    0.9456
rs2    chr1    731718    0.7125
rs3    chr1    734349    0.7582
rs4    chr1    736289    0.7807
rs5    chr1    748878    0.3864

File 2 (example):

RSID CODING UTR PROMOTER DHS INTRON
rs2 0 0 0 0 1
rs3 0 0 0 0 1
rs1 0 0 1 1 0
rs4 0 0 0 1 0
rs6 0 1 0 1 0

In the case that an rs number is not found in file 2, I want to give this rs number the annotation: 0 0 0 0 0. so, hypothetically... based on my examples, this should be my output

Output (example):

rs    chr    bp    p    CODING    UTR    PROMOTER    DHS    INTRON
rs1    chr1    729679    0.9456    0    0    1    1    0
rs2    chr1    731718    0.7125    0    0    0    0    1
rs3    chr1    734349    0.7582    0    0    0    0    1
rs4    chr1    736289    0.7807    0    0    0    1    0
rs5    chr1    748878    0.3864    0    0    0    0    0

Merging files is fairly simple with a function like "join" but I have no idea how to keep the lines in file 1 without a column match with file 2, and in such a case, add the 0 0 0 0 0 annotation.

I really hope my question is clear, and any help is appreciated.

SNP awk • 5.0k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.8 years ago by joshf ▴ 50

4

Entering edit mode

9.8 years ago

Kamil ★ 2.3k

If you use Python, check out the pandas documentation for merging dataframes.

You might also consider writing a simple script:

	# https://www.biostars.org/p/162853

	cache = {}

	f2 = open('f2')
	header2 = f2.next()

	for line in f2:
	fields = line.strip().split()
	snp = fields[0]
	cache[snp] = '\t'.join(fields[1:])

	f1 = open('f1')
	header1 = f1.next()
	print header1.strip() + '\t' + '\t'.join(header2.split()[1:])

	for line in f1:
	line = line.strip()
	fields = line.split()
	snp = fields[0]
	if snp in cache:
	print '{}\t{}'.format(line, cache[snp])
	else:
	print '{}\t0\t0\t0\t0\t0'.format(line)

view raw merge.py hosted with ❤ by GitHub

ADD COMMENT • link 9.8 years ago by Kamil ★ 2.3k

3

Entering edit mode

9.8 years ago

Pierre Lindenbaum 166k

use join with option -a

join -t ' ' -1 1 -2 1 -a 1 <(sort -t ' ' -k1,1 input1.txt) <(sort -t ' ' -k1,1 input2.txt) | awk -F ' ' '{printf("%s",$0); if(NF==4) printf(" 0 0 0 0 0");printf("\n");}'

rs1 chr1 729679 0.9456 0 0 1 1 0
rs2 chr1 731718 0.7125 0 0 0 0 1
rs3 chr1 734349 0.7582 0 0 0 0 1
rs4 chr1 736289 0.7807 0 0 0 1 0
rs5 chr1 748878 0.3864 0 0 0 0 0
RSID chr bp p CODING UTR PROMOTER DHS INTRON

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by Pierre Lindenbaum 166k

3

Entering edit mode

I stumbled over a join command without the need of an extra awk call:

join -a 1 -t ' ' -e '0' -o '0,1.2,1.3,1.4,2.2,2.3,2.4,2.5' <(sort -t ' ' -k1,1 input1.txt) <(sort -t ' ' -k1,1 input2.txt)

It was said that the -e (Empty) option needs -o (output sequence). The 0 in -o is the joining field, followed by the file number and its column. Thus if file number 2 has a missing entry, the -e argument is put on each field defined in -o 2.X.

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by michael.ante ★ 4.0k

0

Entering edit mode

cool, I didn't know this option -o

ADD REPLY • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

9.8 years ago

Zhilong Jia ★ 2.2k

In R with the help of dplyr package,

# 1.t and 2.t are your two files

x1 <- read.table("./1.t", header=TRUE)
x2 <- read.table("./2.t", header=TRUE)

x3 <- dplyr::left_join(x1, x2, by=c("rs"="RSID"))

# Change NA to 0
x3[whichis.na(x3), arr.ind=TRUE)]=0

Show x3:

   rs  chr     bp      p CODING UTR PROMOTER DHS INTRON
1 rs1 chr1 729679 0.9456      0   0        1   1      0
2 rs2 chr1 731718 0.7125      0   0        0   0      1
3 rs3 chr1 734349 0.7582      0   0        0   0      1
4 rs4 chr1 736289 0.7807      0   0        0   1      0
5 rs5 chr1 748878 0.3864      0   0        0   0      0

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by Zhilong Jia ★ 2.2k

1

Entering edit mode

9.8 years ago

matteof ▴ 10

Really fast with Python:

D = {}
f = open(FIRST_FILE, "r")
for l in f: D[ l.split("\t")[0].strip() ] = l.strip()
f.close()

f = open(SECOND_FILE, "r")
for l in f:
    rsid = l.split("\t")[0].strip()
try:
    print l.strip() + "\t" + D[rsid]
f.close()

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by matteof ▴ 10

1

Entering edit mode

9.8 years ago

Jorge Amigo 14k

and the awk alternative

awk '{
if (/^rs[0-9]/) { if (FNR==NR) f1[$1]=$0; else f2[$1]=$0 }
} END {
for (rs in f1) {
 if (f2[rs]) { print f1[rs],f2[rs] }
 else { print f1[rs],rs,0,0,0,0,0 }
}
}' file1.txt file2.txt \
| perl -pe 's/ +/\t/g' | cut -f-4,6- | sort -V

ADD COMMENT • link updated 5.7 years ago by Ram 45k • written 9.8 years ago by Jorge Amigo 14k

Login before adding your answer.