How to add column to a file from from another file having one common column but different number of rows
3
0
Entering edit mode
7.0 years ago
AP ▴ 80

Hello everyone,

I have File 1 with 6 columns:

**geneId** "\t" **scaffold** "\t"  **LocusID** "\t" **pfam** "\t" **PfamID** "\t" **function**

1     "\t"     aa   "\t"  XL_012 "\t" pfam "\t" P1234 "\t" .............  

2      "\t"  bb "\t" XL_013 "\t" pfam "\t" P1324"\t ...........

3  "\t"  cc "\t" XL_014 "\t" pfam "\t" P1624"\t ...........

3"\t"  cc "\t" XL_014 "\t" pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" pfam "\t" P1326"\t ...........

4  "\t"  dd "\t" XL_015 "\t" pfam "\t" P1326"\t ...........  

5 "\t" ee"\t" XL_016

File 2 with 2 columns

**LocusID** "\t" **Expression**

XL_012  "\t"  23

XL_013 "\t"  20

XL_014 "\t"  15

XL_015 "\t"  10

XL_016 "\t"  9

Desired output : I want to merge column expression to file1 based on locus ID:

**geneId** "\t" **scaffold** "\t"  **LocusID** "\t" **Expression** "\t" **pfam** "\t" **PfamID** "\t" **function**

1     "\t"     aa   "\t"  XL_012 "\t"  23 "\t"  pfam "\t" P1234 "\t" .............  

2      "\t"  bb "\t" XL_013 "\t" 20 "\t" pfam "\t" P1324"\t ...........

3  "\t"  cc "\t" XL_014 "\t" 15 "\t" pfam "\t" P1624"\t ...........

3"\t"  cc "\t" XL_014 "\t" 15 "\t"pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

4"\t"  dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

4  "\t"  dd "\t" XL_015 "\t"  10 "\t" pfam "\t" P1326"\t ...........  

5 "\t" ee"\t" XL_016 "\t"9

I could do this easily if I had same number of rows for both files but here since I have different number of rows and some repetitions of locus ID due to many pfam domains for single gene I am super confused. Please help me figure out this one using awk, perl or bash. Many thanks in Advance!!!!

Thank you,

Ambika

awk text_processing RNA-Seq • 3.0k views
ADD COMMENT
0
Entering edit mode

Man! Where does this come from? I am eager to know ;)

ADD REPLY
1
Entering edit mode
7.0 years ago

I wasn't sure of the purpose of all of the tab-spaces and quotation marks were.. so I first removed them with sed commands.

sed 's/ \+//g' file1.txt | sed 's/"\\t"/\t/g' | sed 's/"\\t//g' | sed '/^\s*$/d' | sed 's/[\\.]*$//g' | cut -f1-5 > file1v2.txt 
sed 's/ \+//g' file2.txt | sed 's/"\\t"/\t/g' | sed '/^\s*$/d' > file2v2.txt

cat file1v2.txt
**geneId**  **scaffold**    **LocusID** **pfam**    **PfamID**
1   aa  XL_012  pfam    P1234
2   bb  XL_013  pfam    P1324
3   cc  XL_014  pfam    P1624
3   cc  XL_014  pfam    P1326
4   dd  XL_015  pfam    P1326
4   dd  XL_015  pfam    P1326
4   dd  XL_015  pfam    P1326
5   ee  XL_016

cat file2v2.txt
**LocusID** **Expression**
XL_012  23
XL_013  20
XL_014  15
XL_015  10
XL_016  9


awk 'BEGIN {FS="\t"} FNR==NR {key=$3; arrayLookup[key]=$0; next} {key=$1; if (arrayLookup[key]) print $2"\t"arrayLookup[key]}' file1v2.txt file2v2.txt 
**Expression**  **geneId**  **scaffold**    **LocusID** **pfam**    **PfamID**
23              1           aa              XL_012      pfam        P1234
20              2           bb              XL_013      pfam        P1324
15              3           cc              XL_014      pfam        P1326
10              4           dd              XL_015      pfam        P1326
9               5           ee              XL_016
ADD COMMENT
0
Entering edit mode

Hello Kevin, I tried your codes however it is not giving me any output. Does it make any difference if I don't have first heading lines?

ADD REPLY
0
Entering edit mode

Thank you so much kevin, It worked now but I only problem I have is number of rows for output file, which is less than file 1. I want the same number of rows as file1 because some of pfam domains are unique even though they are from same loci.

Please help and many thanks.

ADD REPLY
1
Entering edit mode

Oh, I see! Try this (it may require another modification):

awk 'BEGIN {FS="\t"} FNR==NR {key=$1; arrayLookup[key]=$0; next} {key=$3; if (arrayLookup[key]) print arrayLookup[key]"\t"$0}' file2v2.txt file1v2.txt
**LocusID** **Expression**  **geneId**  **scaffold**    **LocusID** **pfam**    **PfamID**
XL_012      23              1           aa              XL_012      pfam        P1234
XL_013      20              2           bb              XL_013      pfam        P1324
XL_014      15              3           cc              XL_014      pfam        P1624
XL_014      15              3           cc              XL_014      pfam        P1326
XL_015      10              4           dd              XL_015      pfam        P1326
XL_015      10              4           dd              XL_015      pfam        P1326
XL_015      10              4           dd              XL_015      pfam        P1326
XL_016       9              5           ee              XL_016
ADD REPLY
1
Entering edit mode

Thank you so much for your help Kevin!!!!

ADD REPLY
0
Entering edit mode
7.0 years ago
venu 7.1k

Not a bioinformatics question. You can try the following

join <(cat file1.txt | awk -F'\t' '{print $2 "\t" $0}' | sort) <(cat file2.txt | sort) | tr ' ' '\t' | cut -f1 --complement

P.S. I didn't test it as your example is very hard to reproduce. I guess, you need to select columns properly with cut command

ADD COMMENT
0
Entering edit mode

I tried but this is not giving me any output and also I want the lines to be in same order.

ADD REPLY
0
Entering edit mode

post the out put of following commands

head file1.txt file2.txt
ADD REPLY
0
Entering edit mode

head file1.txt:

XL_012 "\t" 23

XL_013 "\t" 20

XL_014 "\t" 15

XL_015 "\t" 10

XL_016 "\t" 9

head file2.txt:

1 "\t" aa "\t" XL_012 "\t" 23 "\t" pfam "\t" P1234 "\t" .............

2 "\t" bb "\t" XL_013 "\t" 20 "\t" pfam "\t" P1324"\t ...........

3 "\t" cc "\t" XL_014 "\t" 15 "\t" pfam "\t" P1624"\t ...........

3"\t" cc "\t" XL_014 "\t" 15 "\t"pfam "\t" P1326"\t ...........

4"\t" dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

4"\t" dd "\t" XL_015 "\t" 10 "\t"pfam "\t" P1326"\t ...........

ADD REPLY
0
Entering edit mode

What kind of a format is this? "\t" should mean a tab separated file but your file contains \t' instead oftab spaces`. Please try my answer on a tab separated file. It should work.

ADD REPLY
0
Entering edit mode

@venu It is a tab separated file. I wrote it like that so it would be easy. I will try and keep you updated. Thanks

ADD REPLY
0
Entering edit mode

You don't need to add anything extra while posting some example data. Just paste few lines as it is and just mention it is a tab / space separated file.

ADD REPLY
0
Entering edit mode

If you are familiar with R, the table joining functions in the dplyr package may be helpful.

ADD REPLY
0
Entering edit mode

@mastal511, I am not an expert but I know little bit of it. Could you suggest what kind of scripts I can use?

ADD REPLY
0
Entering edit mode

Have a look at this tutorial for the join functions:

http://stat545.com/bit001_dplyr-cheatsheet.html

ADD REPLY
0
Entering edit mode
7.0 years ago

output (remove uniq if output requires duplicated lines):

 $ join -t$'\t' -o 1.1,1.2,1.3,2.2,1.4,1.5  --header -1 3 -2 1 <(sort -k3 test1.txt|uniq -u) <(sort -k1 test2.txt) 

geneId  scaffold    LocusID Expression  pfam    PfamID
1   aa  XL_012  23  pfam    P1234
2   bb  XL_013  20  pfam    P1324
3   cc  XL_014  15  pfam    P1326
3   cc  XL_014  15  pfam    P1624
4   dd  XL_015  10  pfam    P1326

better output:

$ join -t$'\t' -o 1.2,1.3,1.1,2.2,1.4,1.5  --header -1 1 -2 1 <(datamash -g 3 unique 1,2,4,5 < test1.txt) <(sort -k1 test2.txt)

geneId  scaffold    LocusID Expression  pfam    PfamID
1   aa  XL_012  23  pfam    P1234
2   bb  XL_013  20  pfam    P1324
3   cc  XL_014  15  pfam    P1326,P1624
4   dd  XL_015  10  pfam    P1326

After cleaning input:

$ cat test1.txt 
geneId  scaffold    LocusID pfam    PfamID  
1   aa  XL_012  pfam    P1234   
2   bb  XL_013  pfam    P1324   
3   cc  XL_014  pfam    P1624   
3   cc  XL_014  pfam    P1326   
4   dd  XL_015  pfam    P1326   
4   dd  XL_015  pfam    P1326   
4   dd  XL_015  pfam    P1326

$ cat test2.txt 
cusID   Expression  
XL_012  23
XL_013  20
XL_014  15
XL_015  10
XL_016  9
ADD COMMENT
0
Entering edit mode

Thank you for your suggestions but I don't know why it is not recognizing datamash in my system. It says it is not found.

ADD REPLY
0
Entering edit mode

If you are using Ubuntu or mint or Debian, sudo apt-get install datamash -y should install the datamash package.

ADD REPLY

Login before adding your answer.

Traffic: 2662 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6