check if two columns match in 2 annotation files and print those lines to a new output file
2
0
Entering edit mode
9.6 years ago

I have 2 gff3 files. I want to extract genes with same gene lengths in both the files. If they are same, I want to write them to a third file

example:

file 1

C20775336       maker   gene    1895    2166    .       -       .       ID=gene1;Name=maker-C20
C20775336       maker   gene    895    1166    .       -       .       ID=mRNA1;Parent=gene1;N
C20775336       maker   gene    1895    1962    .       -       .       Parent=mRNA1
C20775336       maker   gene     2795    2962    .       -       2       ID=CDS1;Parent=mRNA1

file 2

C20775336       maker   gene    895    1166    .       -       .       ID=gene1;Name=maker-C20
C20775936       maker   gene    1895    2166    .       -       .       ID=mRNA1;Parent=gene1;N
C20775336       maker   gene    2795    2962    .       -       .       Parent=mRNA1

Output file

C20775336       maker   gene    895    1166    .       -       .       ID=gene1;Name=maker-C20
C20775336       maker   gene    2795    2962    .       -       .       Parent=mRNA1

So if columns 4 and 5 in first file is equal to column 4 and 5 in second file, print that line from second file to a new output file.

I tried to work on awk, to achieve this, but I could not figure out.

Thanks in advance

sequence • 23k views
ADD COMMENT
0
Entering edit mode

Joining Multiple Fields Using Unix Join (from SO) gives very nice solutions using awk and join. If I have to combine multiple files in unix I use what Michael Mrozek suggested - combine fields using awk and then join files according one key field with join.

ADD REPLY
0
Entering edit mode

I have same query also, but if i don't care about rest of first column, only first column should match and print number of match found in first file and number of match found in second file.

thanks

ADD REPLY
0
Entering edit mode

For quicker response you should ask this general programming question on SO.

ADD REPLY
0
Entering edit mode

Sorry...I did not get about SO

ADD REPLY
0
Entering edit mode
ADD REPLY
6
Entering edit mode
9.6 years ago
venu 7.1k

Is this what you need?

​awk 'FNR==NR{a[$4,$5]=$0;next}{if(b=a[$4,$5]){print b}}' file1 file2
ADD COMMENT
0
Entering edit mode

+1 for a pure Awk solution (and a creative usage of FNR). Would you care to explain what the command does for people who are not familiar with Awk?

ADD REPLY
0
Entering edit mode

For some reason, the above command does not work for me, I get a error message bash: ​awk: command not found..

ADD REPLY
0
Entering edit mode

I tried to work on awk.

This statement from your question letting us to assume you have used awk on your machine initially, if so the above one-liner should work. Type awk -Vin terminal.

ADD REPLY
0
Entering edit mode

@Fr&édéric Mahé Two variables FNRand NRstores the number of records in the current file and total number of records respectively. $0 is to hold entire line. Rest is as simple as any other comparison, matching column 4 and 5 from file1 (a) with file2 (b), if matched, print entire line from file2 (b).

ADD REPLY
0
Entering edit mode

Hi, what if I need the inverse of this? I tried awk 'FNR==NR{a[$4,$5]=$0;next}{if(b!=a[$4,$5]){print b}}' file1 file2 but all I got was an empty file.

ADD REPLY
0
Entering edit mode
9.6 years ago
mark.ziemann ★ 1.9k

Try this:

##First sort each of the input files on the 1st column
sort -k 1b,1 file2 > file2s
sort -k 1b,1 file1 > file1s

##Now they can be joined on the 1st field, followed by awk 1 line
join -1 1 -2 1 file1s file2s \
| awk '{OFS="\t"} $5-$4==$13-$12' > outputfile
ADD COMMENT
0
Entering edit mode

Thank you, your solution works perfectly. However it joins the lines from both the files. Is there a way to skip that?

For example, it did some thing like this on my original file

scaffold1732 baker gene 85677 87592 . + . ID=gene9;Name=maker-gnlscaffold1732-augustus-gene-0.6 maker gene 85677 87592 . + . ID=gene9;Name=maker-gnl|Hsal_3.3|scaffold1732-augustus-gene-0.6

instead can I get output lines like this

scaffold1732 baker gene 85677 87592 . + . ID=gene9;Name=maker-
ADD REPLY
0
Entering edit mode

With join you can specify wanted output (simple examples).

ADD REPLY

Login before adding your answer.

Traffic: 2040 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6