How to parse a data file
3.0 years ago
AP ▴ 80

Hello all,

I have a file "file1.txt", which initially looks like this:

Orthogroup      F105    F109    F23     F79     HDV247  T415
OG0006155       F105|108872
OG0006156       F105|114651
OG0006157       F105|115307
OG0006158       F105|121488
OG0006551               F109|843828
OG0006552               F109|844465
OG0006553               F109|845048
OG0006557                       F23|102768
OG0006558                       F23|106636
OG0006559                       F23|108691
OG0006560                       F23|108697
OG0006841                               F79|103483
OG0006842                               F79|103507
OG0006843                               F79|165341
OG0006844                               F79|175705
OG0006990                                       HDV247|10004
OG0006991                                       HDV247|1003
OG0006992                                       HDV247|10048
OG0006993                                       HDV247|10077
OG0006994                                       HDV247|10100
OG0006995                                       HDV247|10102
OG0008562                                               T415|110675
OG0008563                                               T415|115534

I am trying to assign 1 or 0 to each of these columns depending on whether a gene is present or absent, so the output would look something like this:

Orthogroup  F105    F109    F23 F79 HDV247  T415
OG0006155   1   0   0   0   0   0
OG0006156   1   0   0   0   0   0
OG0006157   1   0   0   0   0   0
OG0006158   1   0   0   0   0   0
OG0006551   0   1   0   0   0   0
OG0006552   0   1   0   0   0   0
OG0006553   0   1   0   0   0   0
OG0006557   0   0   1   0   0   0
OG0006558   0   0   1   0   0   0
OG0006559   0   0   1   0   0   0
OG0006560   0   0   1   0   0   0
OG0006841   0   0   0   1   0   0
OG0006842   0   0   0   1   0   0
OG0006843   0   0   0   1   0   0
OG0006844   0   0   0   1   0   0
OG0006990   0   0   0   0   1   0
OG0006991   0   0   0   0   1   0
OG0006992   0   0   0   0   1   0
OG0006993   0   0   0   0   1   0
OG0006994   0   0   0   0   1   0
OG0006995   0   0   0   0   1   0
OG0008562   0   0   0   0   0   1
OG0008563   0   0   0   0   0   1

So far I have been able to replace each column with 1 separately, using the following code:

grep "F105" file1.txt | sed 's/F105|[0-9]*/1/g' > F105.genecount
grep "F109" file1.txt | sed 's/F109|[0-9]*/1/g' > F109.genecount

This assigns the number "1" to genes that are present. The problem is that I have to create multiple files, and when I concatenate them at the end, the "1" appears only in the second column rather than in each gene's own column. How can I get the desired output in a neat way? Please help.

awk sed grep • 1.1k views
3.0 years ago
JC 13k

Perl-one-liner:

$ perl -lane '@a=split(/\t/, $_); for($i=1;$i<=6;$i++){ unless (/^Ortho/) { if($a[$i]=~/\w+/) {$a[$i]=1} else {$a[$i]=0} } }; print join "\t", @a;' < input.tsv 
Orthogroup  F105    F109    F23 F79 HDV247  T415
OG0006155   1   0   0   0   0   0
OG0006156   1   0   0   0   0   0
OG0006157   1   0   0   0   0   0
OG0006158   1   0   0   0   0   0
OG0006551   0   1   0   0   0   0
OG0006552   0   1   0   0   0   0
OG0006553   0   1   0   0   0   0
OG0006557   0   0   1   0   0   0
OG0006558   0   0   1   0   0   0
OG0006559   0   0   1   0   0   0
OG0006560   0   0   1   0   0   0
OG0006841   0   0   0   1   0   0
OG0006842   0   0   0   1   0   0
OG0006843   0   0   0   1   0   0
OG0006844   0   0   0   1   0   0
OG0006990   0   0   0   0   1   0
OG0006991   0   0   0   0   1   0
OG0006992   0   0   0   0   1   0
OG0006993   0   0   0   0   1   0
OG0006994   0   0   0   0   1   0
OG0006995   0   0   0   0   1   0
OG0008562   0   0   0   0   0   1
OG0008563   0   0   0   0   0   1

Thank you JC this code works!!!

3.0 years ago

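Something like this (not tested). Note the -F'\t': without it awk splits on runs of whitespace and the empty columns collapse. The column count is taken from the header so that rows with missing trailing columns still get their zeros.

awk -F'\t' '(NR==1) {n=NF; print; next} {printf("%s",$1); for(i=2;i<=n;i++) {printf("\t%s",($i==""?"0":"1"));} printf("\n");}' input.tsv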
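One thing worth checking before running a command like this on real data: awk splits on runs of whitespace by default, so the consecutive tabs around empty cells collapse and the columns shift. A minimal sketch of the difference, assuming the input is tab-separated with one tab per column (the file name sample.tsv and its two data rows are made up for illustration):

```shell
# Build a tiny tab-separated sample (hypothetical data modelled on the question).
printf 'Orthogroup\tF105\tF109\tF23\n' >  sample.tsv
printf 'OG1\tF105|1\t\t\n'            >> sample.tsv
printf 'OG2\t\tF109|2\t\n'            >> sample.tsv

# Default splitting: empty cells vanish, so each data row has only 2 fields.
awk 'NR>1 {print NF}' sample.tsv

# Tab splitting: empty cells are kept as empty fields, so each data row has 4.
awk -F'\t' 'NR>1 {print NF}' sample.tsv
```

If the second command reports the same field count as the header, a per-field test such as $i=="" lines up with the right columns.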

@Pierre I was able to run this code, but the last column is wrong.

Orthogroup      F105    F109    F23     F79     HDV247  T415
OG0006153       1       0       0       0       0       1
OG0006154       1       0       0       0       0       1
OG0006155       1       0       0       0       0       1
OG0006156       1       0       0       0       0       1
OG0006157       1       0       0       0       0       1
OG0006158       1       0       0       0       0       1
OG0006159       1       0       0       0       0       1
OG0006160       1       0       0       0       0       1
OG0006161       1       0       0       0       0       1
OG0006162       1       0       0       0       0       1
OG0006163       1       0       0       0       0       1
OG0006164       1       0       0       0       0       1
OG0006165       1       0       0       0       0       1
OG0006166       1       0       0       0       0       1

It's printing 1 in the last column for every line.
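The thread does not say what caused this. One possibility (an assumption, not confirmed by the poster) is that the last field of each line is not truly empty, for example because the file has Windows line endings and a carriage return survives in the final column. That would also fit the Perl one-liner working, since it tests for word characters rather than for an empty string. A sketch that strips a trailing carriage return and applies the same kind of test in awk; the function name to_presence and the file crlf_sample.tsv are hypothetical:

```shell
# Sketch: presence/absence matrix from a tab-separated table.
# Assumptions: one tab per column, and possibly CRLF line endings.
to_presence() {
  awk -F'\t' '
    { sub(/\r$/, "") }                 # drop a trailing carriage return, if any
    NR == 1 { n = NF; print; next }    # header: remember column count, print it
    {
      printf "%s", $1
      for (i = 2; i <= n; i++)         # loop to header width so short rows get 0s
        printf "\t%s", ($i ~ /[A-Za-z0-9_]/ ? 1 : 0)
      printf "\n"
    }
  ' "$1"
}

# Made-up sample with CRLF line endings to exercise the function.
printf 'Orthogroup\tF105\tT415\r\nOG1\tF105|1\t\r\nOG2\t\tT415|2\r\n' > crlf_sample.tsv
to_presence crlf_sample.tsv
```

On this sample the last column comes out as 0 for OG1 and 1 for OG2. Running `od -c file1.txt | head` is one way to check whether the real file actually contains \r characters before relying on this explanation.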
