Question

How to parse a data file

0

3.7 years ago

AP ▴ 80

Hello all,

I have a file "file1.txt" which initially looks like this:

Orthogroup      F105    F109    F23     F79     HDV247  T415
OG0006155       F105|108872
OG0006156       F105|114651
OG0006157       F105|115307
OG0006158       F105|121488
OG0006551               F109|843828
OG0006552               F109|844465
OG0006553               F109|845048
OG0006557                       F23|102768
OG0006558                       F23|106636
OG0006559                       F23|108691
OG0006560                       F23|108697
OG0006841                               F79|103483
OG0006842                               F79|103507
OG0006843                               F79|165341
OG0006844                               F79|175705
OG0006990                                       HDV247|10004
OG0006991                                       HDV247|1003
OG0006992                                       HDV247|10048
OG0006993                                       HDV247|10077
OG0006994                                       HDV247|10100
OG0006995                                       HDV247|10102
OG0008562                                               T415|110675
OG0008563                                               T415|115534

I am trying to assign 1 or 0 to each of these columns depending on whether a gene is present or absent, so the output would look something like this:

Orthogroup  F105    F109    F23 F79 HDV247  T415
OG0006155   1   0   0   0   0   0
OG0006156   1   0   0   0   0   0
OG0006157   1   0   0   0   0   0
OG0006158   1   0   0   0   0   0
OG0006551   0   1   0   0   0   0
OG0006552   0   1   0   0   0   0
OG0006553   0   1   0   0   0   0
OG0006557   0   0   1   0   0   0
OG0006558   0   0   1   0   0   0
OG0006559   0   0   1   0   0   0
OG0006560   0   0   1   0   0   0
OG0006841   0   0   0   1   0   0
OG0006842   0   0   0   1   0   0
OG0006843   0   0   0   1   0   0
OG0006844   0   0   0   1   0   0
OG0006990   0   0   0   0   1   0
OG0006991   0   0   0   0   1   0
OG0006992   0   0   0   0   1   0
OG0006993   0   0   0   0   1   0
OG0006994   0   0   0   0   1   0
OG0006995   0   0   0   0   1   0
OG0008562   0   0   0   0   0   1
OG0008563   0   0   0   0   0   1

So far I have been able to replace each column with 1 separately using the following code:

grep "F105" file1.txt | sed 's/F105|[0-9]*/1/g' > F105.genecount
grep "F109" file1.txt | sed 's/F109|[0-9]*/1/g' > F109.genecount

This assigns "1" to genes that are present. The problem is that I have to create multiple files, and when I concatenate them at the end, the "1" appears only in the second column rather than in each gene's own column. How can I get the desired output in a neat way? Please help.

awk sed grep • 1.6k views

1

3.7 years ago

Pierre Lindenbaum 166k

Something like this (not tested):

awk -F '\t' '(NR==1) {print;next} {printf("%s",$1); for(i=2;i<=NF;i++) {printf("\t%s",($i==""?"0":"1"));} printf("\n");}' input.tsv

0

@Pierre I was able to run this code, but the last column is wrong.

Orthogroup      F105    F109    F23     F79     HDV247  T415
OG0006153       1       0       0       0       0       1
OG0006154       1       0       0       0       0       1
OG0006155       1       0       0       0       0       1
OG0006156       1       0       0       0       0       1
OG0006157       1       0       0       0       0       1
OG0006158       1       0       0       0       0       1
OG0006159       1       0       0       0       0       1
OG0006160       1       0       0       0       0       1
OG0006161       1       0       0       0       0       1
OG0006162       1       0       0       0       0       1
OG0006163       1       0       0       0       0       1
OG0006164       1       0       0       0       0       1
OG0006165       1       0       0       0       0       1
OG0006166       1       0       0       0       0       1

It's printing 1 in the last column for every line.

ADD REPLY • link 3.7 years ago by AP ▴ 80
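
One possible cause, which is only a guess since the file itself is not shown: the file has Windows-style line endings, so the last field of every row holds a carriage return and never compares equal to the empty string. The sketch below assumes that, and also assumes the columns are tab-separated. It strips a trailing carriage return, takes the column count from the header line, and pads rows that are missing trailing tabs. The function name presence_matrix is made up for illustration.

```shell
# Hypothetical helper: turn a tab-separated orthogroup table into a 0/1 matrix.
# Assumes tab-separated columns and a header on the first line.
presence_matrix() {
  awk -F '\t' -v OFS='\t' '
    { sub(/\r$/, "") }                   # drop a trailing carriage return, if present
    NR == 1 { ncol = NF; print; next }   # header: remember the column count, print as is
    {
      # Column 1 is the orthogroup ID; every other column becomes 0 (empty) or 1.
      # Looping to ncol also fills in columns missing from short rows.
      for (i = 2; i <= ncol; i++) $i = ($i == "" ? 0 : 1)
      print
    }
  ' "$@"
}
```

It can be called as `presence_matrix file1.txt > file1.binary.tsv` (the output name is arbitrary), or fed from a pipe. If the last column still comes out as all 1s, the stray characters are probably something other than a carriage return, and inspecting a few lines with `od -c` should show what they are.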

score 3 · Accepted Answer · 2021-11-10

Perl-one-liner:

$ perl -lane '@a=split(/\t/, $_); for($i=1;$i<=6;$i++){ unless (/^Ortho/) { if($a[$i]=~/\w+/) {$a[$i]=1} else {$a[$i]=0} } }; print join "\t", @a;' < input.tsv 
Orthogroup  F105    F109    F23 F79 HDV247  T415
OG0006155   1   0   0   0   0   0
OG0006156   1   0   0   0   0   0
OG0006157   1   0   0   0   0   0
OG0006158   1   0   0   0   0   0
OG0006551   0   1   0   0   0   0
OG0006552   0   1   0   0   0   0
OG0006553   0   1   0   0   0   0
OG0006557   0   0   1   0   0   0
OG0006558   0   0   1   0   0   0
OG0006559   0   0   1   0   0   0
OG0006560   0   0   1   0   0   0
OG0006841   0   0   0   1   0   0
OG0006842   0   0   0   1   0   0
OG0006843   0   0   0   1   0   0
OG0006844   0   0   0   1   0   0
OG0006990   0   0   0   0   1   0
OG0006991   0   0   0   0   1   0
OG0006992   0   0   0   0   1   0
OG0006993   0   0   0   0   1   0
OG0006994   0   0   0   0   1   0
OG0006995   0   0   0   0   1   0
OG0008562   0   0   0   0   0   1
OG0008563   0   0   0   0   0   1
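
Given the last-column problem reported above, it may be worth checking the result before using it. A minimal sketch, assuming the output is tab-separated with a header line: it reports any row whose field count differs from the header, and any value after the first column that is not exactly 0 or 1 (a leftover carriage return would show up here). The name check_matrix is made up for illustration.

```shell
# Hypothetical helper: sanity-check a tab-separated 0/1 matrix.
# Prints a message per problem and returns non-zero if any were found.
check_matrix() {
  awk -F '\t' '
    NR == 1 { ncol = NF; next }          # header defines the expected column count
    NF != ncol {
      printf "line %d: %d fields, expected %d\n", NR, NF, ncol
      bad = 1
      next
    }
    {
      for (i = 2; i <= NF; i++)
        if ($i != "0" && $i != "1") {
          printf "line %d: unexpected value in column %d\n", NR, i
          bad = 1
        }
    }
    END { exit bad }                     # exit status 0 when nothing was flagged
  ' "$@"
}
```

For example, `check_matrix output.tsv && echo OK`, where output.tsv stands for wherever the one-liner's output was saved.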