Hi,
I have a file (file1.txt):
AHR_si liver
AHR_si liver
AHR_si liver
AHR_si large_intestine
AHR_si liver
AHR_si large_intestine
AHR_si liver
AHR_si skin
AHR_si liver
AHR_si pancreas
then the file continues.......
I want to count how many times cervix appears for each value in column 1 (cervix is not shown above because that is only the output of head -n10). So:
grep cervix file1.txt | uniq -c
1 AIRE_f2 cervix
1 ARI3A_do cervix
1 FOSB_f1 cervix
1 FOXQ1_f1 cervix
1 HEN1_si cervix
1 HNF4G_f1 cervix
1 JUNB_f1 cervix
1 NFAC1_do cervix
1 NR2F6_f1 cervix
1 PTF1A_f1 cervix
1 ZN350_f1 cervix
The above is the complete output. As you can see, there is not a single occurrence of AHR_si with cervix, but I still want output like this:
0 AHR_si cervix
1 AIRE_f2 cervix
1 ARI3A_do cervix
1 FOSB_f1 cervix
1 FOXQ1_f1 cervix
1 HEN1_si cervix
1 HNF4G_f1 cervix
1 JUNB_f1 cervix
1 NFAC1_do cervix
1 NR2F6_f1 cervix
1 PTF1A_f1 cervix
1 ZN350_f1 cervix
There are many other column 1 values that never occur with cervix. I want those lines in my output as well, with a count of zero.
Thanks in advance,
Waqas.
OK, so what have you tried other than command line tools, which won't do what you want?
I tried only the command line.
Use R, in particular the dplyr example that Giovanni posted.
Never run uniq without a sort first, because uniq only collapses duplicate lines that are adjacent: grep cervix file1.txt | sort | uniq -c
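For what it's worth, the zero counts can also be produced on the command line with awk. This is only a minimal sketch, assuming the file is whitespace-separated with the name in column 1 and the tissue in column 2. The sample data and the count_tissue function name are made up for illustration and are not from the original post.

```shell
# Hypothetical sample input; the real file1.txt has many more rows.
sample=$(mktemp)
printf '%s\n' \
  'AHR_si liver' \
  'AHR_si skin' \
  'AIRE_f2 cervix' \
  'AIRE_f2 liver' \
  'FOSB_f1 cervix' \
  'FOSB_f1 cervix' > "$sample"

# count_tissue TISSUE FILE (hypothetical helper)
# Prints "count column1 tissue" for every column 1 value seen in FILE,
# including values that never occur with TISSUE (count 0).
count_tissue() {
  awk -v tissue="$1" '
    # The comparison yields 1 on a match and 0 otherwise, so every
    # column 1 value gets an entry even when it never matches.
    { n[$1] += ($2 == tissue) }
    END { for (k in n) print n[k], k, tissue }
  ' "$2" | LC_ALL=C sort -k2,2   # awk array order is unspecified, so sort by name
}

count_tissue cervix "$sample"
```

Because awk keeps one counter per column 1 value, there is no need for grep or uniq here, and duplicates do not have to be adjacent. On the real data it would be called as count_tissue cervix file1.txt.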