Question

removing rows if X percentages of columns / fields are empty

0

Entering edit mode

4.6 years ago

waqasnayab ▴ 250

Hi,

I have a tabular text file with 105078 rows and 22 columns I want to remove all those lines if 50 percent of columns (columns 3 - 22) has empty fields.

I tried ISBLANK and IF function in excel, no luck. Could awk or sed help me?

Regards,

Waqas.

Edited: tags and Spelling mistake

snp next-gen awk sed • 1.0k views

ADD COMMENT • link updated 4.6 years ago by cpad0112 21k • written 4.6 years ago by waqasnayab ▴ 250

0

Entering edit mode

awk -F "\t" '{ z=0; for( i=3; i<NF; i++ ) { if($i=="") {z++} } if( z / (NF-2) < 0.5 ) {print $0} }'  test.csv  file

From https://unix.stackexchange.com/questions/363059/awk-script-to-count-the-0-in-each-row-and-print-the-rows-that-have-less-than-20

ADD REPLY • link 4.6 years ago by cpad0112 21k

score 0 · Answer 1 · 2020-05-12

0

Entering edit mode

4.6 years ago

Pierre Lindenbaum 164k

not tested:

awk '{W=0.0;for(i=3;i<=NF;i++) {if($i=="") W++;}  if(W/(NF-3) <0.5) print; }'

ADD COMMENT • link 4.6 years ago by Pierre Lindenbaum 164k

score 0 · Answer 2 · 2020-05-13

Thanks Pierre,

Yeah, I run the following command:

awk '{W=0.0;for(i=3;i<=NF;i++) {if($i=="") W++;}  if(W/(NF-3) <0.5) print; }' all_sample_nomismatch.feature

and got this error:

awk: cmd. line:1: (FILENAME=all_sample_nomismatch.feature FNR=1) fatal: division by zero attempted

My input file is without any headers, like this:

human_miRNA.mis_0+  hsa-miR-1237-5p 4   5   6   4   1   2   75  8   45  5   6                   45  12  1   25
human_miRNA.mis_0+  hsa-miR-1246-5p                                                                     1   
human_miRNA.mis_0+  hsa-miR-1249-5p                                                                     1   
human_miRNA.mis_0+  hsa-miR-1273e-3p        1                                                               
human_miRNA.mis_0+  hsa-miR-1285-1-5p   1                                                                   
human_miRNA.mis_0+  hsa-miR-151a-3p                     1                                                   
human_miRNA.mis_0+  hsa-miR-1908-5p                                                     2                   
human_miRNA.mis_0+  hsa-miR-191-5p      3       3                       6               1               14  4
human_miRNA.mis_0+  hsa-miR-1973-3p     1

The output should be the first line as it has 50 percent columns filled between columns 3 - 22. Hope it might help you further. Will update if I got luck.

Regards,

Waqas.