removing rows if X percentages of columns / fields are empty
2
0
Entering edit mode
4.6 years ago
waqasnayab ▴ 250

Hi,

I have a tabular text file with 105078 rows and 22 columns I want to remove all those lines if 50 percent of columns (columns 3 - 22) has empty fields.

I tried ISBLANK and IF function in excel, no luck. Could awk or sed help me?

Regards,

Waqas.

Edited: tags and Spelling mistake

snp next-gen awk sed • 1.0k views
ADD COMMENT
0
Entering edit mode
awk -F "\t" '{ z=0; for( i=3; i<NF; i++ ) { if($i=="") {z++} } if( z / (NF-2) < 0.5 ) {print $0} }'  test.csv  file

From https://unix.stackexchange.com/questions/363059/awk-script-to-count-the-0-in-each-row-and-print-the-rows-that-have-less-than-20

ADD REPLY
0
Entering edit mode
4.6 years ago

not tested:

awk '{W=0.0;for(i=3;i<=NF;i++) {if($i=="") W++;}  if(W/(NF-3) <0.5) print; }'
ADD COMMENT
0
Entering edit mode
4.6 years ago
waqasnayab ▴ 250

Thanks Pierre,

Yeah, I run the following command:

awk '{W=0.0;for(i=3;i<=NF;i++) {if($i=="") W++;}  if(W/(NF-3) <0.5) print; }' all_sample_nomismatch.feature

and got this error:

awk: cmd. line:1: (FILENAME=all_sample_nomismatch.feature FNR=1) fatal: division by zero attempted

My input file is without any headers, like this:

human_miRNA.mis_0+  hsa-miR-1237-5p 4   5   6   4   1   2   75  8   45  5   6                   45  12  1   25
human_miRNA.mis_0+  hsa-miR-1246-5p                                                                     1   
human_miRNA.mis_0+  hsa-miR-1249-5p                                                                     1   
human_miRNA.mis_0+  hsa-miR-1273e-3p        1                                                               
human_miRNA.mis_0+  hsa-miR-1285-1-5p   1                                                                   
human_miRNA.mis_0+  hsa-miR-151a-3p                     1                                                   
human_miRNA.mis_0+  hsa-miR-1908-5p                                                     2                   
human_miRNA.mis_0+  hsa-miR-191-5p      3       3                       6               1               14  4
human_miRNA.mis_0+  hsa-miR-1973-3p     1

The output should be the first line as it has 50 percent columns filled between columns 3 - 22. Hope it might help you further. Will update if I got luck.

Regards,

Waqas.

ADD COMMENT
0
Entering edit mode

This worked for me:

awk '(NF==12){print}' all_sample_nomismatch.feature

Now, I have all those columns (including columns 1 and 2 must, and then 3 to 22) where 3 to 22 fields are 50 percent filled. Hope I did correctly.

Regards,

Waqas.

ADD REPLY
0
Entering edit mode

change W/(NF-3) to W/(NF-2)

ADD REPLY

Login before adding your answer.

Traffic: 2288 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6