extract lines with multiple conditions on multiple fields/column
2
0
Entering edit mode
5.2 years ago
Sam ▴ 20

I have a big file contains a huge amount of SNP data, I like to search multiple columns ( here column 3 to 5) and filter any lines contain values between 0.5 to 1.

    SNP NN1 NN2 NN3
1   AG  0.0 0.0 1.0
2   CC  0.2 0.0 2.0
3   CC  0.3 0.5 0.0
4   AA  0.5 0.5 1.0
5   GG  0.4 0.0 0.0

expected output;

    SNP NN1 NN2 NN3
1   AG  0.0 0.0 1.0
3   CC  0.3 0.5 0.0
4   AA  0.5 0.5 1.0

I could find awk scripts that use a single field but not multiple field/column for query. ex; awk -F "," '$3 >= 0.5 && $3 <= 1' input.csv > out.csv

Any help will be appreciated. Thanks a lot.

AWK grep sed • 8.9k views
ADD COMMENT
1
Entering edit mode

How to add images to a Biostars post

Edit by ATpoint : You have to use the image button (right of 10101 in the format bar), pasting the full path/link to the picture including the suffix so https://i.ibb.co/92CjrYk/bst.png instead of the link you originally posted.

ADD REPLY
0
Entering edit mode

Thanks, Arup for formatting the data. Also it would be helpful to know how to paste excel data on Biostar effectively.

not sure why the data looks messy though it looks good on the preview

ADD REPLY
0
Entering edit mode

You can share data snippets in Github and link that to biostars thread.

ADD REPLY
1
Entering edit mode

Grouping the conditions and adding the "or" operator || should do the trick

awk -F "," '($3 >= 0.5 && $3 <= 1) || ($4 >= 0.5 && $4 <= 1) || ($5 >= 0.5 && $5 <= 1)' input.csv > out.csv

ADD REPLY
0
Entering edit mode

Thanks, Sus... But I have > 100 columns, is that possible to search a range of columns as a condition. for example. choose between $3 to $100

something like this ... awk -F "," '($3:$100 >= 0.5 && $3:$100 <= 1) input.csv > out.csv
ADD REPLY
1
Entering edit mode

I suggest you embed this in a for loop within awk with an option to exit the look (= go to next line) if any of the conditions is not fulfilled, basically iterating from the 3rd to the nth column.

ADD REPLY
0
Entering edit mode

ADD REPLY
7
Entering edit mode
5.2 years ago

The following awk command will loop through all the columns.

awk -F"," 'NR==1; NR > 1{for(i=2; i <= NF; i++) if($i >= 0.5 && $i <=1) {print; next}}' data.csv

Input: data.csv

SNP,NN1,NN2,NN3
AG,0.0,0.0,1.0
CC,0.2,0.0,2.0
CC,0.3,0.5,0.0
AA,0.5,0.5,1.0
GG,0.4,0.0,0.0

Output:

SNP,NN1,NN2,NN3
AG,0.0,0.0,1.0
CC,0.3,0.5,0.0
AA,0.5,0.5,1.0
ADD COMMENT

Login before adding your answer.

Traffic: 1900 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6