Question

If a string appears twice in column 1, select the higher value in column 2

7.0 years ago

drkennetz ▴ 560

Hi all,

I subsampled Illumina fastqs twice using 2 different random seeds. I then mapped all of the reads and got a total number of mapped reads and a mapped percent.

For each sample, I want to keep the higher of the two map% values, because I have found that for the vast majority of my samples subsampling with seqtk underestimates the actual map%.

The format of my tsv file is as follows:

sample1      200000      120000      60%
sample1      200000      115000      57.5%
sample2      200000      180000      90%
sample2      200000      190000      95%
...
sampleX      200000      180000      90%
sampleX      200000      182000      91%

For each sample name in column 1, I want to keep the line with the higher value in column 3 or column 4 (it doesn't matter which one is compared). The expected output for the example above would be:

sample1      200000      120000      60%
sample2      200000      190000      95%
...
sampleX      200000      182000      91%

Looking forward to hearing your thoughts! Thanks

awk bash • 1.6k views

ADD COMMENT • link updated 6.9 years ago by zx8754 12k • written 7.0 years ago by drkennetz ▴ 560

7.0 years ago

cpad0112 21k

With datamash. I added an extra row (the last line of test.txt) to check that unsorted input is handled.

$ sed 's/%//g' test.txt | datamash  -sfg 1  max 4| cut --complement -f5 | sed 's/$/%/g'
sample1 200000  120000  90%
sample2 200000  190000  95%

$ cat test.txt 
sample1 200000  120000  60%
sample1 200000  115000  57.5%
sample2 200000  180000  90%
sample2 200000  190000  95%
sample1 200000  120000  90%
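
To try the commands in this thread, the five-line test file can be recreated as follows. This is only a minimal sketch: it assumes the columns are separated by single tabs (the question calls the file a tsv), and it overwrites any existing test.txt in the current directory.

```shell
# Recreate the five-line test.txt shown above.
# Assumption: columns are tab-separated, as "tsv" in the question suggests.
# printf reuses the format string for every group of four arguments.
printf '%s\t%s\t%s\t%s\n' \
    sample1 200000 120000 60% \
    sample1 200000 115000 57.5% \
    sample2 200000 180000 90% \
    sample2 200000 190000 95% \
    sample1 200000 120000 90% > test.txt

cat test.txt
```

The last row is the extra, out-of-order sample1 line, so the file is deliberately not grouped by sample.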

score 4 · Accepted Answer · 2018-07-18

7.0 years ago

ATpoint 88k

Remove the % from $4, sort on $4 in descending numeric order, then keep only one line per ID in $1; this gives the intended output. The second sort keeps the first line it encounters for each ID, which after the first sort is the one with the highest $4. The advantage over an if/else script that checks the lines after the current one for further occurrences of the same $1 is that the input does not need to be sorted in any way:

awk '{gsub("%",""); print}' test.txt | sort -k4,4nr | sort -s -u -k1,1 | awk 'BEGIN{OFS="\t"} {$4 = $4 "%"; print}'
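
For comparison, the same selection can also be sketched as a single awk pass that remembers the best line seen so far for each ID. This is not the pipeline above, just an alternative under a few assumptions: the input is tab-separated, column 4 always looks like a number followed by %, and the file name demo.tsv and the variable best are made up for this illustration.

```shell
# Hypothetical demo file with the same layout as test.txt (tab-separated).
printf '%s\t%s\t%s\t%s\n' \
    sample1 200000 120000 60% \
    sample1 200000 115000 57.5% \
    sample2 200000 180000 90% \
    sample2 200000 190000 95% \
    sample1 200000 120000 90% > demo.tsv

# For each ID in column 1, keep the line with the largest column 4.
# "$4 + 0" forces a numeric value: awk uses the leading number of
# "57.5%" and ignores the trailing "%".
# The order of "for (id in line)" is unspecified, hence the final sort.
best=$(awk -F'\t' '
    { pct = $4 + 0 }
    !($1 in max) || pct > max[$1] { max[$1] = pct; line[$1] = $0 }
    END { for (id in line) print line[id] }
' demo.tsv | sort -k1,1)

printf '%s\n' "$best"
```

On the five-line example this prints the 90% line for sample1 and the 95% line for sample2. Like the sort-based pipeline, it does not need the input to be grouped or sorted, and the original lines are printed unchanged, so the % never has to be stripped and re-added. On a tie, the first line seen for that ID is kept.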

worked like a charm! Thanks for the help.

ADD REPLY • link 7.0 years ago by drkennetz ▴ 560

You're very welcome!

ADD REPLY • link 7.0 years ago by ATpoint 88k