Get the top X number of lines per unique value in one column, once you've sorted a text file using 'sort'
3.9 years ago

Hi everybody,

I have a tab-separated text file with 19 columns, which I have sorted using a command such as:

sort -t$'\t' -k1,1 -k11,11g -k12,12gr -k3,3g file > file_sorted

Now I would like to keep the top X number of lines per unique value in column 1. I know that if I do:

sort -u -k1,1 --merge file_sorted > file_sorted_merged

I will keep only the 1st line for each unique value in column 1. How can I keep the top X (for example, the top 5) lines for the same value in column 1 from the sorted file?

Thanks a lot in advance

bash shell sort • 1.3k views

EDIT: You should edit your question and add how this is related to bioinformatics, or the post might be closed as off-topic.

Either switch to something with more in-memory state, like R or python, or use sub-shells. The sub-shell will pick X unique values per column and then you can use awk to pick N matches per input line from the sub-shell.

There would be an awful lot of trial and error and column-specific wrangling if you use awk, so I'd recommend using R.

3.9 years ago

I have a tool, csvtk, whose uniq command can do exactly what you want; check the last example.

csvtk uniq -t -f 1 -n 5

The logic behind it is simple: use a map/hash table (column value -> count) to track how many times you have seen a row with a certain value in the column you care about. If the count is <= N, print the line.
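
For anyone without csvtk installed, the same counting idea can be sketched in plain awk. This is only a minimal illustration of the hash-table logic described above, not csvtk's actual implementation; the function name top_n_per_key is made up for this example, and it assumes a tab-separated file that is already sorted the way you want, with the key in column 1.

```shell
# top_n_per_key N FILE
# Print at most N lines for each distinct value in column 1 of a
# tab-separated FILE, keeping the lines in their input order.
# NOTE: top_n_per_key is a hypothetical helper name for this sketch.
top_n_per_key() {
    # seen[$1] counts how often each column-1 value has appeared so far;
    # awk prints a line whenever the pattern is true, i.e. while count <= n.
    awk -F'\t' -v n="$1" '++seen[$1] <= n' "$2"
}
```

With the sorted file from the question, usage would look like: top_n_per_key 5 file_sorted > file_top5. Because awk only keeps one counter per unique key in memory, this should scale to large files as long as the number of distinct column-1 values is reasonable.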


Cool! It does exactly what I was looking for! Thanks a lot.


I've moved shenwei's comment to an answer. Please accept it so the post is marked as solved.
