Hi everybody,
I have a tab-separated text file with 19 columns, which I have sorted with a command such as:
sort -t$'\t' -k1,1 -k11,11g -k12,12gr -k3,3g file > file_sorted
Now I would like to keep the top X lines for each unique value in column 1. I know that if I do:
sort -u -k1,1 --merge file_sorted > file_sorted_merged
I will keep only the 1st line for each unique value in column 1. How can I keep the top X (for example, the top 5) lines for the same value in column 1 from the sorted file?
Thanks a lot in advance
EDIT: You should edit your question and add how this is related to bioinformatics, or the post might be closed as off-topic.
Either switch to something that keeps more state in memory, like R or Python, or use sub-shells. The sub-shell would list the unique values in column 1, and you can then use awk to pick the top N matching lines for each value the sub-shell emits.
There would be an awful lot of trial and error and column-specific wrangling if you use awk, so I'd recommend using R.
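That said, since file_sorted is already ordered so that the preferred lines come first within each column-1 value, a single per-key counter in awk may be enough. A minimal sketch, assuming tab-separated input; the three-column sample data, the variable X and the output name file_sorted_top are made up for illustration:

```shell
# Hypothetical demo input: 3 tab-separated columns instead of 19,
# already sorted so the best lines for each column-1 value come first.
printf 'geneA\t1\tx\ngeneA\t2\tx\ngeneA\t3\tx\ngeneB\t1\tx\n' > file_sorted

# Keep at most X lines per distinct value in column 1.
# seen[$1] counts how many lines with this key have been read so far;
# a line is printed only while that count is still <= X.
X=2
awk -F'\t' -v n="$X" '++seen[$1] <= n' file_sorted > file_sorted_top
```

For the real file, drop the printf line and set X=5. The counter is keyed on column 1 only, so the result depends entirely on the order the earlier sort produced.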