Curating a file: sort (largest to smallest), then extract unique values
7.7 years ago

I have a file with library, ID, and count columns, as follows:

*Searching cagcaccaccaagauucacau*
CC2_B ta_iwgsc_7bs_v1_3148968_39029 33  
CC2_B ta_iwgsc_7bs_v1_3150171_39041 38  
CC2_D ta_iwgsc_7ds_v1_3966917_41463 156 
CC3_B ta_iwgsc_7bs_v1_3148968_41273 56
CC2_A ta_iwgsc_6al_v1_5830987_31258 18  
CC2_B ta_iwgsc_6bl_v1_4279451_30909 18  
CC2_D ta_iwgsc_6dl_v1_3311975_32342 18  
CI2_A ta_iwgsc_6al_v1_5830987_27002 30  
CI2_B ta_iwgsc_6bl_v1_4279451_26849 30  
CI2_D ta_iwgsc_6dl_v1_3311975_28474 30  
*Found(s) in 6 file(s)*         

*Searching ugccuggcucccugaaugcca*
CC2_B ta_iwgsc_6bs_v1_1636307_32644 3275    
CC2_B ta_iwgsc_6bs_v1_1636307_32645 3575
CC3_B ta_iwgsc_6bs_v1_1636307_34610 3449    
CI1_B ta_iwgsc_6bs_v1_1636307_28706 3509            
CC2_A ta_iwgsc_7as_v1_4255214_39664 1809    
CC2_B ta_iwgsc_7bs_v1_3149865_39035 1809    
CC2_D ta_iwgsc_7ds_v1_3850348_38998 1809    
*Found(s) in 3 file(s)*

I want each library (CC1_A, CC2_B, etc.) to keep only its highest count; as you can see, the counts differ for the same library. Each library should be printed in the following format, separately for each block (paragraph):

Searching cagcaccaccaagauucacau:
CC2_B 38
CC2_D 156
CC3_B 56
CC2_A 18
CI1_A 30
CI1_B 30
CI1_D 30
Searching ugccuggcucccugaaugcca:
CC2_B 3575
CC3_B 3449
CI1_B 3509
CC2_A 1809
CC2_B 1809
CC2_D 1809
next-gen • 1.6k views

Please use the 101010 button to format code or file content!

7.7 years ago

With the help of rush (a GNU parallel-like tool written in Go that executes shell commands in parallel and supports Linux, OS X, and Windows). Love it so much...

$ cat d.txt | rush -d "\n" -D "file(s)*" -T b \
    'echo "{1}"; \
    echo "{}" | sed 1d | sed "$ d" | \
        sort -k 1,1 -k 3,3nr | sort -k 1,1 -u | cut -d " " -f 1,3;\
    echo '

*Searching cagcaccaccaagauucacau*            
CC2_A 18
CC2_B 38
CC2_D 156
CC3_B 56
CI2_A 30
CI2_B 30
CI2_D 30

*Searching ugccuggcucccugaaugcca*
CC2_A 1809
CC2_B 3575
CC2_D 1809
CC3_B 3449
CI1_B 3509

Limitation: it may fail when a block (the lines for one searched sequence) is very long, because the whole block is passed to echo as a command-line argument and the length of arguments is limited.
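One way to sidestep that limitation is to stream the file through awk, so that no block is ever passed as a command-line argument. The following is only a sketch under some assumptions: the function name `max_per_block` is made up for illustration, each block is assumed to start with a line containing "Searching" and end with a line containing "Found(s)", and data lines are assumed to have the count in the third whitespace-separated column. Libraries are printed in the order they first appear in each block, with a colon after the header as in the requested format.

```shell
#!/bin/sh
# max_per_block is a hypothetical helper name, not something from rush or
# from the original answer. It reads the file(s) given as arguments (or
# stdin) and prints, for every "Searching ..." block, each library once
# with its highest count.
# Usage (assuming the data is in d.txt): max_per_block d.txt
max_per_block() {
    awk '
        # Print the finished block in first-seen order, then reset state.
        function emit_block(   i) {
            for (i = 1; i <= n; i++) print order[i], max[order[i]]
            for (i = 1; i <= n; i++) delete max[order[i]]
            n = 0
        }
        # Header line: flush the previous block, then print a clean header.
        /Searching/ {
            emit_block()
            gsub(/^[* \t]+|[* \t]+$/, "")   # drop asterisks and padding
            print $0 ":"
            next
        }
        # Skip the "Found(s) in N file(s)" footer and blank or short lines.
        index($0, "Found(s)") || NF < 3 { next }
        # Data line: remember the largest count seen for this library.
        {
            if (!($1 in max)) { order[++n] = $1; max[$1] = $3 + 0 }
            else if ($3 + 0 > max[$1]) max[$1] = $3 + 0
        }
        END { emit_block() }
    ' "$@"
}
```

Because awk keeps only one number per library in memory and handles the input line by line, the size of a block should not matter here. If sorted output is preferred over first-seen order, the result of each block would need an extra sort step.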


Thanks a lot, it's working fine.


Since you write that this answer works, you should accept it, as it solves your question.
I have now marked this answer as accepted, but please keep this in mind next time people spend time helping you out.


I tried it and it works; it's good, I got my solution and people helped me. But I have also tried it my own way, since I should also learn how to solve a difficulty by myself, so I gave it a try. I think that's a positive attitude and does no harm to anyone. Thanks.


Fixing things on your own is great, but preferably you should try that before opening a question. And people who spend time helping you with a working solution also deserve recognition for that.

7.7 years ago


100% working without any flaws:

Are you sure you aren't a bit overconfident? How can you claim there are no flaws, have you tested all possible use cases?


It's working fine for my case :) No offence about the confidence.

7.7 years ago

reference: how to remove rows based on certain characters

$ cat data.tsv 
CC2_B   ta_iwgsc_7bs_v1_3148968_39029   33
CC2_B   ta_iwgsc_7bs_v1_3150171_39041   38
CC2_D   ta_iwgsc_7ds_v1_3966917_41463   156
CC3_B   ta_iwgsc_7bs_v1_3148968_41273   56
CC2_A   ta_iwgsc_6al_v1_5830987_31258   18
CC2_B   ta_iwgsc_6bl_v1_4279451_30909   18
CC2_D   ta_iwgsc_6dl_v1_3311975_32342   18
CI2_A   ta_iwgsc_6al_v1_5830987_27002   30
CI2_B   ta_iwgsc_6bl_v1_4279451_26849   30
CI2_D   ta_iwgsc_6dl_v1_3311975_28474   30

$ cat data.tsv | sort -t $'\t' -k 1,1 -k 3,3nr | sort -t $'\t' -k 1,1 -u | cut -f 1,3
CC2_A   18
CC2_B   38
CC2_D   156
CC3_B   56
CI2_A   30
CI2_B   30
CI2_D   30


But I need separate results for each block, not for my file as a whole.

I want each library (CC1_A, CC2_B, etc.) to keep only its highest count; as you can see, the counts differ for the same library. Each library should be printed in the following format, separately for each block (paragraph):


Oh no, a little bug. Corrected now.


I want separate results, delimited by the "Searching cagcaccaccaagauucacau" lines, for the single file data.tsv:

Searching cagcaccaccaagauucacau:
CC2_B 38
CC2_D 156
CC3_B 56
CC2_A 18
CI1_A 30
CI1_B 30
CI1_D 30
Searching ugccuggcucccugaaugcca:
CC2_B 3575
CC3_B 3449
CI1_B 3509
CC2_A 1809
CC2_B 1809
CC2_D 1809

You have to write the script yourself. It's not hard.


The highest count is not considered.
