I have a tab-delimited text file containing my data, with many columns and rows. To explain my problem I'm using a small dummy file here. I want to perform the following steps in order:
Sort the data file by column A and, separately, by column B (largest values first, as in the example below). This step may produce two sorted copies of my original data, but those intermediate files are not required.
Then extract the top five rows from each sorted copy and write them to two files (for example, TopA and TopB).
You may notice that these files share some numbers in the first (Pos.) column. In the example, 302, 941 and 699 appear in the Pos. column of both TopA and TopB. My goal is to extract only the rows whose Pos. value is common to all of these files and save them in a result.txt file.
Could anyone please help me with bash/perl/python code to get this result? Thanks in advance.
Datafile
Pos. DNA %GC A B C
644 CGGAGGU 52.6 0.876 76.2 102.3
302 GGUACGG 31.6 0.883 83.6 100.9
1067 GCUUAGU 42.1 0.873 76.6 99.7
1191 GGAGCUG 42.1 0.872 75.3 99.3
105 GACACUG 52.6 0.84 68.1 98.6
941 CCGCAAU 42.1 0.879 76.8 98.2
961 GCGUUUG 36.8 0.861 78 98.2
699 CGACGAA 36.8 0.875 84.7 98.1
663 GGAUAUC 47.4 0.867 77.5 97.1
566 GCUUCGA 52.6 0.802 62.6 96.7
TopA
Pos. DNA %GC A B
302 GGUACGG 31.6 0.883 83.6
941 CCGCAAU 42.1 0.879 76.8
644 CGGAGGU 52.6 0.876 76.2
699 CGACGAA 36.8 0.875 84.7
1067 GCUUAGU 42.1 0.873 76.6
TopB
Pos. DNA %GC A B
699 CGACGAA 36.8 0.875 84.7
302 GGUACGG 31.6 0.883 83.6
961 GCGUUUG 36.8 0.861 78
663 GGAUAUC 47.4 0.867 77.5
941 CCGCAAU 42.1 0.879 76.8
Result
Pos. DNA %GC A B
302 GGUACGG 31.6 0.883 83.6
941 CCGCAAU 42.1 0.879 76.8
699 CGACGAA 36.8 0.875 84.7
I think this should do it (untested):
I. Split the data (remove the header line first, or alternatively add a | tail -n +2 after the cut commands)
Put the DNA and %GC columns aside for later
Split off column A
Split off column B
II. Merge them back together
Add the header again
Merge the DNA/GC% columns back in
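Since the question also accepts Python, the same steps can be sketched there. This is only a minimal sketch and not the shell commands of this answer: the function names (top_n, common_top_rows, write_tsv) are made up for illustration, the data is assumed to be tab-delimited with a header row, and "top" is assumed to mean the largest values, as in the TopA/TopB examples.

```python
import csv
import io
import sys

# Dummy data from the question; converted to tab-delimited text below.
SAMPLE_ROWS = """\
Pos. DNA %GC A B C
644 CGGAGGU 52.6 0.876 76.2 102.3
302 GGUACGG 31.6 0.883 83.6 100.9
1067 GCUUAGU 42.1 0.873 76.6 99.7
1191 GGAGCUG 42.1 0.872 75.3 99.3
105 GACACUG 52.6 0.84 68.1 98.6
941 CCGCAAU 42.1 0.879 76.8 98.2
961 GCGUUUG 36.8 0.861 78 98.2
699 CGACGAA 36.8 0.875 84.7 98.1
663 GGAUAUC 47.4 0.867 77.5 97.1
566 GCUUCGA 52.6 0.802 62.6 96.7
"""
SAMPLE_TSV = "\n".join("\t".join(line.split()) for line in SAMPLE_ROWS.splitlines())


def top_n(rows, column, n=5):
    """Return the n rows with the largest numeric value in `column`."""
    return sorted(rows, key=lambda r: float(r[column]), reverse=True)[:n]


def common_top_rows(tsv_text, columns=("A", "B"), n=5, key="Pos."):
    """Rows whose `key` value is in the top n of every column in `columns`."""
    rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))
    tops = [top_n(rows, column, n) for column in columns]
    common = set.intersection(*({r[key] for r in top} for top in tops))
    # Keep the order of the first top list (TopA in the example).
    return [r for r in tops[0] if r[key] in common]


def write_tsv(rows, columns, handle):
    """Write only the chosen columns as tab-delimited text with a header."""
    writer = csv.DictWriter(handle, fieldnames=columns, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


result = common_top_rows(SAMPLE_TSV)
write_tsv(result, ["Pos.", "DNA", "%GC", "A", "B"], sys.stdout)
```

For a real file, the text could be read with open(path).read() and the output handle replaced by a file opened for writing (for example result.txt). No intermediate sorted copies are written, which matches the note that those files are not mandatory.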
edit: formatting issues
After running the final code, I am getting the following error
Argh. That should be
(Note that the second join now uses the first field (Pos.) of both files, instead of the first field of file A and the second field of file B.)
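To illustrate why the choice of join field matters, here is a small Python sketch of an inner join where the key column of each side is a parameter. It is not the join command itself: join_on_key is a hypothetical helper, and the two lists hold only the Pos. column plus one value column from TopA and TopB in the question.

```python
def join_on_key(left, right, left_key=0, right_key=0):
    """Inner-join two lists of rows on the given key column indexes."""
    index = {row[right_key]: row for row in right}
    joined = []
    for row in left:
        match = index.get(row[left_key])
        if match is not None:
            # Append the right-hand row without repeating its key column.
            rest = [v for i, v in enumerate(match) if i != right_key]
            joined.append(list(row) + rest)
    return joined


# (Pos., A) pairs from TopA and (Pos., B) pairs from TopB.
top_a = [("302", "0.883"), ("941", "0.879"), ("644", "0.876"),
         ("699", "0.875"), ("1067", "0.873")]
top_b = [("699", "84.7"), ("302", "83.6"), ("961", "78"),
         ("663", "77.5"), ("941", "76.8")]

# Keying both sides on the first field (Pos.) finds the shared rows.
both_first = join_on_key(top_a, top_b, left_key=0, right_key=0)

# Keying the right side on its second field compares Pos. against B values,
# so nothing matches.
mismatched = join_on_key(top_a, top_b, left_key=0, right_key=1)

print(both_first)
print(mismatched)
```

Unlike the join command, which expects both inputs to be sorted on the join field, this dictionary lookup works on rows in any order.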