Question

Sort and extract common entries from a txt file.

0

Entering edit mode

4.6 years ago

Anisur Rahman ▴ 80

I have a tab-delimited txt file that contains my data with many columns and lines. Here, to explain my problem I'm using a dummy file. This file has some columns. I want to perform the following action serially;

Sort data file based on the value of columns A and B. This step may generate two sorted copies of my original data but, these files are not mandatory.
Then I want to extract the top five data from each sorted copy and generated two files (as for example; TopA, TopB).
You may notice, these files contain some common numbers in the first (Pos.) column. Here, in the example number 302, 941 and 699 are common in Pos. column of all files (TopA, TopB). Thus, my target is to extract only those data which contain a common number in Pos. column in all files and save them in result.txt file.

Would anyone please help me with a bash/perl/python code to get this result? Thanks in advance.

Datafile

Pos.    DNA %GC A   B   C
644 CGGAGGU 52.6    0.876   76.2    102.3
302 GGUACGG 31.6    0.883   83.6    100.9
1067    GCUUAGU 42.1    0.873   76.6    99.7
1191    GGAGCUG 42.1    0.872   75.3    99.3
105 GACACUG 52.6    0.84    68.1    98.6
941 CCGCAAU 42.1    0.879   76.8    98.2
961 GCGUUUG 36.8    0.861   78  98.2
699 CGACGAA 36.8    0.875   84.7    98.1
663 GGAUAUC 47.4    0.867   77.5    97.1
566 GCUUCGA 52.6    0.802   62.6    96.7

TopA

Pos.    DNA %GC A   B
302 GGUACGG 31.6    0.883   83.6
941 CCGCAAU 42.1    0.879   76.8
644 CGGAGGU 52.6    0.876   76.2
699 CGACGAA 36.8    0.875   84.7
1067    GCUUAGU 42.1    0.873   76.6

TopB

Pos.    DNA %GC A   B
699 CGACGAA 36.8    0.875   84.7
302 GGUACGG 31.6    0.883   83.6
961 GCGUUUG 36.8    0.861   78
663 GGAUAUC 47.4    0.867   77.5
941 CCGCAAU 42.1    0.879   76.8

Result

Pos.    DNA %GC A   B
302 GGUACGG 31.6    0.883   83.6
941 CCGCAAU 42.1    0.879   76.8
699 CGACGAA 36.8    0.875   84.7

unix perl python • 1.2k views

ADD COMMENT • link updated 4.6 years ago by Kevin Blighe 88k • written 4.6 years ago by Anisur Rahman ▴ 80

0

Entering edit mode

I think this should do it (untested):

I. Split the data (remove the header line first, or alternatively add a | tail -n +2 after the cut commands)

Put the DNA, %GC content aside for later

cut -f 1,2,3 filename | sort -k1,1g > data.txt

Split off column A

cut -f 1,4 filename | sort -k2,2gr | head -n 5 | sort -k1,1g > colA.txt

Split off column B

cut -f 1,5 filename | sort -k2,2gr | head -n 5 | sort -k1,1g > colB.txt

II. Merge them back together

Add the header again

head -n 1 filename.txt > result.txt

Merge the DNA/GC% columns back in

join -1 1 -2 1 -o 1.1,1.2,1.3,2.2,2.3 data.txt <(join -1 1 -2 2 colA.txt colB.txt | tr " " "\t") | tr " " "\t" >> result.txt

edit: formatting issues

ADD REPLY • link 4.6 years ago by cschu181 ★ 2.8k

0

Entering edit mode

After running the final code, I am getting the following error

$ join -1 1 -2 1 -o 1.1,1.2,1.3,2.2,2.3 data.txt <(join -1 1 -2 2 colA.txt colB.txt | tr " " "\t") | tr " " "\t" >> result.txt
join: colB.txt:2: is not sorted: 663    77.5
join: colA.txt:5: is not sorted: 1067   0.873

ADD REPLY • link 4.6 years ago by Anisur Rahman ▴ 80

0

Entering edit mode

Argh. That should be

join -1 1 -2 1 -o 1.1,1.2,1.3,2.2,2.3 data.txt <(join -1 1 -2 1 colA.txt colB.txt | tr " " "\t") | tr " " "\t" >> result.txt

(Note the second join now uses both 1st fields (Pos) instead of the first of file A and the second of file B.)

ADD REPLY • link 4.6 years ago by cschu181 ★ 2.8k

score 0 · Answer 1 · 2020-04-11

Perhaps this will work. You may have to change certain parameters to suit the shell and/or system that you are using. i am using bash on Ubuntu 16.04.

cat datafile.tsv
Pos.    DNA %GC A   B   C
644 CGGAGGU 52.6    0.876   76.2    102.3
302 GGUACGG 31.6    0.883   83.6    100.9
1067    GCUUAGU 42.1    0.873   76.6    99.7
1191    GGAGCUG 42.1    0.872   75.3    99.3
105 GACACUG 52.6    0.84    68.1    98.6
941 CCGCAAU 42.1    0.879   76.8    98.2
961 GCGUUUG 36.8    0.861   78  98.2
699 CGACGAA 36.8    0.875   84.7    98.1
663 GGAUAUC 47.4    0.867   77.5    97.1
566 GCUUCGA 52.6    0.802   62.6    96.7

Sort by column A; take top 5

cat \
  <(head -1 datafile.tsv) \
  <(sed '1d' datafile.tsv | sort -k4 -n -r | head -5) > TopA.tsv
cat TopA.tsv

Pos.    DNA %GC A   B   C
302 GGUACGG 31.6    0.883   83.6    100.9
941 CCGCAAU 42.1    0.879   76.8    98.2
644 CGGAGGU 52.6    0.876   76.2    102.3
699 CGACGAA 36.8    0.875   84.7    98.1
1067    GCUUAGU 42.1    0.873   76.6    99.7

Sort by column B; take top 5

cat \
  <(head -1 datafile.tsv) \
  <(sed '1d' datafile.tsv | sort -k5 -n -r | head -5)> TopB.tsv ;
cat TopB.tsv ;

Pos.    DNA %GC A   B   C
699 CGACGAA 36.8    0.875   84.7    98.1
302 GGUACGG 31.6    0.883   83.6    100.9
961 GCGUUUG 36.8    0.861   78  98.2
663 GGAUAUC 47.4    0.867   77.5    97.1
941 CCGCAAU 42.1    0.879   76.8    98.2

Output common lines by 'Pos.' column:

awk -F "\t" 'FNR==NR {a[$1]=$0; next} {if ($1 in a) {print a[$1]}}' TopA.tsv TopB.tsv ;

Pos.    DNA %GC A   B   C
699 CGACGAA 36.8    0.875   84.7    98.1
302 GGUACGG 31.6    0.883   83.6    100.9
941 CCGCAAU 42.1    0.879   76.8    98.2

...or, to deal with the header and ensure that it's printed as the first line:

awk -F "\t" 'FNR==NR {if (NR==1) {print} else {a[$1]=$0}; next} {if ($1 in a) {print a[$1]}}' TopA.tsv TopB.tsv ;

Pos.    DNA %GC A   B   C
699 CGACGAA 36.8    0.875   84.7    98.1
302 GGUACGG 31.6    0.883   83.6    100.9
941 CCGCAAU 42.1    0.879   76.8    98.2

Kevin