Question

Sorting out CD HiT output

0

Entering edit mode

5.8 years ago

gbl1 ▴ 80

Hi,

I have a CD-hit output looking like this:

## 73 MIXED Sb-40 4 6 66.66666666666667 cluster_stats(length_max=319, length_min=205, length_mean=296.6666666666667, length_variance=2042.2666666666667, length_stdev=45.19144461805428, length_members_max=317, length_members_min=205, length_members_mean=292.2, length_members_variance=2403.2000000000003, length_members_stdev=49.0224438395313, ident_perc_max=97.48, ident_perc_min=85.85, ident_perc_mean=93.986, ident_perc_variance=22.420530000000017, ident_perc_stdev=4.735032206859845, counter=Counter({'Sb-40': 4, 'Sj-A': 2}))

73 0 319 Sj-A_M02764:115:000000000-C3GKK:1:1118:4143:8248 1

73 1 317 Sj-A_M02764:115:000000000-C3GKK:1:2107:8743:9281 1 317 5 319 + 96.85 0

73 3 317 Sb-40_M02764:115:000000000-C3GKK:1:2104:16139:22698 1 317 5 319 + 97.48 0

73 5 317 Sb-40_M02764:115:000000000-C3GKK:1:2115:7096:7098 1 317 5 319 + 94.01 0

73 2 305 Sb-40_M02764:115:000000000-C3GKK:1:1113:14798:13772 1 305 1 319 + 95.74 0

73 4 205 Sb-40_M02764:115:000000000-C3GKK:1:2106:18903:18118 1 205 1 217 + 85.85 0

SJ-A and Sb-40 are two samples, and I want to know how many fragment of each sample I have in each cluster. For exemple for this "73" cluster

# Sj-A Sb-40

73 2 4

How to proceed?

Thanks in advance

cd-hit • 2.7k views

ADD COMMENT • link updated 5.8 years ago by ahaswer ▴ 150 • written 5.8 years ago by gbl1 ▴ 80

score 0 · Answer 1 · 2019-01-31

0

Entering edit mode

5.8 years ago

5heikki 11k

cd-hit ships with many helper scripts, e.g. clusters2txt (or something like that) turns a cluster file into a far more parsing friendly format..

ADD COMMENT • link 5.8 years ago by 5heikki 11k

score 0 · Answer 2 · 2019-01-31

0

Entering edit mode

5.8 years ago

ahaswer ▴ 150

Assuming that the first line of file is the comment line you can simply use awk command:

awk -F'[ _]' 'NF>0 && NR>1 {print $1" "$4}' cdhit.txt | sort | uniq -c | awk '{print $2, $3, $1}' > counts.txt

I'm not sure if your file contains any blank lines. Running the command you will obtain file with structure like this (with counts in the last column):

73 Sb-40 4
73 Sj-A 2

Edit2: assuming multiple comment-like lines; remove multiple whitespaces.

tr -s " " < cdhit.txt | sed '/^#/ d' | awk -F'[ _]' 'NF>0 {print $1" "$4}' | sort | uniq -c | awk '{print $2, $3, $1}' > counts.txt

ADD COMMENT • link 5.8 years ago by ahaswer ▴ 150

0

Entering edit mode

Hmmm… I tried and:

## Sb-40 290

## Sj-A 60

0 0 1

0 1 1

0 10 1

0 100 1

0 1000 1

0 1001 1

0 1002 1

0 1003 1

0 1004 1

0 1005 1

etc…

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

0

Entering edit mode

Can you provide the file in the code window? I'm not sure about your field separators (specify them). It seems like single whitespaces.

ADD REPLY • link 5.8 years ago by ahaswer ▴ 150

0

Entering edit mode

input file:

## 0 MIXED Sj-A 4267 7229 59.026144694978555 cluster_stats(length_max=587, length_min=63, length_mean=100.38373218979112, length_variance=962.8900839658992, length_stdev=31.030470250479596, length_members_max=582, length_members_min=63, length_members_mean=100.31640841173216, length_members_variance=930.2534072177071, length_members_stdev=30.50005585597684, ident_perc_max=98.96, ident_perc_min=80.42, ident_perc_mean=94.99588682899834, ident_perc_variance=5.127216634208676, ident_perc_stdev=2.264335804205877, counter=Counter({'Sj-A': 4267, 'Sb-40': 2962}))
0   5052    587 Sb-40_M02764:115:000000000-C3GKK:1:1109:23652:14682                         1
0   6391    582 Sb-40_M02764:115:000000000-C3GKK:1:2110:25762:12872 1   582 1   587 +   80.76   0
0   270 290 Sj-A_M02764:115:000000000-C3GKK:1:1102:19199:23933  1   290 303 587 +   82.76   0
0   5426    290 Sb-40_M02764:115:000000000-C3GKK:1:1113:15242:21658 1   290 303 587 +   82.41   0
0   5533    290 Sb-40_M02764:115:000000000-C3GKK:1:1117:26788:14121 1   290 303 587 +   83.45   0
0   5564    290 Sb-40_M02764:115:000000000-C3GKK:1:1117:13232:21408 1   290 303 587 +   82.76   0
0   1453    288 Sj-A_M02764:115:000000000-C3GKK:1:1113:3739:15200   1   288 303 587 +   85.76   0
0   2839    288 Sj-A_M02764:115:000000000-C3GKK:1:2106:9831:10273   1   288 303 587 +   86.11   0
0   2895    288 Sj-A_M02764:115:000000000-C3GKK:1:2110:13364:5014   1   288 303 587 +   84.72   0
0   4750    288 Sb-40_M02764:115:000000000-C3GKK:1:1106:28205:8430  1   288 303 587 +   86.11   0
0   6472    288 Sb-40_M02764:115:000000000-C3GKK:1:2111:8362:20071  1   288 303 587 +   85.76   0
0   6938    288 Sb-40_M02764:115:000000000-C3GKK:1:2117:7499:19237  1   288 303 587 +   85.07   0
0   7149    288 Sb-40_M02764:115:000000000-C3GKK:1:2117:8321:6441   1   288 303 587 +   80.9    0
0   82  287 Sj-A_M02764:115:000000000-C3GKK:1:1101:17157:13826  1   287 303 587 +   84.67   0
0   290 287 Sj-A_M02764:115:000000000-C3GKK:1:1103:7587:4546    1   287 303 587 +   86.76   0    
0   740 287 Sj-A_M02764:115:000000000-C3GKK:1:1106:27250:11889  1   287 302 587 +   82.93   0

Output:

## Sb-40 290
## Sj-A 60
0 0 1
0 1 1
0 10 1
0 100 1
0 1000 1
0 1001 1
0 1002 1

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

0

Entering edit mode

Try this:

awk 'BEGIN{FS=" "}{if(/^[1-9]/){print $1,$4}}' file.txt \
    | awk 'BEGIN{FS="_"}{print $1}' \
    | sort \
    | uniq -c \
    | sed 's/ \+//' \
    | awk 'BEGIN{FS=" "}{print $2,$3,$1}'

ADD REPLY • link 5.8 years ago by 5heikki 11k

0

Entering edit mode

Quiet ok

1 Sb-40 1
1 Sj-A 1
10 Sb-40 1
100 Sj-A 1
101 Sb-40 1
102 Sj-A 1
103 Sb-40 1
104 Sj-A 1
105 Sb-40 1
106 Sb-40 1
107 Sb-40 4
107 Sj-A 1

I'll sort that out in a spreadsheet … thanks

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

0

Entering edit mode

Issue: it forgets the 0 values…

1   Sj-A    1   1   Sb-40   1
2   Sj-A    2   2   Sb-40   7
7   Sj-A    240 3   Sb-40   1
9   Sj-A    30443   4   Sb-40   1
17  Sj-A    2   5   Sb-40   1
19  Sj-A    1   6   Sb-40   4
23  Sj-A    1   7   Sb-40   363

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

0

Entering edit mode

Replace [1-9] with [0-9] in awk command provided by 5heikki.

ADD REPLY • link 5.8 years ago by ahaswer ▴ 150

0

Entering edit mode

It still omit to say when there is 0

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

0

Entering edit mode

Basically, the final output I need is something like:

    id  Sj-A    Sb-40
    0   4267    2962
    1   1   1
    2   2   7
    3   0   1
    4   0   1
    5   0   1
    6   0   4
    7   240 363
    8   0   1
    9   30443   20499
    10  0   1

I need to know when there is 0 for one plant in a cluster

ADD REPLY • link 5.8 years ago by gbl1 ▴ 80

0

Entering edit mode

I recommend you take this to stackoverflow (be sure to describe your problem well)

ADD REPLY • link 5.8 years ago by 5heikki 11k