Question

CD-HIT Clustering output

0

Entering edit mode

10.3 years ago

BSP • 0

Hello,

Currently I am analysing several metagenome and transcriptome datasets. For downstream analysis I am clustering these datasets with CD-HIT. Unfortunately CD-HIT produces no default output with an overview of sequence abundance within each cluster (simple tab delimited file). I found an old script which grabs this information but it uses the .bak.clstr output which was only available till version 4.3. Is there a way to produce this output in the newest version or an alternative way to produce a simple sequence abundance to cluster name overview?

Sincerely,
Felix

Clustering CD-HIT • 11k views

ADD COMMENT • link written 10.3 years ago by BSP • 0

0

Entering edit mode

By sequence abundance do you mean how many times the duplicate sequence is found in the cluster?

ADD REPLY • link 10.3 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

Yes, exactly!

ADD REPLY • link updated 2.9 years ago by Ram 44k • written 10.3 years ago by BSP • 0

0

Entering edit mode

Check if this useful. (assuming no empty lines between Cluster 0 and Cluster 1)

My input:

>Cluster 0
0       15679nt, >SB1234_Contig35475... at +/99.99%
1       15436nt, >SB1234_Contig35476... at +/99.62%
2       15764nt, >Contig18540... *
3       15438nt, >Contig39392... at +/99.69%
4       15679nt, >comp263440_c8_seq4... at -/99.99%
5       15667nt, >comp263440_c8_seq6... at -/99.99%
>Cluster 1
0       15684nt, >SB1234_Contig35474... at +/99.98%
1       15685nt, >Contig11682... *
2       15684nt, >comp263440_c8_seq3... at -/99.98%
3       15672nt, >comp263440_c8_seq5... at -/99.97%

My script

$TotalClusters=`grep -c '>Cluster' cd-hit.test.txt`;
for($i=0;$i<$TotalClusters;$i++)
{
$j=$i+1;
$lines=`perl -e 'while (<>){print if (/^>Cluster $i/../^>Cluster $j/);}' cd-hit.test.txt | wc -l`;
$linesExcludingPattern=$lines-2;
print "Cluster $i has $linesExcludingPattern sequences\n";
}

OUTPUT

Cluster 0 has 6 sequences
Cluster 1 has 4 sequences

Is this what you wanted?

ADD REPLY • link 10.3 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

Dear Prakki, I wonder if you have a solution if within each Cluster are different individuals like this:

>Cluster 0
0       15679nt, >SpecA_Contig35475... at +/99.99%
1       15436nt, >SpecA_Contig35476... at +/99.62%
2       15764nt, >SpecB_Contig18540... *
3       15438nt, >SpecA_Contig39392... at +/99.69%
4       15679nt, >SpecC_comp263440_c8_seq4... at -/99.99%
>Cluster 1
0       15684nt, >SpecC_SB1234_Contig35474... at +/99.98%
1       15685nt, >SpecC_Contig11682... *
>Cluster 2
0       15684nt, >SpecA_comp263440_c8_seq3... at -/99.98%
1       15672nt, >SpecB_comp263440_c8_seq5... at -/99.97%

and I would like to find out how many Clusters own only one, two or all three species as it is the case in Cluster 0?

Thanks a lot!

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by Elli • 0

0

Entering edit mode

the above script should serve your purpose!

ADD REPLY • link 9.1 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

Is this supposed to run in bash?

I get the following output...

./cdhit_parser.sh: line 2: =291: command not found
./cdhit_parser.sh: line 3: syntax error near unexpected token `('
./cdhit_parser.sh: line 3: `for($i=0;$i<$TotalClusters;$i++)'

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 7.5 years ago by germanbr • 0

0

Entering edit mode

Dear Prakki, I find your script super useful. This is something I exactly need. More specifically I need this script to split my cd-hit output for clusters having 1 sequence and more than 1 sequences. What I do is I rename my output.clstr file as cd-hit.test.txt. Then I copy your script and save in the same folder as perl clstr_num.pl. Then I simply run it by

$ perl clstr_num.pl

My output.clstr or cd-hit.test.txt file looks like this

 >Cluster 0
0   22431nt, wt1_800... *
>Cluster 1
0   21560nt, mock2_6320... *
>Cluster 2
0   2668nt, wt2_2859... at 99.06%
1   4470nt, wt2_5059... at 98.95%
2   17970nt, h-stress3_1002... *
>Cluster 3
0   17850nt, wt2_422... *
1   30nt, mock1_3879... at 100.00%
2   5548nt, c-stress2_3811... at 99.33%
>Cluster 4
0   17798nt, wt2_704... *
1   24nt, wt2_2505... at 100.00%
2   3064nt, h-stress2_3822... at 99.58%
3   3000nt, inf3_1050... at 99.67%
4   2727nt, c-stress2_2857... at 98.57%
>Cluster 5
0   4826nt, wt2_2557... at 99.11%
1   17143nt, h-stress1_643... *
>Cluster 6
0   16920nt, c-stress3_3308... *

but the script outputs

Cluster 0 has 1 sequences
Cluster 1 has 25423 sequences
Cluster 2 has 24537 sequences
Cluster 3 has 25804 sequences
Cluster 4 has 23779 sequences
Cluster 5 has 23693 sequences
Cluster 6 has 8353 sequences

Where I expect the output

Cluster 0 has 1 sequences
Cluster 1 has 1 sequences
Cluster 2 has 3 sequences
Cluster 3 has 3 sequences
Cluster 4 has 4 sequences
Cluster 5 has 2 sequences
Cluster 6 has 1 sequences

Can you please guide me how to get the expected output? Please excuse my childish language, I have 0 coding knowledge. It's actually my 1st comment on biostar. Your help would be a huge favor and highly appreciated. Thanks much, Shan

ADD REPLY • link 5.4 years ago by Syed Shan-e-Ali Zaidi • 0

0

Entering edit mode

Hi @syed. I just saw your message. The above script has actually worked for me. Let me show you, how I did.

$ cat test.fasta.clstr
>Cluster 0
0   22431nt, wt1_800... *
>Cluster 1
0   21560nt, mock2_6320... *
>Cluster 2
0   2668nt, wt2_2859... at 99.06%
1   4470nt, wt2_5059... at 98.95%
2   17970nt, h-stress3_1002... *
>Cluster 3
0   17850nt, wt2_422... *
1   30nt, mock1_3879... at 100.00%
2   5548nt, c-stress2_3811... at 99.33%
>Cluster 4
0   17798nt, wt2_704... *
1   24nt, wt2_2505... at 100.00%
2   3064nt, h-stress2_3822... at 99.58%
3   3000nt, inf3_1050... at 99.67%
4   2727nt, c-stress2_2857... at 98.57%
>Cluster 5
0   4826nt, wt2_2557... at 99.11%
1   17143nt, h-stress1_643... *
>Cluster 6
0   16920nt, c-stress3_3308... *

$ cat test.pl
$TotalClusters=`grep -c '>Cluster' test.fasta.clstr`;
for($i=0;$i<$TotalClusters;$i++)
{
$j=$i+1;
$lines=`perl -e 'while (<>){print if (/^>Cluster $i\n/../^>Cluster $j\n/);}' test.fasta.clstr | wc -l`;
$linesExcludingPattern=$lines-2;
print "Cluster $i has $linesExcludingPattern sequences\n";
}

$ perl test.pl
Cluster 0 has 1 sequences
Cluster 1 has 1 sequences
Cluster 2 has 3 sequences
Cluster 3 has 3 sequences
Cluster 4 has 5 sequences
Cluster 5 has 2 sequences
Cluster 6 has 1 sequences

I just had to change the filenames. I did not change any code. Make sure you input is properly formatted as the example used above. Then, I guess the script should work.

ADD REPLY • link 3.9 years ago by Prakki Rama ★ 2.7k

0

Entering edit mode

Hi there,

This script does not work very well - wrong count. Particulary for large output..

ADD REPLY • link 5.1 years ago by geizetomazetto • 0

score 2 · Answer 1 · 2020-12-08

2

Entering edit mode

4.0 years ago

bagdevi.mishra ▴ 110

In this code, when Cluster 2 is expected to be picked Cluster 20000 will be picked if there are 20000 number of clusters, if there are 27 clusters, Cluster 27 will be picked. In the examples it looks as if it worked fine because there are less than 10 Clusters in the examples.

add \n to Cluster $i and Cluster $j as follows and the script would work fine.

$lines=perl -e 'while (<>){print if (/^>Cluster $i\n/../^>Cluster $j\n/);}' clusters_0.5n2.clstr | wc -l;