Currently I am analysing several metagenome and transcriptome datasets. For downstream analysis I am clustering these datasets with CD-HIT. Unfortunately CD-HIT produces no default output with an overview of sequence abundance within each cluster (simple tab delimited file). I found an old script which grabs this information but it uses the .bak.clstr output which was only available till version 4.3. Is there a way to produce this output in the newest version or an alternative way to produce a simple sequence abundance to cluster name overview?
./cdhit_parser.sh: line 2: =291: command not found
./cdhit_parser.sh: line 3: syntax error near unexpected token `('
./cdhit_parser.sh: line 3: `for($i=0;$i<$TotalClusters;$i++)'
Dear Prakki,
I find your script super useful. This is something I exactly need. More specifically I need this script to split my cd-hit output for clusters having 1 sequence and more than 1 sequences. What I do is I rename my output.clstr file as cd-hit.test.txt. Then I copy your script and save in the same folder as perl clstr_num.pl. Then I simply run it by
$ perl clstr_num.pl
My output.clstr or cd-hit.test.txt file looks like this
>Cluster 0
0 22431nt, wt1_800... *
>Cluster 1
0 21560nt, mock2_6320... *
>Cluster 2
0 2668nt, wt2_2859... at 99.06%
1 4470nt, wt2_5059... at 98.95%
2 17970nt, h-stress3_1002... *
>Cluster 3
0 17850nt, wt2_422... *
1 30nt, mock1_3879... at 100.00%
2 5548nt, c-stress2_3811... at 99.33%
>Cluster 4
0 17798nt, wt2_704... *
1 24nt, wt2_2505... at 100.00%
2 3064nt, h-stress2_3822... at 99.58%
3 3000nt, inf3_1050... at 99.67%
4 2727nt, c-stress2_2857... at 98.57%
>Cluster 5
0 4826nt, wt2_2557... at 99.11%
1 17143nt, h-stress1_643... *
>Cluster 6
0 16920nt, c-stress3_3308... *
but the script outputs
Cluster 0 has 1 sequences
Cluster 1 has 25423 sequences
Cluster 2 has 24537 sequences
Cluster 3 has 25804 sequences
Cluster 4 has 23779 sequences
Cluster 5 has 23693 sequences
Cluster 6 has 8353 sequences
Where I expect the output
Cluster 0 has 1 sequences
Cluster 1 has 1 sequences
Cluster 2 has 3 sequences
Cluster 3 has 3 sequences
Cluster 4 has 4 sequences
Cluster 5 has 2 sequences
Cluster 6 has 1 sequences
Can you please guide me how to get the expected output?
Please excuse my childish language, I have 0 coding knowledge. It's actually my 1st comment on biostar.
Your help would be a huge favor and highly appreciated.
Thanks much,
Shan
Hi @syed. I just saw your message. The above script has actually worked for me. Let me show you, how I did.
$ cat test.fasta.clstr
>Cluster 0
0 22431nt, wt1_800... *
>Cluster 1
0 21560nt, mock2_6320... *
>Cluster 2
0 2668nt, wt2_2859... at 99.06%
1 4470nt, wt2_5059... at 98.95%
2 17970nt, h-stress3_1002... *
>Cluster 3
0 17850nt, wt2_422... *
1 30nt, mock1_3879... at 100.00%
2 5548nt, c-stress2_3811... at 99.33%
>Cluster 4
0 17798nt, wt2_704... *
1 24nt, wt2_2505... at 100.00%
2 3064nt, h-stress2_3822... at 99.58%
3 3000nt, inf3_1050... at 99.67%
4 2727nt, c-stress2_2857... at 98.57%
>Cluster 5
0 4826nt, wt2_2557... at 99.11%
1 17143nt, h-stress1_643... *
>Cluster 6
0 16920nt, c-stress3_3308... *
$ cat test.pl
$TotalClusters=`grep -c '>Cluster' test.fasta.clstr`;
for($i=0;$i<$TotalClusters;$i++)
{
$j=$i+1;
$lines=`perl -e 'while (<>){print if (/^>Cluster $i\n/../^>Cluster $j\n/);}' test.fasta.clstr | wc -l`;
$linesExcludingPattern=$lines-2;
print "Cluster $i has $linesExcludingPattern sequences\n";
}
$ perl test.pl
Cluster 0 has 1 sequences
Cluster 1 has 1 sequences
Cluster 2 has 3 sequences
Cluster 3 has 3 sequences
Cluster 4 has 5 sequences
Cluster 5 has 2 sequences
Cluster 6 has 1 sequences
I just had to change the filenames. I did not change any code. Make sure you input is properly formatted as the example used above. Then, I guess the script should work.
In this code, when Cluster 2 is expected to be picked Cluster 20000 will be picked if there are 20000 number of clusters, if there are 27 clusters, Cluster 27 will be picked. In the examples it looks as if it worked fine because there are less than 10 Clusters in the examples.
add \n to Cluster $i and Cluster $j as follows and the script would work fine.
By sequence abundance do you mean how many times the duplicate sequence is found in the cluster?
Yes, exactly!
Check if this useful. (assuming no empty lines between Cluster 0 and Cluster 1)
My input:
My script
OUTPUT
Is this what you wanted?
Dear Prakki, I wonder if you have a solution if within each Cluster are different individuals like this:
and I would like to find out how many Clusters own only one, two or all three species as it is the case in Cluster 0?
Thanks a lot!
the above script should serve your purpose!
Is this supposed to run in bash?
I get the following output...
Dear Prakki, I find your script super useful. This is something I exactly need. More specifically I need this script to split my cd-hit output for clusters having 1 sequence and more than 1 sequences. What I do is I rename my output.clstr file as cd-hit.test.txt. Then I copy your script and save in the same folder as perl clstr_num.pl. Then I simply run it by
My output.clstr or cd-hit.test.txt file looks like this
but the script outputs
Where I expect the output
Can you please guide me how to get the expected output? Please excuse my childish language, I have 0 coding knowledge. It's actually my 1st comment on biostar. Your help would be a huge favor and highly appreciated. Thanks much, Shan
Hi @syed. I just saw your message. The above script has actually worked for me. Let me show you, how I did.
I just had to change the filenames. I did not change any code. Make sure you input is properly formatted as the example used above. Then, I guess the script should work.
Hi there,
This script does not work very well - wrong count. Particulary for large output..