3.9 years ago
sharmatina189059
Hello. Can we retrieve all protein sequences in FASTA format from the clusters we get from CD-HIT?
I have code that will work to achieve this:
https://github.com/jrjhealey/bioinfo-tools/blob/master/ParseCDHIT.py
Just be aware that because of a limitation of the way CD-HIT writes the names out, all your sequences must be uniquely named (and ideally short). You will also need to ensure you run CD-HIT with the -d parameter set to 0.
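For anyone who wants to see the general idea without reading the full script, here is a minimal sketch. It is not the linked ParseCDHIT.py, just an illustration I am assuming matches the usual .clstr layout (a ">Cluster N" line followed by member lines such as "0	2799aa, >name... *"). The function names are hypothetical, and it assumes CD-HIT was run with -d 0 and that every sequence ID is unique.

```python
import re
from collections import defaultdict


def parse_clstr(lines):
    """Map cluster number -> list of sequence IDs, read from .clstr lines."""
    clusters = defaultdict(list)
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            # Header line, e.g. ">Cluster 0"
            current = int(line.split()[1])
        else:
            # Member line, e.g. "0\t2799aa, >name... *"
            match = re.search(r">(.+?)\.\.\.", line)
            if match and current is not None:
                clusters[current].append(match.group(1))
    return dict(clusters)


def read_fasta(lines):
    """Map sequence ID (first word of the header) -> sequence string."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None and line:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def cluster_fasta(clusters, records, cluster_id):
    """Return FASTA text for every member of one cluster found in records."""
    return "".join(
        f">{seq_id}\n{records[seq_id]}\n"
        for seq_id in clusters[cluster_id]
        if seq_id in records
    )
```

To use it, pass the open .clstr file to parse_clstr, the original input FASTA (not the representative-only output) to read_fasta, and write the string returned by cluster_fasta to one file per cluster. IDs that are missing from the FASTA are silently skipped, which you may prefer to turn into an error.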
There is not enough information in this one line question to provide a useful answer. Please add additional details. What are you clustering (DNA/Protein)? Where do you want to retrieve the sequence from?
I am running this command:
This gives me a FASTA file containing all the representative sequences (the longest one in each cluster) and a .clstr file listing all the clusters. I need to get all the protein sequences from cluster 0, cluster 1, and so on, for multiple sequence alignment.
make_multi_seq.pl (LINK), included in CD-HIT, will do what you need based on the description.

To run CD-HIT clustering, do we have to merge all the proteins into a single file? If so, how do we do that? Or is it possible to cluster all the FASTA files while keeping them in a single directory?
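As far as I know, CD-HIT reads one input file through -i, so the usual approach is to concatenate the per-protein files first. A minimal sketch of that merge step is below. The function name, the "*.fasta" pattern, and the file names in the usage note are assumptions for illustration, so adjust them to your own data.

```python
from pathlib import Path


def merge_fasta(input_dir, output_path, pattern="*.fasta"):
    """Concatenate every file matching pattern in input_dir into output_path.

    Returns the number of files merged.
    """
    output_path = Path(output_path)
    count = 0
    with output_path.open("w") as out:
        for path in sorted(Path(input_dir).glob(pattern)):
            # Skip the output file itself if it sits in the same directory.
            if path.resolve() == output_path.resolve():
                continue
            text = path.read_text()
            out.write(text)
            # Make sure the next header starts on its own line.
            if text and not text.endswith("\n"):
                out.write("\n")
            count += 1
    return count
```

For example, merge_fasta("proteins", "all_proteins.fasta") would write one combined file that can then be given to CD-HIT. Sequence names must still be unique across all the merged files, otherwise the clusters cannot be mapped back to sequences unambiguously.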