Question

Efficiently Converting CD-HIT-EST Clusters into .fasta Files

23 days ago

mtsrn • 0

Hello helpful people,

I'm working with CD-HIT-EST to cluster nucleotide data. We are looking to take the output of each CD-HIT cluster, create a .fasta file with each of the sequences mentioned in each cluster, and then use each of those .fasta files to generate a multiple sequence alignment.

Is there a function or utility that can be used for this purpose? I looked around online and on Biostars for something that someone had previously used to do this, but I was not able to find anything. I can write a script (probably using Biopython) to pull out the sequences, but I was hoping someone had done this before me and I could use their work.

Are there any existing tools or recommended Python libraries that I can use to generate .fasta files with the sequences from each of the clusters CD-HIT-EST generated?

python MSA clustering cd-hit • 770 views

ADD COMMENT • link updated 20 days ago by Joe 22k • written 23 days ago by mtsrn • 0

It does not appear that there is a single tool to do this. You are on the right track. GPT was able to generate a three-step code recommendation. Try asking "how to extract fasta consensus from cd-hit clusters".

ADD REPLY • link 23 days ago by GenoMax 153k

I have tried that a few times, and even with the shiny new GPT 5 it was not able to produce something that actually worked. Maybe I am not giving it enough context, but it hasn't been able to properly extract multiple transcript sequences from the fasta.clstr file, pull them from the .fasta file, and put them into a new file.

ADD REPLY • link 23 days ago by mtsrn • 0

"extract multiple transcript sequences"

As in different splice variants?

ADD REPLY • link 23 days ago by GenoMax 153k

We have an RNA-seq transcriptome that we ran CD-HIT-EST on, so it should contain different splice variants but also just different genes. I may be misunderstanding your question, though.

The .clstr file looks like this:

>Cluster 0
0   13149nt, >IsoSeq_HQ_transcript/0... *
1   11408nt, >IsoSeq_HQ_transcript/3... at +/99.97%
2   10281nt, >IsoSeq_HQ_transcript/11... at +/99.97%
3   10020nt, >IsoSeq_HQ_transcript/15... at +/99.97%

There was a frameshift mutation in some transcripts during sequencing, and we are trying to identify when/where the mutation occurred. We want to use this data to manually verify sequence similarity (For example, the first 50% of our sequence is identical and the last 50% is totally new. If we add this a/t/c/g into our transcript, it is now highly similar/identical to our sequence of interest).

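I was looking for a script that takes each sequence ID (for example, IsoSeq_HQ_transcript/0) listed under a cluster in the .fasta.clstr file, pulls the matching sequence from the .fasta file, and writes it to a .fasta file named Cluster_#.fasta, even if there is only one sequence in the cluster.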
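
For illustration, the task described above could be sketched roughly as follows. This is a minimal standard-library sketch, not the ParseCDHIT.py script mentioned elsewhere in this thread. The function names (parse_clstr, read_fasta, write_cluster_fastas) are hypothetical, and the sketch assumes that the ID shown in the .clstr file is not truncated and matches the first whitespace-delimited word of the corresponding FASTA header.

```python
import re
from collections import defaultdict
from pathlib import Path

# Member lines look like: "0   13149nt, >IsoSeq_HQ_transcript/0... *"
MEMBER_RE = re.compile(r">(.+?)\.\.\.(?:\s|$)")


def first_word(text):
    """Return the first whitespace-delimited word of text, or '' if there is none."""
    parts = text.split()
    return parts[0] if parts else ""


def parse_clstr(clstr_path):
    """Return {cluster_number: [sequence_id, ...]} from a CD-HIT .clstr file."""
    clusters = defaultdict(list)
    current = None
    with open(clstr_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">Cluster"):
                current = int(line.split()[1])
            elif line and current is not None:
                match = MEMBER_RE.search(line)
                if match:
                    clusters[current].append(first_word(match.group(1)))
    return dict(clusters)


def read_fasta(fasta_path):
    """Return {sequence_id: (full_header, sequence)}, keyed by the header's first word."""
    records = {}
    header, chunks = None, []
    with open(fasta_path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    records[first_word(header)] = (header, "".join(chunks))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line.strip())
    if header is not None:
        records[first_word(header)] = (header, "".join(chunks))
    return records


def write_cluster_fastas(clstr_path, fasta_path, out_dir):
    """Write one Cluster_<n>.fasta per cluster; return IDs that had no FASTA match."""
    clusters = parse_clstr(clstr_path)
    records = read_fasta(fasta_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    missing = []
    for number, ids in clusters.items():
        with open(out_dir / f"Cluster_{number}.fasta", "w") as out:
            for seq_id in ids:
                if seq_id not in records:
                    missing.append(seq_id)  # likely a truncated or altered ID
                    continue
                header, sequence = records[seq_id]
                out.write(f">{header}\n{sequence}\n")
    return missing
```

Calling write_cluster_fastas("transcripts.fasta.clstr", "transcripts.fasta", "clusters") (file names are placeholders) would write one file per cluster, including single-sequence clusters. A non-empty return value suggests that some IDs in the .clstr file did not match any FASTA header exactly, which is worth checking before running alignments.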

ADD REPLY • link 23 days ago by mtsrn • 0

Looking at the names of the sequences, it looks like this is a PacBio IsoSeq dataset. Is CD-HIT the only analysis you have tried so far? Perhaps the cluster you are looking at above does not contain full-length sequences. Did you consider that possibility?

In case you have not tried it already, you may want to use a tool that is provided by PacBio for this specific analysis. https://isoseq.how/

ADD REPLY • link 23 days ago by GenoMax 153k

The Python script I suggested below does just that. The only thing you must do during the clustering step is set -d to a large number (like 200-300) so that the whole sequence name is preserved in the cluster file.
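
As a rough sketch of that clustering step, a cd-hit-est call with a long description length could be assembled from Python as shown here. The helper names and file names are hypothetical, only the -i, -o and -d options are set, and any other options (identity threshold, threads, memory) should be whatever was used in the original run.

```python
import shutil
import subprocess


def build_cdhit_est_command(input_fasta, output_prefix, description_length=300):
    """Build a cd-hit-est argument list with a long description length (-d)."""
    return [
        "cd-hit-est",
        "-i", str(input_fasta),          # input FASTA
        "-o", str(output_prefix),        # output name; the cluster file is <output>.clstr
        "-d", str(description_length),   # how much of each header is kept in the .clstr file
    ]


def run_cdhit_est(input_fasta, output_prefix, description_length=300):
    """Run cd-hit-est and return the expected path of the .clstr file."""
    if shutil.which("cd-hit-est") is None:
        raise FileNotFoundError("cd-hit-est was not found on PATH")
    command = build_cdhit_est_command(input_fasta, output_prefix, description_length)
    subprocess.run(command, check=True)
    return f"{output_prefix}.clstr"
```

Building the argument list separately from running it makes it easy to print and inspect the exact command before committing to a long clustering run.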

ADD REPLY • link 22 days ago by Mensur Dlakic ★ 29k

score 0 · Answer 1 · 2025-08-08

23 days ago

Mensur Dlakic ★ 29k

You may want to try ParseCDHIT.py from this collection. I think Joe might eventually see your post should you have additional questions.

ADD COMMENT • link 23 days ago by Mensur Dlakic ★ 29k

Thanks for linking that Mensur.

@OP, the script linked above will do what you are looking for I think, but just a few words of warning:

To the best of my knowledge, there isn't a 100% reliable way to do this. I was never able to find any existing tools or built-in functions either, though more recent updates may have changed this (it was a long time ago).
My script will work, but you must make absolutely sure that the headers of your sequences are all unique and that they differentiate the sequences adequately, ideally within the first 10-20 characters. Even with the -d 0 flag that the script instructions mention, CD-HIT will still truncate the Seq IDs in the .clstr file, if memory serves.
It's highly likely the script can be improved upon, as I wrote it a while back. It will work assuming nothing has changed in the output format.
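
The header check described above can be sketched as a small pre-flight test run before clustering. The function names are hypothetical, and the default prefix length of 20 simply mirrors the upper end of the 10-20 character range mentioned above; adjust it to whatever -d value is used.

```python
from collections import Counter, defaultdict


def read_headers(fasta_path):
    """Return all FASTA header lines (without the leading '>')."""
    with open(fasta_path) as handle:
        return [line[1:].rstrip("\n") for line in handle if line.startswith(">")]


def find_duplicate_headers(headers):
    """Return headers that occur more than once, sorted."""
    counts = Counter(headers)
    return sorted(name for name, count in counts.items() if count > 1)


def find_ambiguous_prefixes(headers, prefix_len=20):
    """Return {prefix: [headers]} for distinct headers sharing the same first prefix_len characters."""
    groups = defaultdict(set)
    for header in headers:
        groups[header[:prefix_len]].add(header)
    return {
        prefix: sorted(names)
        for prefix, names in groups.items()
        if len(names) > 1
    }
```

If find_ambiguous_prefixes returns anything for the prefix length in use, IDs truncated to that length in the .clstr file could no longer be mapped back to a single sequence, so either rename the sequences or increase the description length before clustering.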

ADD REPLY • link 20 days ago by Joe 22k