Hello helpful people,
I'm working with CD-HIT-EST to cluster nucleotide data. We are looking to take the output of each CD-HIT cluster, create a .fasta file with each of the sequences mentioned in each cluster, and then use each of those .fasta files to generate a multiple sequence alignment.
I'm writing to ask if there is a function or utility that can be used for this purpose. I looked around online and on Biostars for something that someone had previously used to do this, but I was not able to find anything. I can write a script (probably using Biopython) to pull out the sequences, but I was hoping someone had done this before me and I could use their work.
Are there any tools that exist/python libraries that are recommended that I can use to generate .fasta files with sequences from each of the clusters CD-HIT-EST generated?
It does not appear that there is a single tool to do this. You are on the right track. GPT was able to generate a code recommendation in 3 steps. Try asking it "how to extract fasta consensus from cd-hit clusters".
I have tried that a few times, and even with shiny new GPT 5 it was not able to make something that actually worked. Maybe I am not giving it enough context, but it hasn't been able to properly extract multiple transcript sequences from the fasta.clstr file, pull them from the .fasta file, and put them into a new file.
As in different splice variants?
We have an RNA-seq transcriptome that we ran CD-HIT-EST on, so it should contain different splice variants but also just different genes. I may be misunderstanding your question though.
The .clstr file looks like this:
There was a frameshift mutation in some transcripts during sequencing, and we are trying to identify when/where the mutation occurred. We want to use this data to manually verify sequence similarity (For example, the first 50% of our sequence is identical and the last 50% is totally new. If we add this a/t/c/g into our transcript, it is now highly similar/identical to our sequence of interest).
I was looking for a script that takes IsoSeq_HQ_transcript/0 out of the .fasta.clstr file and puts it into a .fasta file with the name Cluster_#.fasta, even if there is only one sequence per cluster.
Looking at the name of the sequences, it looks like this is a PacBio IsoSeq dataset. Is CD-HIT the only analysis you have tried so far? Perhaps the cluster you are looking at above does not contain full-length sequences. Did you consider that possibility?
In case you have not tried it already, you may want to use a tool that is provided by PacBio for this specific analysis. https://isoseq.how/
The Python script I suggested below does just that. The only thing you must do during the clustering step is set -d to a large number (like 200-300) so that the whole sequence name is preserved in the cluster file.
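For anyone following along, here is a minimal sketch of that kind of script. It is not the script referred to above, just one way it could be done with the standard library only. It assumes the usual .clstr layout (a ">Cluster N" line followed by member lines such as "0	1500nt, >IsoSeq_HQ_transcript/0... *") and that the names in the .clstr file match the first word of each FASTA header, which is why the names must not be truncated. The function names (parse_clstr, read_fasta, write_cluster_fastas) are made up for illustration.

```python
import re
from pathlib import Path

# Captures the sequence name between ">" and the trailing "..." in a
# .clstr member line, e.g. "0\t1500nt, >IsoSeq_HQ_transcript/0... *".
MEMBER_RE = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Return {cluster number: [sequence names]} from .clstr lines."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
            clusters[current] = []
        elif line and current is not None:
            match = MEMBER_RE.search(line)
            if match:
                clusters[current].append(match.group(1))
    return clusters


def read_fasta(lines):
    """Return {sequence name: sequence}, keyed by the first word of each header."""
    chunks = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            words = line[1:].split()
            name = words[0] if words else ""
            chunks[name] = []
        elif line and name is not None:
            chunks[name].append(line)
    return {key: "".join(parts) for key, parts in chunks.items()}


def write_cluster_fastas(clusters, records, out_dir):
    """Write one Cluster_<n>.fasta per cluster, including single-sequence clusters."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for number, names in sorted(clusters.items()):
        path = out_dir / f"Cluster_{number}.fasta"
        with open(path, "w") as handle:
            for name in names:
                if name not in records:
                    # Usually means the name was truncated in the .clstr file.
                    raise KeyError(f"{name} (cluster {number}) not found in FASTA")
                handle.write(f">{name}\n{records[name]}\n")
        written.append(path)
    return written
```

To use it, open the .clstr file and the original .fasta, pass the file handles to parse_clstr and read_fasta, and hand both results to write_cluster_fastas along with an output directory. Each resulting Cluster_#.fasta can then go straight into whatever aligner you prefer.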