4.3 years ago
jamie.pike
I have recently clustered a set of proteins using CD-HIT and I was wondering if anyone could recommend a nice way to visualise the clusters (there are 61 clusters in total)?
Example of clusters from CD-HIT:
>Cluster 0
0 574aa, >GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_39024-44226_26... *
>Cluster 1
0 401aa, >GCA_000260195.2_FO_II5_V1_genomic.fna_Candidate_Sequence_g6.t1... *
1 108aa, >GCA_000260195.2_FO_II5_V1_genomic.fna_Candidate_Sequence_59093-64307_60... at 93.52%
2 401aa, >GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_g5.t1... at 100.00%
3 108aa, >GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_16327-21541_28... at 93.52%
4 401aa, >GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_g20.t1... at 99.75%
5 108aa, >GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_25103-30317_60... at 92.59%
6 401aa, >GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_g13.t1... at 100.00%
7 108aa, >GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_2796127-2801341_60... at 93.52%
8 401aa, >GWHAASU00000000_FocTR4_58.genomic.fna_Candidate_Sequence_g6.t1... at 100.00%
9 108aa, >GWHAASU00000000_FocTR4_58.genomic.fna_Candidate_Sequence_5739325-5744539_60... at 93.52%
>Cluster 2
0 373aa, >GCA_005930515.1_160527_genomic.fna_Candidate_Sequence_g6.t1... *
>Cluster 3
0 371aa, >GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_g2.t1... *
1 79aa, >GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_26225-31435_21... at 98.73%
2 30aa, >GCA_001696625.1_C1HIR_9889_genomic.fna_Candidate_Sequence_760-3765_21... at 100.00%
3 371aa, >GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_g18.t1... at 99.73%
4 79aa, >GCA_007994515.1_UK0001_genomic.fna_Candidate_Sequence_2897725-2902935_56... at 97.47%
5 371aa, >GWHAASU00000000_FocTR4_58.genomic.fna_Candidate_Sequence_g13.t1... at 99.73%
6 79aa, >GWHAASU00000000_FocTR4_58.genomic.fna_Candidate_Sequence_5872866-5878075_56... at 97.47%
Visualize in what way? You could build multiple sequence alignments for each of these clusters.
I was thinking of something along the lines of a binary heatmap, if that makes sense? The protein clusters are from 9 genomes, 5 from one strain and 4 from another. I was hoping to create something that shows a protein's presence in one strain and absence in the other. I wasn't sure if there was something out there already capable of doing that. If not, I could potentially pick some clusters for MSA and present those. I suppose there's always a simple table indicating presence or absence. Just thought I'd see if anyone had any ideas.
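One possible way to get the binary heatmap described above is to parse the CD-HIT `.clstr` output into a cluster-by-genome presence/absence matrix and plot that. The sketch below is only an illustration, not an existing tool. It assumes that the genome name is everything before `_Candidate_Sequence` in each header (as in the example clusters above) and that matplotlib is available for the optional plot. The function names `parse_clstr`, `presence_matrix` and `plot_heatmap` and the output file name are made up for this example.

```python
import re

# Member lines look like: "0 574aa, >NAME... *"; capture NAME up to the "...".
MEMBER = re.compile(r">(.+?)\.\.\.")


def parse_clstr(lines):
    """Map each cluster name to the set of genomes with a member in it."""
    clusters = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">Cluster"):
            current = line[1:]  # e.g. "Cluster 0"
            clusters[current] = set()
        elif line and current is not None:
            match = MEMBER.search(line)
            if match:
                # Assumption: the genome is the part before "_Candidate_Sequence".
                genome = match.group(1).split("_Candidate_Sequence")[0]
                clusters[current].add(genome)
    return clusters


def presence_matrix(clusters):
    """Return (cluster names, genome names, rows of 0/1 values)."""
    genomes = sorted({g for members in clusters.values() for g in members})
    names = list(clusters)
    matrix = [[int(g in clusters[c]) for g in genomes] for c in names]
    return names, genomes, matrix


def plot_heatmap(names, genomes, matrix, out="clusters.png"):
    """Optional: draw the 0/1 matrix as a heatmap (requires matplotlib)."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.imshow(matrix, cmap="Greys", aspect="auto", vmin=0, vmax=1)
    ax.set_xticks(range(len(genomes)))
    ax.set_xticklabels(genomes, rotation=90)
    ax.set_yticks(range(len(names)))
    ax.set_yticklabels(names)
    fig.tight_layout()
    fig.savefig(out)


# Two clusters copied from the example output in the question.
example = """>Cluster 0
0 574aa, >GCA_000350365.1_Foc4_1.0_B2_genomic.fna_Candidate_Sequence_39024-44226_26... *
>Cluster 2
0 373aa, >GCA_005930515.1_160527_genomic.fna_Candidate_Sequence_g6.t1... *
""".splitlines()

names, genomes, matrix = presence_matrix(parse_clstr(example))
for name, row in zip(names, matrix):
    print(name, row)
```

To run this on a real file, pass an open file handle to `parse_clstr` instead of `example`, then call `plot_heatmap(names, genomes, matrix)`. Ordering the genome columns by strain before plotting should make strain-specific clusters easier to spot.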