Hi
I'm guessing that this is due to gene homology. I'm trying to create heat maps in R, which requires that each entry (a row for each gene, with gene copy counts from all replicates) be unique, so such duplicate gene names cause a problem.
I'd be happy to get suggestions on what should be done in such a case.
I presume what you've encountered are either paralogs or gene isoforms.
There are two approaches you can try to get rid of this "redundancy" for the purpose of getting a single gene-equivalent.
1) Cluster the transcriptome at some sequence identity threshold (e.g., 90% coverage by the longer sequence over the shorter sequence) using a tool like MMseqs2 or CD-HIT.
2) If the transcriptome is de novo assembled using an assembler like Trinity, you can take advantage of the gene-isoform relationships indicated in the sequence headers to retain one isoform per gene "cluster".
Technically you could also apply both options together (first option 2, then option 1 in this case).
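For option 2, here is a rough Python sketch of how the gene-isoform relationship could be read from Trinity-style IDs. It assumes IDs of the usual form TRINITY_DN1000_c115_g5_i1, where the trailing _i<N> marks the isoform. Keeping the longest isoform is just one possible rule (my assumption, not something Trinity prescribes), and the `lengths` dictionary is a hypothetical input you would build from your FASTA file.

```python
import re

# Assumption: Trinity transcript IDs end in _i<N> (the isoform number),
# and everything before that suffix identifies the gene.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def trinity_gene_id(transcript_id):
    """Strip the trailing isoform suffix to get the gene-level ID."""
    return ISOFORM_SUFFIX.sub("", transcript_id)


def longest_isoform_per_gene(lengths):
    """Pick one representative isoform per gene.

    `lengths` maps transcript ID -> sequence length (hypothetical input).
    Returns a dict of gene ID -> ID of its longest isoform.
    """
    best = {}
    for transcript_id, length in lengths.items():
        gene = trinity_gene_id(transcript_id)
        # Keep the first isoform seen, or replace it if this one is longer.
        if gene not in best or length > lengths[best[gene]]:
            best[gene] = transcript_id
    return best
```

Any other selection rule (most highly expressed isoform, longest ORF, etc.) would slot into the same loop.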
In any case, you can sum up the counts for each set of sequences now represented by your chosen sequence and simply assign those counts to it. These can then be supplied to your heatmap plotting function.
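The summing step could look roughly like the Python sketch below. The names `counts` (sequence ID -> list of per-replicate counts) and `group_of` (sequence ID -> chosen representative) are hypothetical; you would fill `group_of` from whichever clustering or isoform-selection step you used. The same logic can be written in R before plotting.

```python
def sum_counts_by_group(counts, group_of):
    """Collapse per-sequence counts into one row per representative.

    counts:   dict of sequence ID -> list of counts, one per replicate
    group_of: dict of sequence ID -> representative ID for its group
    Sequences missing from `group_of` are treated as their own group.
    """
    summed = {}
    for seq_id, row in counts.items():
        rep = group_of.get(seq_id, seq_id)
        if rep not in summed:
            # Copy so the input rows are not modified.
            summed[rep] = list(row)
        else:
            # Add replicate by replicate.
            summed[rep] = [a + b for a, b in zip(summed[rep], row)]
    return summed


# Toy example: two isoforms of geneA collapse into a single row.
counts = {"geneA_i1": [1, 2], "geneA_i2": [3, 4], "geneB_i1": [5, 6]}
group_of = {"geneA_i1": "geneA", "geneA_i2": "geneA", "geneB_i1": "geneB"}
collapsed = sum_counts_by_group(counts, group_of)
```

Each key in the result is unique, so the rows can go into the heatmap matrix without duplicate names.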