Question

RAST annotation of RNA-seq transcriptome results in some genes appearing more than once

0


14 months ago

langziv ▴ 70

Hi

I'm guessing that this is due to gene homology. I'm trying to create heat maps in R, which requires each entry (a row for each gene, with that gene's counts from all replicates) to be unique, so duplicate gene names cause a problem.

I'd be happy to get suggestions on what should be done in such a case.

R RNA-seq • 736 views

updated 13 months ago by Dunois ★ 2.8k • written 14 months ago by langziv ▴ 70

1


I presume what you've encountered are either paralogs or gene isoforms.

There are two approaches you can try to get rid of this "redundancy" for the purpose of getting a single gene-equivalent.

1) Cluster the transcriptome at some sequence identity and coverage threshold (e.g., requiring the longer sequence to cover 90% of the shorter sequence) using a tool like MMseqs2 or CD-HIT.

2) If the transcriptome is de novo assembled using an assembler like Trinity, you can take advantage of the gene-isoform relationships indicated in the sequence headers to retain one isoform per gene "cluster".

Technically, you could also apply both options together (option 2 first, then option 1).
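As a rough illustration of option 2, here is a minimal sketch in Python. It assumes Trinity-style transcript IDs, where the trailing `_i<N>` marks the isoform and everything before it identifies the gene. The IDs and lengths are made up, and keeping the longest isoform is only one possible selection rule (an assumption here, not something Trinity prescribes).

```python
import re

# Assumption: Trinity-style IDs end in "_i<N>" (isoform number);
# the part before that suffix is treated as the gene ID.
ISOFORM_SUFFIX = re.compile(r"_i\d+$")


def trinity_gene_id(transcript_id):
    """Strip the trailing isoform suffix from a Trinity-style transcript ID."""
    return ISOFORM_SUFFIX.sub("", transcript_id)


def longest_isoform_per_gene(lengths):
    """Given {transcript_id: length}, map each gene ID to its longest isoform."""
    best = {}
    for transcript_id, length in lengths.items():
        gene = trinity_gene_id(transcript_id)
        # Keep this isoform if the gene is new or this one is longer.
        if gene not in best or length > lengths[best[gene]]:
            best[gene] = transcript_id
    return best


# Hypothetical transcript lengths, for illustration only.
lengths = {
    "TRINITY_DN1000_c115_g5_i1": 1200,
    "TRINITY_DN1000_c115_g5_i2": 1850,
    "TRINITY_DN2001_c0_g1_i1": 640,
}
representatives = longest_isoform_per_gene(lengths)
print(representatives)
```

The same gene ID can then serve as the grouping key when collapsing counts.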

In any case, you can sum up the counts for each set of sequences now represented by your chosen sequence and simply assign those counts to it. These can then be supplied to your heatmap plotting function.
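The summing step could look roughly like this. It is a sketch in plain Python with hypothetical gene names and counts, where each row is a gene name plus one count per replicate. In R, base `rowsum()` with the gene names as the grouping vector does the same job on a count matrix.

```python
def sum_counts_by_gene(rows):
    """Collapse rows that share a gene name by summing their per-replicate counts.

    `rows` is an iterable of (gene_name, [count_rep1, count_rep2, ...]).
    """
    totals = {}
    for gene, counts in rows:
        if gene not in totals:
            totals[gene] = list(counts)
            continue
        if len(counts) != len(totals[gene]):
            raise ValueError(f"replicate count mismatch for {gene}")
        # Element-wise sum across replicates.
        totals[gene] = [a + b for a, b in zip(totals[gene], counts)]
    return totals


# Hypothetical annotated count table: geneA appears twice.
rows = [
    ("geneA", [10, 12, 9]),
    ("geneB", [3, 0, 5]),
    ("geneA", [4, 1, 2]),
]
collapsed = sum_counts_by_gene(rows)
print(collapsed)
```

After this, every gene name occurs exactly once, so it can be used as a row name for the heatmap matrix.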

13 months ago by Dunois ★ 2.8k

score 1 · Answer 1 · 2023-10-31

As always, it depends on your research question. If you are interested in "functionality", or in doing the analysis at the gene level, then feel free to sum up the reads from the transcripts. If you are interested in keeping the transcripts separate, just rename them. Depending on the organism and the identity between the paralogs, the latter might make more sense anyway, given that different transcripts might still function differently (e.g., due to mutations or differences in promoters, splicing, and so on).
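For the renaming route, one simple scheme is to keep the first occurrence of a name and append `.1`, `.2`, ... to later ones, which is essentially what base R's `make.unique()` does. A minimal Python sketch of that idea follows; the gene names are hypothetical, and the helper is written for illustration rather than taken from any library.

```python
def make_unique(names, sep="."):
    """Return names with duplicates disambiguated by a numeric suffix."""
    taken = set()       # names already emitted
    next_suffix = {}    # last suffix number tried for each original name
    result = []
    for name in names:
        candidate = name
        n = next_suffix.get(name, 0)
        # Bump the suffix until the candidate is not yet in use.
        while candidate in taken:
            n += 1
            candidate = f"{name}{sep}{n}"
        next_suffix[name] = n
        taken.add(candidate)
        result.append(candidate)
    return result


# Hypothetical annotation output with a repeated gene name.
gene_names = ["dnaK", "recA", "dnaK", "dnaK"]
unique_names = make_unique(gene_names)
print(unique_names)
```

This keeps every transcript as its own heatmap row while still showing which rows share an annotation.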