Question

How to deal with multiple transcripts of the same gene?

3.2 years ago

cthangav ▴ 110

I'm working with RNA-seq data that has multiple entries per gene symbol (these appear to be alternatively spliced gene transcripts) like this:

Name    Length  EffectiveLength TPM NumReads
ENSMUST00000162897.1|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000086625.1|Xkr4-003|Xkr4|4153|processed_transcript|    4153    3904    0   0
ENSMUST00000159265.1|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000086624.1|Xkr4-002|Xkr4|2989|processed_transcript|  2989    2740    0   0
ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|protein_coding|  3634    3385    0   0

The software I'm using (called PECA) only looks at the gene symbol and TPM, so it takes the gene symbol Xkr4 and may assign multiple values to it. I'm wondering if there is a good way to simplify this file so that there is only one value per gene symbol.

Could I just remove the ones that aren't protein coding? Or take the sum/average for all the transcript variations of a given gene?

Thanks for your advice.

biology RNA-seq splicing transcripts RNA. genes • 1.6k views

ADD COMMENT • link updated 3.2 years ago by ATpoint 88k • written 3.2 years ago by cthangav ▴ 110

Actually further down it seems there are multiple coding transcripts per gene.

ENSMUST00000027035.7|ENSMUSG00000025902.11|OTTMUSG00000050014.6|OTTMUST00000127245.2|Sox17-001|Sox17|3127|protein_coding|   3127    2878    0   0
ENSMUST00000195555.1|ENSMUSG00000025902.11|OTTMUSG00000050014.6|OTTMUST00000127249.1|Sox17-005|Sox17|1977|protein_coding|   1977    1728    0   0
ENSMUST00000192650.3|ENSMUSG00000025902.11|OTTMUSG00000050014.6|OTTMUST00000127247.2|Sox17-004|Sox17|3242|protein_coding|   3242    2993    0   0
ENSMUST00000116652.5|ENSMUSG00000025902.11|OTTMUSG00000050014.6|OTTMUST00000127246.1|Sox17-002|Sox17|1512|protein_coding|   1512    1263    0.0111391   1
ENSMUST00000192505.1|ENSMUSG00000025902.11|OTTMUSG00000050014.6|OTTMUST00000127268.1|Sox17-008|Sox17|1148|retained_intron|  1148    899 0   0
ENSMUST00000191647.1|ENSMUSG00000025902.11|OTTMUSG00000050014.6|OTTMUST00000127267.1|Sox17-007|Sox17|406|protein_coding|    406 157 0   0
ENSMUST00000191939.1|ENSMUSG00000025902.11|OTTMUSG00000050014.6|OTTMUST00000127266.1|Sox17-006|Sox17|840|protein_coding|    840 591 0   0
ENSMUST00000192913.1|ENSMUSG00000025902.11|OTTMUSG00000050014.6|OTTMUST00000127248.1|Sox17-003|Sox17|1506|protein_coding|   1506    1257    0   0

Is there a way to consolidate these? I've seen RNA-seq data sets with just one TPM per gene symbol, but I'm not sure how those files consolidate multiple transcripts.

Thanks for your advice.

ADD REPLY • link 3.2 years ago by cthangav ▴ 110

score 1 · Accepted Answer · 2022-03-01

In a gene-level (rather than transcript-level) analysis, the usual approach is to aggregate the transcripts to the gene level. The most naive way is simply to sum the counts of the transcripts per gene. The better, more elaborate way is to use tximport (https://bioconductor.org/packages/release/bioc/html/tximport.html), which is designed for exactly that. It has a mode for salmon output, so it is easy to do: just follow the manual. You end up with a single line per gene.
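
For illustration only, here is a minimal Python sketch of the naive route (summing per gene); it is not tximport, which is an R package and is not shown here. It assumes a tab-delimited salmon-style table with the columns from the question, and that the gene symbol is the sixth `|`-separated field of `Name`, as in the rows posted above. The function name `sum_per_gene` and the parameter `symbol_field` are made up for this example.

```python
import csv
import io
from collections import defaultdict


def sum_per_gene(handle, symbol_field=5):
    """Sum TPM and NumReads over all transcripts that share a gene symbol.

    Assumes a tab-delimited table with Name, TPM and NumReads columns, and
    that the gene symbol is field number `symbol_field` (0-based) of the
    '|'-separated Name column. Both are assumptions based on the question.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    totals = defaultdict(lambda: {"TPM": 0.0, "NumReads": 0.0})
    for row in reader:
        symbol = row["Name"].split("|")[symbol_field]
        totals[symbol]["TPM"] += float(row["TPM"])
        totals[symbol]["NumReads"] += float(row["NumReads"])
    return dict(totals)


# A few rows copied from the question, rebuilt as tab-delimited text.
rows = [
    ["Name", "Length", "EffectiveLength", "TPM", "NumReads"],
    ["ENSMUST00000162897.1|ENSMUSG00000051951.5|OTTMUSG00000026353.2|"
     "OTTMUST00000086625.1|Xkr4-003|Xkr4|4153|processed_transcript|",
     "4153", "3904", "0", "0"],
    ["ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|"
     "OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|protein_coding|",
     "3634", "3385", "0", "0"],
    ["ENSMUST00000027035.7|ENSMUSG00000025902.11|OTTMUSG00000050014.6|"
     "OTTMUST00000127245.2|Sox17-001|Sox17|3127|protein_coding|",
     "3127", "2878", "0", "0"],
    ["ENSMUST00000116652.5|ENSMUSG00000025902.11|OTTMUSG00000050014.6|"
     "OTTMUST00000127246.1|Sox17-002|Sox17|1512|protein_coding|",
     "1512", "1263", "0.0111391", "1"],
    ["ENSMUST00000192505.1|ENSMUSG00000025902.11|OTTMUSG00000050014.6|"
     "OTTMUST00000127268.1|Sox17-008|Sox17|1148|retained_intron|",
     "1148", "899", "0", "0"],
]
example = "\n".join("\t".join(r) for r in rows)

gene_totals = sum_per_gene(io.StringIO(example))

# One line per gene symbol: symbol, summed TPM, summed NumReads.
for symbol, values in gene_totals.items():
    print(symbol, values["TPM"], values["NumReads"], sep="\t")
```

On a real file, the same function could be called with `open("quant.sf", newline="")` in place of the `io.StringIO` object. Summing, rather than averaging or keeping only protein-coding transcripts, keeps all of the expression assigned to a gene; grouping on the gene ID field instead of the symbol would be a reasonable variation if symbols are not unique.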