I'm working with RNA-seq data that has multiple entries per gene symbol (these appear to be alternatively spliced gene transcripts) like this:
Name Length EffectiveLength TPM NumReads
ENSMUST00000162897.1|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000086625.1|Xkr4-003|Xkr4|4153|processed_transcript| 4153 3904 0 0
ENSMUST00000159265.1|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000086624.1|Xkr4-002|Xkr4|2989|processed_transcript| 2989 2740 0 0
ENSMUST00000070533.4|ENSMUSG00000051951.5|OTTMUSG00000026353.2|OTTMUST00000065166.1|Xkr4-001|Xkr4|3634|protein_coding| 3634 3385 0 0
The software I'm using (called PECA) only looks at the gene symbol and the TPM, so it sees the gene symbol Xkr4 several times and may assign multiple values to it. I'm wondering if there is a good way to simplify this file so there is only one value per gene symbol.
Could I just remove the transcripts that aren't protein coding? Or take the sum or average of the TPMs across all the transcript variants of a given gene?
Thanks for your advice.
Actually, further down it seems there are multiple protein-coding transcripts per gene, so filtering on protein_coding alone would not leave one entry per gene symbol.
Is there a way to consolidate these? I've seen RNA-seq data sets with just one TPM per gene symbol, but I'm not sure how those files consolidate multiple transcripts.
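To make the "sum" option concrete, here is a minimal sketch of what I have in mind. It assumes the file is tab-delimited with the header shown above, and that the gene symbol is always the sixth pipe-separated field of Name (as in Xkr4 above). The function name sum_tpm_per_gene and the GeneA/GeneB rows with their TPM values are made up for illustration; I have not confirmed that summing is the right way to collapse transcripts for PECA.

```python
import csv
import io
from collections import defaultdict


def sum_tpm_per_gene(lines, symbol_field=5):
    """Sum transcript-level TPM into one value per gene symbol.

    `lines` is any iterable of tab-delimited text lines with a header
    containing at least `Name` and `TPM`. `symbol_field` is the 0-based
    position of the gene symbol inside the pipe-delimited Name column
    (assumed to be 5, matching the example rows in this post).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    totals = defaultdict(float)
    for row in reader:
        symbol = row["Name"].split("|")[symbol_field]
        totals[symbol] += float(row["TPM"])
    return dict(totals)


# Hypothetical input: identifiers and TPM values are invented for illustration.
example = io.StringIO(
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1|g1|og1|ot1|GeneA-001|GeneA|1000|protein_coding|\t1000\t800\t1.5\t10\n"
    "tx2|g1|og1|ot2|GeneA-002|GeneA|900|processed_transcript|\t900\t700\t2.5\t12\n"
    "tx3|g2|og2|ot3|GeneB-001|GeneB|500|protein_coding|\t500\t300\t0.5\t2\n"
)

gene_tpm = sum_tpm_per_gene(example)
for symbol, tpm in sorted(gene_tpm.items()):
    print(symbol, tpm)
```

With a real file I would pass an open file handle instead of the StringIO object, e.g. `with open(path) as fh: sum_tpm_per_gene(fh)`, where `path` is wherever the quantification output lives.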