Hello People,
The following is a problem that I am trying really hard for long time to resolve but can't come up to reasonable solution.
I execute the new tuxedo pipeline to analyze my RNA-seq samples. [Hisat2 -> StringTie2 -> TxImport -> Deseq2]
Now I extract the TPM counts for each transcript from TxImport's object txi$abundance
then total them for each gene. I find that a few genes are in duplicate
Gene_ID Treated1 Treated2 Treated3 Contrl01 Contrl02 Contrl03
ENSG00000233639:PANTR1 0.177826 0.023043 0.270347 0.027134 0.116557 0.165914
ENSG00000233639:PANTR1 0 0 0 0 0 0
ENSG00000235652:AL356599.1 1.218245 0.161862 1.835312 0.662362 0.355804 0.938479
ENSG00000235652:AL356599.1 0 0 0 0 0 0
ENSG00000242759:LINC00882 0.249292 0.367449 0.152916 0.162572 0.096571 0.120887
ENSG00000242759:LINC00882 0 0 0 0 0 0
ENSG00000253853:AC246817.1 1.071109 1.144176 0.726447 1.027518 0.747923 0.72971
ENSG00000253853:AC246817.1 0.197055 0.056572 0.085629 0.214114 0.139897 0.172577
ENSG00000257761:AC078860.2 0.888787 0.967599 0.529618 4.099847 3.220867 2.677969
ENSG00000257761:AC078860.2 0.231985 0.190595 0.022277 0.052071 0.021828 0.070146
Well this might not be a big issue, because I can manually remove them as they are just counts.
Now I perform Deseq's analysis on MSTRG IDs...but on extracting results
, I convert the MSTRG IDs to gene names and the problem starts
GeneID Gene_names baseMean log2FoldChange P-Value
MSTRG.10967 ENSG00000054654:SYNE2 5950.04530060011 0.012667982845789 0.959394280265825
MSTRG.10968 ENSG00000054654:SYNE2 1.41078707122937 -4.23070771833164 0.070293522379558
MSTRG.10175 ENSG00000068650:ATP11A 1833.81055560052 1.56556217491457 8.77220310293335E-10
MSTRG.10176 ENSG00000068650:ATP11A 4.61787244133165 -2.5207751366497 0.038542082326133
MSTRG.30937 ENSG00000070614:NDST1 658.6970054024 1.41142315136311 3.01295144450564E-14
MSTRG.30938 ENSG00000070614:NDST1 1.56573096638691 0.033948989808379 0.985488588961825
MSTRG.11506 ENSG00000100865:CINP 439.00088940909 0.091933520534186 0.786005173471876
MSTRG.11508 ENSG00000100865:CINP 5.63935284665922 -2.05575795178324 0.038659140045149
MSTRG.17621 ENSG00000101558:VAPA 2059.87434998168 0.005182706679134 0.982051464465322
MSTRG.17622 ENSG00000101558:VAPA 3.36185575110659 -2.37245905503554 0.090177894161586
MSTRG.27933 ENSG00000125388:GRK4 148.13570263131 0.922617617272806 0.004572310096685
MSTRG.27935 ENSG00000125388:GRK4 20.9983000186827 -2.74804985745791 2.69737977243546E-05
MSTRG.25158 ENSG00000128254:C22orf24 1.89318586903034 2.02010767441085 0.382214666408296
MSTRG.25159 ENSG00000128254:C22orf24 10.4722908231843 -1.6212485675741 0.026702940921691
MSTRG.15380 ENSG00000141510:TP53 2011.12079411078 1.01082518533637 6.02405522650678E-06
MSTRG.15381 ENSG00000141510:TP53 40.6951300890702 0.578072468755891 0.262835426053252
MSTRG.6817 ENSG00000149256:TENM4 902.877209604941 1.41048092112734 1.2491105757664E-05
MSTRG.6819 ENSG00000149256:TENM4 0.433135937501165 -2.51416321098229 0.532866041255987
MSTRG.4173 ENSG00000150867:PIP4K2A 715.089818003131 -0.452948374471478 0.099058893543589
MSTRG.4174 ENSG00000150867:PIP4K2A 1.59372578144667 -4.40786190776114 0.093908906761754
How can we resolve these redundancy events (as these involve some very important genes)? What is the popular practice?
Thanks in advance
As a comment, if you want gene-level differential analysis then simply skip this unnecessarily-complicated pipeline and use something like
salmon
for quantification of reads against a transcriptome, followed by tximport. Unless you specifically look for novel transcripts you do not need stringtie. It just makes the analysis more complicated, creating issues like this. Don't let yourself be impressed by the fact that it was published in a notable journal, for standard gene-level DE analysis this pipeline is imho not recommended. salmon-tximport is what I would recommend.