We're analyzing RNAseq data with a pipeline consisting of Salmon, tximeta, and DESeq2.
We have a multi-factorial experimental design, and the experiment was performed on cell lines.
One thing that surprised us is that, in the results, we observe many gene polymorphisms.
For example, for the gene NLRP2 we observed multiple entries, each associated with a unique Ensembl ID: ENSG00000022556, ENSG00000275082, ENSG00000275843, etc.
baseMean log2FoldChange pvalue padj gene CTRL_1 CTRL_2 A_1 A_2 B_1 B_2 A+B_1 A+B_2
ENSG00000022556 559.2711127 -1.709470173 5.51E-09 2.16E-07 NLRP2 33.063154 17.498608 23.790824 28.562371 6.421092 6.755627 29.858583 23.977158
ENSG00000275082 349.6580809 2.406888875 0.592471935 0.817837758 NLRP2 0 7.920205 10.814798 0 18.640884 18.543885 0 3.545411
My question is: how do we interpret data like this, and how should we deal with this kind of situation? Can we add or average the different entries associated with the same gene?
I think the problem is that you simply conducted a transcript-level DGE analysis. What kind of organism are you using? What reference did you use? How did you annotate your transcripts? Maybe you should use tximport to conduct gene-level DGE analyses, as recommended in this paper. tximport basically needs a two-column data frame with transcript ID and gene ID. It then summarises read counts per gene prior to the DGE analysis in DESeq2. I would not recommend summarising counts manually.
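To make the summarisation step concrete, here is a minimal Python sketch of the idea only: transcript-level counts are summed into gene-level counts through a two-column transcript-to-gene mapping. The transcript IDs, gene IDs, counts and the function name are all made up for illustration. This is not tximport's implementation, which does more than a plain sum (for example, it takes transcript lengths into account).

```python
from collections import defaultdict

# Hypothetical two-column mapping (transcript ID -> gene ID); IDs are made up.
tx2gene = {
    "TX_A1": "GENE_A",
    "TX_A2": "GENE_A",
    "TX_B1": "GENE_B",
}

# Hypothetical transcript-level counts for a single sample.
tx_counts = {"TX_A1": 120.0, "TX_A2": 30.5, "TX_B1": 8.0}


def summarise_to_gene(tx_counts, tx2gene):
    """Sum transcript-level counts per gene; unmapped transcripts are skipped."""
    gene_counts = defaultdict(float)
    for tx_id, count in tx_counts.items():
        gene_id = tx2gene.get(tx_id)
        if gene_id is not None:
            gene_counts[gene_id] += count
    return dict(gene_counts)


gene_counts = summarise_to_gene(tx_counts, tx2gene)
print(gene_counts)
```

In a real analysis this step is better left to tximport or tximeta rather than done by hand.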
This is unrelated to transcripts; OP is already aggregating to gene level via tximeta. There is ambiguity in the Ensembl annotations between gene_id (the Ensembl identifier) and gene_name (the "trivial" gene name, HGNC). Several Ensembl IDs are mapped to two HGNC names and some to no HGNC name at all. There is no universal rule for handling this. Sometimes people simply select one of the two (or many) at random, or choose the one with the higher average expression, or simply keep all. How many of those ambiguous calls do you have?
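As a rough sketch of the "higher average expression" option, the Python snippet below groups result rows by gene name, counts the names that appear under more than one Ensembl ID, and keeps the ID with the highest baseMean for each name. The two rows are copied from the table in the question. The variable and function names are illustrative assumptions and are not part of DESeq2 or tximeta.

```python
from collections import defaultdict

# (Ensembl ID, gene name, baseMean) rows copied from the results in the question.
results = [
    ("ENSG00000022556", "NLRP2", 559.2711127),
    ("ENSG00000275082", "NLRP2", 349.6580809),
]


def group_by_gene_name(rows):
    """Map each gene name to its (Ensembl ID, baseMean) entries; rows without a name are skipped."""
    groups = defaultdict(list)
    for ensembl_id, gene_name, base_mean in rows:
        if gene_name:
            groups[gene_name].append((ensembl_id, base_mean))
    return groups


groups = group_by_gene_name(results)

# Gene names that occur under more than one Ensembl ID (the ambiguous calls).
ambiguous = {
    name: [ensembl_id for ensembl_id, _ in entries]
    for name, entries in groups.items()
    if len(entries) > 1
}

# For every gene name, keep the Ensembl ID with the highest baseMean.
best = {
    name: max(entries, key=lambda entry: entry[1])[0]
    for name, entries in groups.items()
}

print(len(ambiguous), "ambiguous gene name(s)")
print(best)
```

Selecting by baseMean is only one of the conventions mentioned above; keeping all rows and reporting the Ensembl ID next to the gene name avoids dropping anything.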
@ponganta thanks for the input, and @ATpoint thanks a lot for clearing things up.
There are 490 genes with calls to multiple genomic loci. The number of ambiguous calls per gene varies from 2 to 7.