Question

Unexpected gene polymorphism using Salmon-tximeta-DESeq2

0

3.9 years ago

raywong.chn ▴ 10

We're analyzing RNAseq data with a pipeline consisting of Salmon, tximeta, and DESeq2.

We have a multi-factorial experimental design, and the experiment was performed on cell lines.

One thing that surprised us is that we observe many gene polymorphisms in the results.

For example, for the gene NLRP2 we observed multiple entries, each with its own unique Ensembl ID: ENSG00000022556, ENSG00000275082, ENSG00000275843, etc.

gene_id          baseMean     log2FoldChange  pvalue       padj         gene   CTRL_1     CTRL_2     A_1        A_2        B_1        B_2        A+B_1      A+B_2
ENSG00000022556  559.2711127  -1.709470173    5.51E-09     2.16E-07     NLRP2  33.063154  17.498608  23.790824  28.562371  6.421092   6.755627   29.858583  23.977158
ENSG00000275082  349.6580809  2.406888875     0.592471935  0.817837758  NLRP2  0          7.920205   10.814798  0          18.640884  18.543885  0          3.545411

How should we interpret data like this, and how should we handle it? Can we sum or average the different entries associated with the same gene?

RNA-Seq alignment • 1.0k views

ADD COMMENT • link 3.9 years ago by raywong.chn ▴ 10

1

I think the problem is simply that you ran a transcript-level DGE analysis. Which organism are you working with? Which reference did you use? How did you annotate your transcripts? You may want to use tximport to run gene-level DGE analyses, as recommended in this paper. tximport basically needs a two-column data frame with transcript ID and gene ID. It then summarises read counts per gene before the DGE analysis in DESeq2. I would not recommend summarising counts manually.
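To make the idea of a transcript-to-gene table concrete, here is a minimal sketch in plain Python. It is not tximport itself (which is an R package and also handles abundances and transcript lengths); it only illustrates summing counts per gene. All IDs and counts are made up for illustration.

```python
from collections import defaultdict

# Hypothetical transcript-to-gene map: the same information as the
# two-column (transcript ID, gene ID) table described above.
tx2gene = {
    "tx_A1": "gene_A",
    "tx_A2": "gene_A",
    "tx_B1": "gene_B",
}

# Hypothetical estimated counts per transcript for a single sample.
tx_counts = {"tx_A1": 10.5, "tx_A2": 4.5, "tx_B1": 7.0}


def summarise_to_gene(tx_counts, tx2gene):
    """Sum transcript-level counts into gene-level counts."""
    gene_counts = defaultdict(float)
    for tx_id, count in tx_counts.items():
        gene_counts[tx2gene[tx_id]] += count
    return dict(gene_counts)


gene_counts = summarise_to_gene(tx_counts, tx2gene)
print(gene_counts)
```

Note that the grouping key is the gene ID, so two Ensembl gene IDs that share a gene name stay as two separate rows after this step.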

ADD REPLY • link 3.9 years ago by ponganta ▴ 590

4

This is unrelated to transcripts; OP is already aggregating to gene level via tximeta. The ambiguity lies in the Ensembl annotation, between gene_id (the Ensembl identifier) and gene_name (the "trivial" gene name, HGNC). Several Ensembl IDs are mapped to two HGNC names and some to no HGNC name at all. There is no universal rule for this. Sometimes people simply pick one of the two (or many) at random, or choose the one with the higher average expression, or simply keep all of them. How many of those ambiguous calls do you have?
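As a rough sketch of how one could count the ambiguous names and apply the "keep the higher-expressed entry" option, assuming the results are available as (Ensembl ID, gene name, baseMean) tuples. The two rows are taken from the table in the question; the variable names are only illustrative.

```python
from collections import defaultdict

# Two rows from the question: (Ensembl ID, gene name, baseMean).
results = [
    ("ENSG00000022556", "NLRP2", 559.2711127),
    ("ENSG00000275082", "NLRP2", 349.6580809),
]

# Group the Ensembl IDs by gene name.
ids_by_name = defaultdict(list)
for gene_id, gene_name, base_mean in results:
    ids_by_name[gene_name].append((gene_id, base_mean))

# Gene names that are attached to more than one Ensembl ID.
ambiguous = {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}
print(len(ambiguous), "ambiguous gene name(s)")

# One of the options mentioned above: per gene name, keep the ID
# with the highest baseMean. This is a convention, not a rule.
best_id = {
    name: max(ids, key=lambda pair: pair[1])[0]
    for name, ids in ids_by_name.items()
}
print(best_id)
```

Whichever option is chosen, the selection happens only at the labelling stage; the statistics are still computed per Ensembl ID.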

ADD REPLY • link 3.9 years ago by ATpoint 85k

0

@ponganta thanks for the input, and @ATpoint thanks a lot for clearing things up.

There are 490 genes with calls to multiple genomic loci. The number of ambiguous calls per gene ranges from 2 to 7.

ADD REPLY • link 3.9 years ago by raywong.chn ▴ 10

score 1 · Answer 1 · 2021-01-13

1

3.9 years ago

swbarnes2 14k

The annotation is what it is. Your first example is located on a real chromosome, the second is on a scaffold, FWIW.

Just keep the Ensembl IDs as the primary identifier all the way through. They are unique.

ADD COMMENT • link 3.9 years ago by swbarnes2 14k

0

@swbarnes2 Yes, you're right. I guess what I'm really concerned about is how to interpret this at the biological level. If we believe these gene polymorphisms to be bona fide mappings, how did they arise in a cultured cell line?

ADD REPLY • link 3.9 years ago by raywong.chn ▴ 10

score 1 · Answer 2 · 2021-01-21

1

3.9 years ago

raywong.chn ▴ 10

Problem solved.

It turns out that this was caused by building the Salmon index from the Ensembl genome FASTA, which contains plenty of duplicated genes on haplotype chromosomes.

Switching to GENCODE should resolve the issue, as suggested in this thread: https://support.bioconductor.org/p/p134094/#p134255
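For anyone who wants to see what "duplicates on haplotype chromosomes" means in practice, below is a small sketch that drops FASTA records not located on a primary chromosome. This is not the fix from the linked thread (which is to switch to GENCODE) and the header layout is an assumption: the code expects a location field of the form `chromosome:ASSEMBLY:REGION:...` or `scaffold:ASSEMBLY:REGION:...`, and treats region names starting with `CHR_` as alternate sequences. The records and region names are hypothetical; check the headers in your own file before relying on any rule like this.

```python
def is_primary(header):
    """Return True if the header points to a primary-assembly chromosome.

    Assumed (hypothetical) layout of a header line:
    TRANSCRIPT_ID cdna chromosome:ASSEMBLY:REGION:START:END:STRAND gene:GENE_ID
    """
    for field in header.split():
        parts = field.split(":")
        if parts[0] in ("chromosome", "scaffold") and len(parts) >= 3:
            # Assumption: alternate/haplotype regions are named CHR_*.
            return parts[0] == "chromosome" and not parts[2].startswith("CHR_")
    return False  # no recognisable location field: drop the record


def filter_fasta(lines):
    """Keep only records whose header passes is_primary()."""
    kept, keep = [], False
    for line in lines:
        if line.startswith(">"):
            keep = is_primary(line[1:])
        if keep:
            kept.append(line)
    return kept


# Made-up records for illustration only.
fasta = [
    ">tx1 cdna chromosome:GRCh38:19:100:200:1 gene:g1",
    "ACGT",
    ">tx2 cdna chromosome:GRCh38:CHR_ALT_EXAMPLE:100:200:1 gene:g2",
    "ACGT",
    ">tx3 cdna scaffold:GRCh38:SCAFFOLD_EXAMPLE:1:50:1 gene:g3",
    "GGCC",
]
primary_only = filter_fasta(fasta)
print("\n".join(primary_only))
```

In this toy input only the first record survives, since the other two sit on an alternate region and a scaffold respectively.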

ADD COMMENT • link 3.9 years ago by raywong.chn ▴ 10