Question

differential expression- coding genes

0

Entering edit mode

3.8 years ago

Rob ▴ 170

hello friends,

I performed differential expression analysis with DESeq2 package. my data is HT-seq count data and protein coding genes. in the result TUNAR gene is highly significantly expressed. TUNAR is a non-coding genes. I do not understand why this is in my result because I analyzed only coding genes. for sanity check, I repeated the analysis again and again and the result is the same.

How can I interpret this?

Thanks

rna-seq • 2.8k views

ADD COMMENT • link updated 3.8 years ago by Kevin Blighe 88k • written 3.8 years ago by Rob ▴ 170

1

Entering edit mode

So Rob... Now that you know that it is hard to define TUNAR as coding or not you have to tell us why it is so important for you to know!!!

ADD REPLY • link 3.8 years ago by Fabio Marroni ★ 3.0k

0

Entering edit mode

I assume you are working on human. TUNAR is a long non-coding RNA which means that the gene is transcribed as RNA, but not translated to protein, so it is perfectly normal that you identify it.

EDIT: I just read that you stated that your data are protein coding genes. So, you should double check the input file, TUNAR is classified as a non-coding RNA.

You may also find useful to consult with biologists in your lab.

ADD REPLY • link 3.8 years ago by Fabio Marroni ★ 3.0k

0

Entering edit mode

Thanks Fabio Marroni Actually I checked many times, but TUNAR is the top gene every time.

ADD REPLY • link 3.8 years ago by Rob ▴ 170

score 4 · Accepted Answer · 2021-02-11

4

Entering edit mode

3.8 years ago

ATpoint 85k

If it is in the results and annotated as lncRNA then you did not properly filter for protein-coding genes, simple as that. How did you do the filtering?

ADD COMMENT • link 3.8 years ago by ATpoint 85k

0

Entering edit mode

Hi ATpoint Thanks for responding.

From Biomart the TUNAR gene annotated as "protein_coding" that is weird. I used this code to filter and just keep the coding genes:

mart <- useMart(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
mrna_attributes <- getBM(attributes=c("external_gene_name",
                                      "ensembl_gene_id",
                                      "gene_biotype"),
                         filters = c("ensembl_gene_id"),
                         values = rownames(rna),
                         mart = mart)
mrna_attributes <- mrna_attributes[which(mrna_attributes$gene_biotype == "protein_coding"),]
rna <- rna[which(rownames(rna) %in% mrna_attributes$ensembl_gene_id),]

write.csv(rna,"rnaCodinggenes.csv")

Is there any problem with this?

ADD REPLY • link 3.8 years ago by Rob ▴ 170

0

Entering edit mode

No, this looks good indeed. TUNAR here is indeed annotated as a protein_coding gene so either the annotation is wrong (unlikely) or you would need to check your source stating that this gene was non-coding.

ADD REPLY • link 3.8 years ago by ATpoint 85k

score 4 · Accepted Answer · 2021-02-11

Hey,

TUNAR is a protein coding gene:

biomaRt

require(biomaRt)
mart <- useMart(biomart = 'ensembl', dataset = 'hsapiens_gene_ensembl')
mrna_attributes <- getBM(
  attributes = c(
    'external_gene_name',
    'ensembl_gene_id,
    'gene_biotype'),
  mart = mart)

grep('TUNAR', mrna_attributes$external_gene_name)
[1] 34387

mrna_attributes[grep('TUNAR', mrna_attributes$external_gene_name),]
      external_gene_name ensembl_gene_id   gene_biotype
34387              TUNAR ENSG00000250366 protein_coding

Ensembl's record:

http://useast.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG00000250366;r=14:95876392-95925663

The latest GENCODE:

zcat gencode.v36.transcripts.fa.gz | grep -e "ENSG00000250366"
>ENST00000504119.1|ENSG00000250366.3|OTTHUMG00000171392|OTTHUMT00000413256.1|TUNAR-202|TUNAR|2930|protein_coding|
>ENST00000678517.1|ENSG00000250366.3|OTTHUMG00000171392|-|TUNAR-204|TUNAR|3373|protein_coding|
>ENST00000503525.2|ENSG00000250366.3|OTTHUMG00000171392|OTTHUMT00000413257.1|TUNAR-201|TUNAR|3197|protein_coding|
>ENST00000554321.1|ENSG00000250366.3|OTTHUMG00000171392|OTTHUMT00000413258.1|TUNAR-203|TUNAR|767|protein_coding|

zcat gencode.v36.transcripts.fa.gz | grep -e "TUNAR"
>ENST00000504119.1|ENSG00000250366.3|OTTHUMG00000171392|OTTHUMT00000413256.1|TUNAR-202|TUNAR|2930|protein_coding|
>ENST00000678517.1|ENSG00000250366.3|OTTHUMG00000171392|-|TUNAR-204|TUNAR|3373|protein_coding|
>ENST00000503525.2|ENSG00000250366.3|OTTHUMG00000171392|OTTHUMT00000413257.1|TUNAR-201|TUNAR|3197|protein_coding|
>ENST00000554321.1|ENSG00000250366.3|OTTHUMG00000171392|OTTHUMT00000413258.1|TUNAR-203|TUNAR|767|protein_coding|

Kevin

score 3 · Accepted Answer · 2021-02-11

3

Entering edit mode

3.8 years ago

EagleEye 7.6k

Hi,

So far all the studies (publications) are reporting it as long noncoding RNA. I just made a quick analysis to check TUNAR's coding probability and here are the results.

ADD COMMENT • link 3.8 years ago by EagleEye 7.6k

0

Entering edit mode

Nice, EagleEye, Can you show me how you did the analysis? then I can do it with other genes. I have no idea about what you did.

ADD REPLY • link 3.8 years ago by Rob ▴ 170

3

Entering edit mode

You just have to input the sequence of RNA to the CPC tool. It is just quick and easy way to do it. I always use two or three tools like CPC/phyloCF to confirm the coding potential of the transcripts/isoforms after DE analysis (before proceeding to further downstream analysis). Because the genome annotation is becoming complicated day-by-day. Even some protein coding genes have noncoding isoforms and in rare case other way around too.

ADD REPLY • link 3.8 years ago by EagleEye 7.6k

0

Entering edit mode

Thanks EagleEye, Then for pathway and other analyses, how you select the genes from DE results? what percent of being coding gene is satisfying to put it into downstream analyses? Also, how can have access to the sequence of RNA to put that into cpc tool?

ADD REPLY • link 3.8 years ago by Rob ▴ 170

1

Entering edit mode

Then for pathway and other analyses, how you select the genes from DE results? what percent of being coding gene is satisfying to put it into downstream analyses?

Let's say particular gene X has three isoforms/transcripts, Y.1, Y.2 and Y.3. From your differential expression analysis you got particular isoform/transcript Y.2 is differentially expressed but when you check the coding potential it falls in noncoding category according to CPC or phyloCF etc.,. I would eliminate gene X from downstream analysis. In other case if you get at least one of the isoform or transcript Y.1 to be protein coding, I would use that gene X for downstream analysis.

Also, how can have access to the sequence of RNA to put that into cpc tool?

You can extract the transcript sequences using Ensembl BioMart.

ADD REPLY • link 3.8 years ago by EagleEye 7.6k

0

Entering edit mode

Great! Thanks a lot.

ADD REPLY • link 3.8 years ago by Rob ▴ 170