Question

Gene names in Ballgown differential expression analysis

11

8.3 years ago

Albert ▴ 180

Hi there,

I just performed an RNA-seq experiment, for which I used HISAT2 for the alignment, StringTie for the assembly and the R package Ballgown for the differential expression (DE) analysis (protocol published here: http://www.nature.com/nprot/journal/v11/n9/full/nprot.2016.095.html ).

The default output of the DE analysis is a table which looks pretty standard, with a transcript ID, a fold change value, a p-value and a q-value. The transcript ID corresponds to either an Ensembl ID or an MSTRG ID, which is basically a number that is generated during the assembly of the transcriptome. However, the gene name/symbol corresponding to each feature does not appear in the table.

The protocol I followed provides a command to add the gene names to the results table:

results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_chrX_filt), geneIDs=ballgown::geneIDs(bg_chrX_filt), results_transcripts)

This works when the DE analysis is performed at the transcript level, but not when it is performed at the gene level. The reason is that the table resulting from a gene-level analysis has far fewer rows than the original reference, which links every transcript ID to a gene name, so data.frame() cannot combine the two because their lengths differ.
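To make the length mismatch concrete, here is a minimal sketch on toy data. The vectors and values below are hypothetical stand-ins, not real Ballgown output; the point is only that data.frame() will not combine a per-transcript vector with a shorter per-gene table.

```r
# Toy stand-ins (hypothetical values), not real Ballgown output.
# One gene name per transcript: 5 entries.
transcript_gene_names <- c("GeneA", "GeneA", "GeneB", "GeneB", "GeneC")

# A gene-level results table: only 3 rows.
results_genes <- data.frame(
  id   = c("MSTRG.1", "MSTRG.2", "MSTRG.3"),
  pval = c(0.01, 0.20, 0.03),
  stringsAsFactors = FALSE
)

# data.frame() cannot line up 5 names with a 3-row table,
# so we capture the error message instead of stopping.
msg <- tryCatch(
  {
    data.frame(geneNames = transcript_gene_names, results_genes)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

So the gene names first have to be reduced to one entry per gene ID before they can be attached to the gene-level table.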

I was wondering if anyone has had the same issue before, and if there's some kind of straightforward solution for it. It could in principle be done manually, but it would be a nightmare to add ~20k gene names that way! :S

Thanks in advance for your help!

Albert

RNA-Seq ballgown • 14k views

ADD COMMENT • link updated 7.8 years ago by mcalleja ▴ 110 • written 8.3 years ago by Albert ▴ 180

0

I'm having this problem too. I fixed the annotation the same way as Albert, but I'm still trying to figure out how to do everything at the gene level.

ADD REPLY • link 8.3 years ago by marina.v.yurieva ▴ 580

0

Hi, has anyone figured out how to add the gene names at the gene level?

ADD REPLY • link 8.2 years ago by zhifeiluo1208 • 0

score 9 · Answer 1 · 2017-04-24

9

7.8 years ago

mcalleja ▴ 110

Hi!

I want to contribute the answer that AlyssaFrazee gave on Ballgown's GitHub for retrieving gene names (it's not intuitive, but it is a good workaround). To add geneNames to the result of stattest run with feature="gene":

indices <- match(results_genes$id, texpr(bg, 'all')$gene_id)
gene_names_for_result <- texpr(bg, 'all')$gene_name[indices]
results_genes <- data.frame(geneNames=gene_names_for_result, results_genes)
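For anyone who wants to see what the match() step does without a ballgown object at hand, here is a small sketch on toy data. The table transcript_table is a hypothetical stand-in for texpr(bg, 'all'), and all IDs and names are made up.

```r
# Toy stand-in for texpr(bg, 'all'): one row per transcript (hypothetical values).
transcript_table <- data.frame(
  t_name    = c("tx1", "tx2", "tx3", "tx4"),
  gene_id   = c("MSTRG.1", "MSTRG.1", "MSTRG.2", "MSTRG.3"),
  gene_name = c("GeneA", "GeneA", "GeneB", "."),
  stringsAsFactors = FALSE
)

# Toy stand-in for a gene-level stattest() result: one row per gene.
results_genes <- data.frame(
  id   = c("MSTRG.3", "MSTRG.1", "MSTRG.2"),
  pval = c(0.50, 0.01, 0.04),
  stringsAsFactors = FALSE
)

# match() returns, for each gene id, the row of its FIRST transcript in the table.
indices <- match(results_genes$id, transcript_table$gene_id)
gene_names_for_result <- transcript_table$gene_name[indices]

# Both pieces now have one entry per gene, so data.frame() can combine them.
results_genes <- data.frame(geneNames = gene_names_for_result, results_genes,
                            stringsAsFactors = FALSE)
print(results_genes)
```

Note that match() only picks the first transcript of each gene, so if that transcript has no name (".") or the gene ID maps to several names, only that first entry ends up in the table.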

ADD COMMENT • link 7.8 years ago by mcalleja ▴ 110

0

Play around with these commands and you will find the answers.

ADD REPLY • link 5.8 years ago by ranjanmanishblue • 0

score 2 · Answer 2 · 2017-02-20

There are gene names (symbols) and gene IDs in results_transcripts, as shown. So you can map them to results_genes directly. Note that some IDs have more than one gene name.

library(dplyr)  # attaches the %>% pipe used below
# get the mapping between gene symbols and gene ids (unnamed "." entries dropped)
genename_ids <- dplyr::filter(results_transcripts, geneNames!=".") %>% dplyr::distinct(geneNames, geneIDs)
# join them by gene ids; an id with several names yields several rows
results_genes1 <- dplyr::left_join(results_genes, genename_ids, by=c("id"="geneIDs"))
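Since some IDs carry more than one gene name, a plain join can duplicate rows in the gene table. One possible variant, sketched here in base R on toy data, is to collapse the names into a single string per ID before merging. The ";" separator and all values below are my own assumptions for illustration.

```r
# Toy stand-ins (hypothetical values) for the two results tables.
results_transcripts <- data.frame(
  geneNames = c("GeneA", "GeneA2", ".", "GeneB", "."),
  geneIDs   = c("MSTRG.1", "MSTRG.1", "MSTRG.1", "MSTRG.2", "MSTRG.3"),
  stringsAsFactors = FALSE
)
results_genes <- data.frame(
  id   = c("MSTRG.1", "MSTRG.2", "MSTRG.3"),
  pval = c(0.01, 0.04, 0.50),
  stringsAsFactors = FALSE
)

# Keep only named transcripts and drop duplicate (name, id) pairs.
named <- unique(results_transcripts[results_transcripts$geneNames != ".",
                                    c("geneNames", "geneIDs")])

# Collapse all names of one gene id into a single ";"-separated string.
collapsed <- aggregate(
  geneNames ~ geneIDs, data = named,
  FUN = function(x) paste(sort(unique(x)), collapse = ";")
)

# Left join: every gene keeps exactly one row; ids without a name get NA.
results_genes1 <- merge(results_genes, collapsed,
                        by.x = "id", by.y = "geneIDs", all.x = TRUE)
print(results_genes1)
```

This keeps one row per gene, at the cost of a composite name for the ambiguous IDs, which you may prefer to inspect by hand.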

score 0 · Answer 3 · 2016-10-28

0

8.3 years ago

jerryzzy11 • 0

I am having the same problem here. If you get the answer elsewhere, please let me know. Thank you! BTW, do you know how to get the gene names for the MSTRG.xxxx.x IDs?

ADD COMMENT • link 8.3 years ago by jerryzzy11 • 0

2

Hi,

You can get a table with all the information in your ballgown object (bg) by running:

full_table <- texpr(bg , 'all')

The table will include: transcript ID, chromosome, strand, start position, end position, number of exons, length, gene ID, gene name, and the FPKM values for each sample.

Btw, I have noticed that sometimes the same gene id (MSTRG.xxxx) can correspond to more than one gene name, so don't fully rely on the gene_id as a unique identifier.
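A quick way to list which gene IDs are affected, sketched on toy data. Here full_table is a hypothetical stand-in with only the gene_id and gene_name columns; with real data you would use the table returned by texpr(bg, 'all') instead.

```r
# Toy stand-in for the full table (hypothetical values).
full_table <- data.frame(
  gene_id   = c("MSTRG.1", "MSTRG.1", "MSTRG.2", "MSTRG.2", "MSTRG.3"),
  gene_name = c("GeneA", "GeneB", "GeneC", ".", "GeneD"),
  stringsAsFactors = FALSE
)

# Ignore transcripts without a gene name (".").
named <- full_table[full_table$gene_name != ".", ]

# Count the distinct gene names seen under each gene id.
names_per_id <- tapply(named$gene_name, named$gene_id,
                       function(x) length(unique(x)))

# Gene ids that map to more than one gene name.
ambiguous_ids <- names(names_per_id)[names_per_id > 1]
print(ambiguous_ids)
```

Any ID in ambiguous_ids is worth checking by hand before assigning it a single gene name.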

Hope this helps!

ADD REPLY • link 8.3 years ago by Albert ▴ 180

score 0 · Answer 4 · 2017-04-05

It doesn't really work for me. I ran the following:

bg_filt = subset(bg,"rowVars(texpr(bg)) >5",genomesubset=TRUE)

results_transcripts = stattest(bg_filt, feature="transcript", covariate="Treatment", adjustvars = c("Age"), getFC=TRUE, meas="FPKM")

When I run the proposed command

genename_ids <- dplyr::filter(results_transcripts, geneNames!=".") %>% dplyr::distinct(geneNames, geneIDs)

I got the following message: Error in eval(expr, envir, enclos) : comparison (2) is possible only for atomic and list types
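One possible explanation, offered as an assumption rather than a confirmed diagnosis: the table returned by stattest has no geneNames or geneIDs columns until the data.frame(...) step from the protocol has been run. If the column is missing and ballgown is loaded, the name geneNames inside filter() may resolve to the ballgown function of that name, and comparing a function with a string raises this kind of error. The sketch below reproduces that in base R with a stand-in function and a toy table (both hypothetical).

```r
# Stand-in for ballgown::geneNames, so the sketch runs without ballgown (assumption).
geneNames <- function(x) NULL

# Toy stand-in for a stattest() result that has NOT been annotated yet.
results_transcripts <- data.frame(id = 1:3, pval = c(0.01, 0.20, 0.50))

# Check whether the annotation columns are present before filtering on them.
has_names <- all(c("geneNames", "geneIDs") %in% colnames(results_transcripts))
print(has_names)

# Without the column, geneNames refers to the function; comparing it fails.
msg <- tryCatch(
  {
    geneNames != "."
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

If has_names is FALSE for your table, adding the geneNames and geneIDs columns to results_transcripts first (as in the command from the protocol) may be enough to make the filter step run.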