Question

How to add a specific column from related table

5.7 years ago

Redse ▴ 30

Hi,

I am new to R and RNA-Seq. I use StringTie to assemble my RNA-Seq data and Ballgown to identify DEGs. I have been trying to add data from a related table to a data.frame. Below are the steps I follow:

pheno_data = read.csv("Den.csv")

Below is the content of my Den.csv:

"ids","pgroup","type"   
"RS5206790","Res","IF"   
"RS5206791","Res","NIF"   
"RS5206792","Res","IF"   
"RS5206795","Res","NIF"  
"RS5206798","Res","IF"  
"RS5206799","Res","NIF"  
"RS5206828","Sus","IF"  
"RS5206829","Sus","NIF"  
"RS5206830","Sus","IF"  
"RS5206831","Sus","IF"  
"RS5206832","Sus","NIF"  
"RS5206833","Sus","NIF"

Then I merge it with the results generated by StringTie:

  bg_tb <- ballgown(dataDir = "../ATB/Ballgown", samplePattern = "RS", pData = pheno_data)

Then I filter out the low-abundance genes:

bg_tb_filt= subset(bg_tb, "rowVars(texpr(bg_tb)) > 1", genomesubset=TRUE)

and identify transcripts:

results_transcripts = stattest(bg_tb_filt, feature = "transcript", covariate = "pgroup", adjustvars = c("type"), getFC = TRUE, meas = "FPKM")

Then I add the gene information:

results_transcripts = data.frame(geneNames=ballgown::geneNames(bg_tb_filt), geneIDs=ballgown::geneIDs(bg_tb_filt), results_transcripts)

A sample of the results is as follows:

geneNames  geneIDs  feature  id  fc  pval  qval
MTND1P23  MSTRG.30  transcript  89  1.2628495  0.185639798  0.59743102  
MTND2P28  MSTRG.31  transcript  90  1.3679515  0.038550274  0.34349762  
MTCO1P12  MSTRG.32  transcript  91  1.2878662  0.102384645  0.50014745  
MTCO2P12  MSTRG.34  transcript  93  0.8824411  0.544662788  0.83330385  
AL6698317 MSTRG.35  transcript  116  1.1581505  0.268141448  0.67138119

The issue is that I need to add the "pgroup" data (from Den.csv) to the result above. Is this possible? I want to plot the genes according to pgroup, and I have been working on it for many days to no avail.

I would appreciate your kind help and advice. Thanks.

R RNA-Seq bioconductor ballgown • 1.4k views

ADD COMMENT • link 5.7 years ago by Redse ▴ 30

Can you provide a little more context? I.e., what do the columns of Den.csv represent (samples? genes?) It might also help to see the structure of the intermediate objects (e.g. str(results_transcripts), str(bg_tb), etc., after each step). Currently it is not clear to me how the genes are related to the entries of Den.csv.

ADD REPLY • link 5.7 years ago by Friederike 9.0k

Hi Friederike,

The first column (ids) of Den.csv represents the RNA-Seq samples. There are 12 samples altogether. The expressed genes/transcripts generated by StringTie came from these samples.

The structure of the objects is as follows:

str(results_transcripts)
'data.frame': 37502 obs. of 7 variables:
$ geneNames: Factor w/ 11391 levels ".","A1BG","A2M",..: 1 6161 6162 6115 6122 6111 6109 6125 1178 1 ...
$ geneIDs : Factor w/ 11995 levels "ENSG00000225154.2",..: 10786 8571 9017 9144 9216 9252 9252 9252 9570 5951 ...
$ feature : Factor w/ 1 level "transcript": 1 1 1 1 1 1 1 1 1 1 ...
$ id : Factor w/ 37502 levels "10000","10001",..: 24581 35387 35576 35757 36172 36355 36526 36680 2723 14362 ...
$ fc : num 1.109 1.263 1.368 1.288 0.882 ...
$ pval : num 0.552 0.1856 0.0386 0.1024 0.5447 ...
$ qval : num 0.837 0.597 0.343 0.5 0.833 ...

As for str(bg_tb_filt), the output is too large to post here, so I removed some of it and took the screenshot below:

https://ibb.co/9VwYz2Q

The Den.csv data can be seen under @indexes$pData.

Thanks.

ADD REPLY • link 5.7 years ago by Redse ▴ 30

I'm still a bit lost. This line seems to extract the logFC for comparing the two pgroups you have (Res, Sus): results_transcripts = stattest(bg_tb_filt, feature = "transcript", covariate = "pgroup", adjustvars = c("type"), getFC = TRUE, meas = "FPKM")

So, why would you want to add the individual experiment information to this since the logFC are the results of the comparison of multiple experiments? Can you perhaps sketch out the ideal table you'd want in the end?

ADD REPLY • link 5.7 years ago by Friederike 9.0k

It is because, in my current results, I do not know how to compare the differentially expressed transcripts between the two groups. For instance, in my results_transcripts above, based on the fc values, MTCO2P12 is expressed at a low level. How can I know whether it belongs to Res or Sus?

The table I had in mind includes an additional column indicating whether each transcript belongs to Res or Sus. For instance (Table 1):

pgroup geneNames geneIDs feature id fc pval qval
Res MTND1P23 MSTRG.30 transcript 89 1.2628495 0.185639798 0.59743102
Res MTND2P28 MSTRG.31 transcript 90 1.3679515 0.038550274 0.34349762
Res MTCO1P12 MSTRG.32 transcript 91 1.2878662 0.102384645 0.50014745
Sus MTCO2P12 MSTRG.34 transcript 93 0.8824411 0.544662788 0.83330385
Res AL6698317 MSTRG.35 transcript 116 1.1581505 0.268141448 0.67138119

However, it would be great if there were another way to get the results I want without generating a new table like Table 1 above.

Also, I need to draw a volcano plot for each pgroup (Res and Sus) individually, in order to compare the significance of differential expression against the fc in both Res and Sus.

thanks

ADD REPLY • link 5.7 years ago by Redse ▴ 30
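
A minimal sketch of one way to see which group a transcript is more highly expressed in, without adding a pgroup column to results_transcripts: average the FPKM values per pgroup and compare the means. Everything in the toy example below is made up for illustration (the FPKM values, the four-sample subset, and the names fpkm, group_means, higher_in and summary_tb are assumptions, not taken from the thread). With Ballgown, the matrix would presumably come from texpr(bg_tb_filt, meas = "FPKM"), and the sketch assumes its columns are in the same order as the rows of pheno_data.

```r
# Toy phenotype table mirroring the first two columns of Den.csv
# (only four of the twelve samples, for brevity).
pheno_data <- data.frame(
  ids    = c("RS5206790", "RS5206791", "RS5206828", "RS5206829"),
  pgroup = c("Res", "Res", "Sus", "Sus"),
  stringsAsFactors = FALSE
)

# Hypothetical FPKM matrix: rows are transcripts, columns are samples.
# The values are invented; the row names reuse two transcript ids from the post.
fpkm <- matrix(
  c(10, 12, 20, 22,
    30, 28, 15, 13),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("89", "93"), paste0("FPKM.", pheno_data$ids))
)

# Column indices of the samples in each group (assumes matching order).
cols_by_group <- split(seq_len(ncol(fpkm)), pheno_data$pgroup)

# Mean expression per group for every transcript.
group_means <- sapply(cols_by_group,
                      function(cols) rowMeans(fpkm[, cols, drop = FALSE]))

# Name of the group with the higher mean for each transcript.
higher_in <- colnames(group_means)[max.col(group_means, ties.method = "first")]

# One row per transcript: id, mean per group, and the group with the higher mean.
summary_tb <- data.frame(id = rownames(fpkm), group_means, higher_in)
print(summary_tb)
```

Since every transcript is measured in every sample, a table like summary_tb describes in which group the mean expression is higher rather than which group a transcript "belongs" to. It could be joined to results_transcripts on the id column with merge().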