Subset Function in Ballgown
5.1 years ago
Morgan S. ▴ 90

Hello,

I am using the HISAT, Stringtie, and Ballgown pipeline to do transcriptome expression analysis. So far I went through these steps.

bg=ballgown(dataDir=data_directory, samplePattern='1and2', meas='all')

bg_ifungis = ballgown(dataDir = data_directory, samplePattern = '1and2', pData=pheno_data)

bg_ifungis_filt = subset(bg_ifungis,"rowVars(texpr(bg_ifungis)) >1",genomesubset=TRUE)

results_transcripts = stattest(bg_ifungis_filt, feature="transcript",covariate="timepoint",adjustvars = c("location"), getFC=TRUE, meas="FPKM")

So here, I set adjustvars to the location column in my pData, which is either Earth or ISS. However, here I have compared all of the Earth samples to all of the ISS samples, and now I would like to analyze each location separately, on its own.

I first tried changing the results_transcripts code to this

results_transcripts_E = stattest(bg_ifungis_filt, feature="transcript",covariate="timepoint",adjustvars = c("location=Earth"), getFC=TRUE, meas="FPKM")

That did not work; it said Earth is not a valid covariate. Then I tried to subset the data using Ballgown's subset command:

bg_ifungis_Earth = subset(pheno_data,"pheno_data$location == Earth",genomesubset=FALSE)

> Error in subset.data.frame(pheno_data, "pheno_data$location == Earth",  : 
  'subset' must be logical

I tried variations of the above, but kept getting a similar error. Is there any way I can subset my data by location in Ballgown? Or am I going to have to re-do the Stringtie assemblies and everything so that Earth and ISS are treated separately?

I hope that makes sense!

Thanks in advance, Morgan

transcriptome differential expression • 2.2k views
4.4 years ago
zhhxu9 ▴ 20

To subset a particular group of samples, you need to do this:

bg_ifungis_Earth = subset(bg_ifungis_filt, "location == 'Earth'",genomesubset=FALSE)

Note that:
The first argument is the ballgown object you want to subset, not the pheno_data table.
The condition expression is wrapped in double quotes.
The column name location from the pData object can be used directly and does not need quotes.
The matching string Earth needs single quotes around it.
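The error in the question arises because pheno_data is a plain data frame, so the call goes to base R's subset.data.frame, which expects an unquoted logical expression rather than a string. Below is a minimal base-R sketch of that difference. It does not use Ballgown, the sample IDs and values in pheno_data are made up, and the last step is only an illustration of how a string condition can be evaluated against a table's columns, not a claim about Ballgown's internals.

```r
# Hypothetical phenotype table; IDs and values are invented for illustration.
pheno_data <- data.frame(
  ids       = c("s1", "s2", "s3", "s4"),
  location  = c("Earth", "ISS", "Earth", "ISS"),
  timepoint = c(1, 1, 2, 2),
  stringsAsFactors = FALSE
)

# A quoted condition is just a character string, so subset.data.frame
# rejects it; capture the error message instead of stopping.
msg <- tryCatch(
  subset(pheno_data, "location == 'Earth'"),
  error = function(e) conditionMessage(e)
)

# On a data frame, the condition must be an unquoted logical expression.
earth_pheno <- subset(pheno_data, location == "Earth")

# Illustration only: one way a string condition can be evaluated
# against the columns of a table to obtain a logical vector.
cond <- "location == 'Earth'"
keep <- eval(parse(text = cond), envir = pheno_data)
```

Here msg holds the error text from the failed call, earth_pheno holds only the Earth rows, and keep is the logical vector that the string condition describes.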

Hope this helps!

5.1 years ago
Mark ★ 1.6k

My apologies, I haven't used ballgown, so I'm mostly reading and guessing.

I skimmed the manual/tutorial quickly and noticed that you need to specify timecourse = TRUE when performing a time-series analysis. The other thing I noticed is that the covariate essentially tells ballgown which grouping to test (control/case). So by setting covariate="timepoint", you're telling ballgown to treat the time points as groups and to adjust for location via adjustvars=c("location").

From what I understand, adjustvars is for handling confounding factors, NOT for defining the groupings to test. That's why you get an error when specifying c("location=Earth"): it's looking for a variable called "location=Earth", which doesn't exist. So in your first analysis I think you aren't comparing Earth vs ISS; you're actually running a time-series experiment over all your data and telling ballgown that the location is a confounding factor. I think that because you haven't set timecourse = TRUE, it's treating the samples as a multigroup comparison and not as a time-series analysis like you're expecting. I'm not sure this interpretation is correct because, again, I haven't used ballgown, so double check this. The tutorial is very good and describes how to perform time-series analysis.

For subsetting, I think the command should be applied to the ballgown object rather than to pheno_data, with the value quoted inside the condition: subset(bg_ifungis_filt, "location == 'Earth'", genomesubset=FALSE). I think you need to repeat your analysis with the subsetted datasets (one for Earth and one for ISS), then run stattest with covariate="timepoint" and also specify timecourse=TRUE (check the manual for this). Leave out adjustvars unless there's a confounding factor to adjust for. You'll then have two time-series DE result sets that you will have to compare.

Then I would recommend performing another analysis comparing directly all Earth vs all ISS data. Without seeing your data structure, I think you need to define a new variable called group with values identifying which datasets are Earth/ISS. Then run this 2-group DE analysis to see if all Earth differ from all ISS transcripts.

Now it's possible you want to perform a time-series experiment of all earth and ISS datasets and to adjust for the location. But to be honest I'm not sure what you'd want to answer with this experimental setup. Something I should have mentioned right at the start is that I don't know your aims/hypothesis so I'm totally guessing here at what your objective is.

P.S. Check the getFC=TRUE option in the manual; I don't think it's available for time-series analysis.
