Edit: The other answer is probably a more proper way to get the gene-level matrix, using gexpr
https://www.rdocumentation.org/packages/ballgown/versions/2.4.2/topics/gexpr and should probably be marked as accepted. I'll leave mine here in case the info is useful, but I wrote my answer thinking of transcript per sample matrices. texpr
seems to do that as well.
I think you can get your FPKMs with Tablemaker
per sample with your ballgown output, but you might have to assemble your matrix yourself (should be very few lines of Bash and/or R).
See https://bioconductor.org/packages/devel/bioc/vignettes/ballgown/inst/doc/ballgown.html
Under Running Tablemaker
tablemaker -p 4 -q -W -G merged.gtf -o sample01_output read_alignments.bam
The output is 5 files, written to the specified output directory:
t_data.ctab: transcript-level expression measurements. One row per transcript. Columns are:
- ...
- FPKM: Cufflinks-estimated FPKM for the transcript (available for each sample)
Here's an example t_data.ctab
: https://github.com/alyssafrazee/ballgown/blob/master/inst/extdata/sample02/t_data.ctab
Note that each sample should have it's own t_data.ctab
file, so all you need to do it grab the FPKM column for each sample and merge them together.
You can get the relevant info from the ctab with cut
$ cut -f6,12 sample1.ctab
t_name FPKM
TCONS_00000010 800.706
TCONS_00000017 715.775
TCONS_00000020 579.569
TCONS_00000598 2205.3
TCONS_00000024 304.873
TCONS_00000613 101.848
TCONS_00000029 87.7438
TCONS_00000032 0.0249079
TCONS_00000637 323.543
And join with join
$ join -j1 <(cut -f6,12 sample1.ctab | head | sort -k1) <(cut -f6,12 sample2.ctab | head | sort -k1)
TCONS_00000010 800.706 800.706
TCONS_00000017 715.775 715.775
TCONS_00000020 579.569 579.569
TCONS_00000024 304.873 304.873
TCONS_00000029 87.7438 87.7438
TCONS_00000032 0.0249079 0.0249079
TCONS_00000598 2205.3 2205.3
TCONS_00000613 101.848 101.848
TCONS_00000637 323.543 323.543
For more than 2 files, you can join the first 2, then join the joined output with the next file etc. Or this thread had more elaborate solutions: https://stackoverflow.com/questions/10726471/join-multiple-files
Or R could do something similar if you load them as dataframes and used merge(all.x=T, all.y=T)
iteratively.
Whatever you end up doing, make sure that you test your results. Check how many rows you end up with. I'm not familiar with the stringtie ctab files, so I'm not sure if they would all have the same number of rows or different; that's something to check too. Spot check some transcripts, ideally a random sample for top, bottom, middle of your files; make sure they match with the appropriate sample values.
Hope it helps!
Why do you need FPKM units? These are derived from an outdated form of normalisation for RNA-seq, and effectively renders your samples incomparable in terms of differential expression analysis.
The idea of using prepDE is to generate 'raw' counts for the purpose of using other [more robust] differential expression analysis tools, like EdgeR and DESeq2.
Thanks for reminding. I didn't intend to feed the count matrix to EdgeR or DESeq2, so I just needed the matrix in the FPKM form. Is there something wrong with the FPKM normalization method?
Yes, it renders the samples non-comparable in terms of differential expression analysis (you can compare them, obviously, but the false-positives / -negatives will be higher than other methods). Not a life or death situation, for now, but just be aware that there are proven better methods.
Thanks for the tips. What are the better metrics used now? TPM or some others?
TPM is better but still not ideal. You should use those methods employed by either DESeq2 or EdgeR. Here is the manuscript that I had wanted to share with you (it is a very well cited manuscript):
An update (6th October 2018):
You should abandon RPKM / FPKM. They are not ideal where cross-sample differential expression analysis is your aim; indeed, they render samples incomparable via differential expression analysis:
Please read this: A comprehensive evaluation of normalization methods for Illumina high-throughput RNA sequencing data analysis
Also, by Harold Pimental: What the FPKM? A review of RNA-Seq expression units
I see, thank you very much!