Hi, there, I've generated gtf files with Stringie as been explained in Hisat Stringtie Ballgown paper by pertea 2016 doi: 10.1038/nprot.2016.095 https://www.ncbi.nlm.nih.gov/pubmed/27560171
Now, since ballgown doesn't, currently support TPM, I wanted to use DESeq2 in R to do DE analysis,
First I need to extract reads from Stringie generated gtf files.
I found this nice page that explains how to do that with python script that they also provide: http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#deseq
Now, when I run the python script with -i option and provide the list path to python, or without the list option I get an error
python prepDE.py -i phenotype_2.txt
File "prepDE.py", line 32 print "Error: Text file with sample ID and path invalid (%s)" % (line.strip()) ^ SyntaxError: invalid syntax
The script can be checked here: http://ccb.jhu.edu/software/stringtie/dl/prepDE.py
So it doesn't recognize the list and it doesn't see the folders with gtf files, although it is build according to the paper.
Any suggestions are appreciated, thank you.
Not a direct answer to the problem mentioned in your question, but, why not just use your StringTie-generated GTF with your original FASTQ files in order to determine raw read count abundances using
Kallisto
orSalmon
, or use the StringtTie-generated GTF with your aligned BAMs and do the same withfeatureCounts
orBEDTools coverage
. Then, input the raw read count abundances into DESeq2 for normalisation and differential expression analysis...Kevin
That's a nice option I thought about that. Just don't know how it works yet, need to check the documentation on any of those tools, I thought using htseq with the hisat aligned files but I did not find that htseq output TPM too... so basically it's easier to use something that I already made and not to run tools that I don't know and troubleshoot them most of the time till I get the results...I'll do that if I don't have a choice...
Do you definitively need TPM counts?
HTSeq, Kallisto, Salmon, etc., produce raw counts, which are then input to and normalised in EdgeR or DESeq2.
I definitely need TPM counts for single replicates, raw counts are fine to use in DESeq2, for DE purposes only, but when presenting the data you present read counts, and since TPM is more standardized across replicates I do need those and not FPKM
But EdgeR will normalise based on TMM (trimmed mean of M), whilst DESeq2 performs a 'geometric' normalisation. Both programs produce data that is suitable for visualisation, too.
Alright, that's fine, I still need to solve the issue of how to extract the counts from Stringtie, I haven't got yet to the DESeq2 point, so it's not an issue right now, extracting the counts from Stringtie files will produces CSV file with all the features...
Just run featureCounts with the GTF produced by stringTie and call it done. Then you know the counting was done correctly and DESeq2/edgeR/whatever will be happy.
What does your phenotype_2.txt file look like? The error seems to be saying that it is improperly formatted. It should look like the sample on the page you linked, with sample ID and full path separated by a tab.
It is looks like that, And this error happens when the list is not supplied as well, thanks for the reply though...
`
Can you paste a few lines of your phenotype_2.txt?
**EDIT nevermind, look at Kevin's response below. You are using Python3+ and the script is meant for python2.7.
amusingly it says:
but clearly not that "later" 3.0 won't work.
And since there never will be a Python 2.8 this is effectively a Python 2.7 only script.
For the OP install and run it with Python 2 (use conda or some other environment manager to keep the different python's from interfering)
FeatureCounts is not available for windows pc and works only with sam and bam files https://www.rdocumentation.org/packages/Rsubread/versions/1.22.2/topics/featureCounts http://subread.sourceforge.net/
Thanks for all the workaround suggestions, I'm trying to solve one issue to use python provided script to extract the reads from stringtie generated gtf file, please don't suggest any more workarounds...
Okay, I presume that your input file follows the recommended format given on the page to which you've linked: http://ccb.jhu.edu/software/stringtie/dl/sample_lst.txt ?
That Python script is [hopefully] just doing exactly the same as what both Devon and I have been suggesting, i.e., to derive raw counts from your files and using your StringTie-generated GTF as reference, but it does it in an automated way. DESeq2, then, will just produce counts normalised by geometic mean and not TPM.
You should already have TPM counts in your StringTie output.
Yes, TPM counts are there. Thank you I knew that.
Take a look here: Stringtie prepDE.py Error line 32
@Kevin Blighe, that seem to work with python 2.7, I have other issue, that it doesn't see gtf file, I'll try troubleshooting it myself first, Thanks for the help.
@Kevin Great it works. Thank you
Hello dear, i am very new to Linux and stringtie pipeline for RNA seq. i finished the stringtie protocol upto yhe step 6| Estimate transcript abundances and create table counts for Ballgown, after that we can switch to DEseq for differential expression analysis. i much tried but donot understand how to do this steps as mentioned. There are two possible ways to provide input: one option is to provide a directory with the same structure as created in the StringTie protocol paper in preparation for Ballgown. By default (no -i option), this is located in the working directory and named ballgown (see below). ./ballgown/sample1/sample1.gtf ./ballgown/sample2/sample2.gtf ./ballgown/sample3/sample3.gtf
Alternatively, one can provide a textfile listing sample IDs and their respective paths (sample_lst.txt). kindly guide me. i shall be very thankfull
Please do not add multiple reactions in old threads. The best would probably be to open a separate question for this.
Hello anyone can provide the R script for DEseq2 for differential expression analysis using after the stringtie??? thanks in advance
Please do not add multiple reactions in old threads. The best would be to read the DESeq2 manual. There is a lot of documentation. If you have questions after that you should open a separate question for this.