From StringTie to DESeq2
2
0
Entering edit mode
5.7 years ago
Morris_Chair ▴ 370

Dear all, After generating bunch of files with stringTie command line, I want to use DESeq2 for differential expression analysis. I understand that I have to create a matrix of read counts and the tool to use is prepDE.py,

prepDE.py -i list_gtf.txt

my output is

0 Treated_1
1 Treated_2
Traceback (most recent call last):
  File "/home/miniconda2/bin/prepDE.py", line 257, in <module>
    geneDict[geneIDs[i]][s[0]]+=v[s[0]]
KeyError: 'Treated_2'

Here is my list

[@ws7910 Files.GTF]$ cat list_gtf.txt
Treated_1 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_1.gtf
Treated_2 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_2.gtf
Treated_3 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_3.gtf
Untreated_1 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Untreated_1.gtf
Untreated_2 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Untreated_2.gtf
Untreated_3 /home/RNAseq/HISAT/BAM/sorted/Files.GTF/Untreated_3.gtf

The stringTie website says that for this purpose I have to use “files generated by StringTie (run with the -e parameter).” I used few command lines with stringtie, which one is it the one with that I was supposed to use the –e parameter? probably the error is due to that ..

my python version is Python 2.7.5

Thanks for support

RNA-Seq • 10k views
ADD COMMENT
1
Entering edit mode

From your stringtie output (coverage of transcripts) use tximport to summarize to the gene level, see the manual, and then use from DESeq2 DESeqDataSetFromTximport(). Manual of DESeq2, read it thoroughly, it is outstandingly comprehensive.

ADD REPLY
1
Entering edit mode

See also Stringtie to DESeq2 and remember to use both google and the search function. These standard questions have been asked many times before. In my experience you learn most (and in a sustainable fashion) if you solve these lowlevel questions yourself by extensive web research. Spoonfeeding it convenient but won't help you in the long term. There are many good resources like blogs and forum posts out there in the web, use them ;-)

ADD REPLY
0
Entering edit mode

HI ATpoint,

I have with me the paper "Transcript-level epxression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown", and for generating the table counts, shows the same command line of the link posted from you.

stringtie -eB -G transcripts.gff <source_file.bam>

What looks weird to me is that I have to create all this files .ctab that are named the same way for each of the bam files and at the end I wonder how can we distinguish one sample to another.

I agree with you when you say to use google for search and believe me that I do so before posting here, often they are very helpful but some other time very confusing. To me asking is also a way to be more confident that I’m doing ok

Thank you for the answer

ADD REPLY
2
Entering edit mode

I haven't used ballgownor stringtie in a while because for well-annotated transcriptomes I never felt the need to assemble transcripts. You might consider quantifying your reads with salmon, then aggregate to gene level with tximport and test for differential expression with DESeq2. This is a well-supported and recommended pipeline, with low computational costs and pretty fast plus salmon has features to correct for GC bias. It is my personal go-to pipeline for standard RNA-seq.

ADD REPLY
0
Entering edit mode

I tried salmon but I got low percentage of mapped reads for a sample, then I decided to go with stringtie.. I don't think to use ballgown anyways

I have an OT: Do you think that DESeq2 and cummRbund do the same job? for example, can I with DESeq2 see the differential expression of specific genes of my interest ? or just the top 10 example

Thanks

ADD REPLY
1
Entering edit mode

If you have low mapping rate this is not a problem with salmon but with your library. If you get higher percentage with hisat2 you should check why that is. Could well be that you have a lot of genomic DNA or rRNA contaminations which salmon will putput as unmapped. read_distribution.py from RSeQC is an option to get an idea where your reads mapped (UTR, exon, intron etc...). What is low percentage? And what is the read length (is it single-or paired-end)? Read length can have a big influence on mapping rate. As for cummRbund it is a downstream tool of cufflinks. I never used it so I cannot give much input what exactly it does but I would stick with the well-maintained DESeq2. If you want to use cummRbund you would need to re-run everything with cufflinks. Do you have human data, and what is the final goal? Differential analysis on the gene level, isoform detection, novel transcripts, alternative splicing?

ADD REPLY
0
Entering edit mode

Hi ATpoint, I answered to my self at some of your question, with low percentage I mean less the 20% of mapping unlike other tools with over 70% and the read lengh=101.
 I have human data, the final goal is to see which genes are upregulated or downregulate along with a set of genes of my interest (that’s why I was asking if DESeq2 can do that). 
Can I detect different isoform expressed from a gene with salmon and DESeq2? this is another aim

To be honest with you at the moment is not a big deal because I’m styding and practicing with RNAseq files that I had in my lab, the files for the actual experiment will come soon and I want to be ready to process’em easily.

ADD REPLY
0
Entering edit mode

Hi ATpoint, although I get low percentage of mapping rate with salmon, I still want to pursue with it

I have found this way for txt import

files <- file.path(dir, "salmon", samples$run, "quant.sf.gz")
names(files) <- paste0("sample", 1:6)
txi.salmon <- tximport(files, type = "salmon", tx2gene = tx2gene)
head(txi.salmon$counts)

my question is, do you rename the quant.sf files before compressing them? otherwise how do you know which belong to each condition?

thank you

ADD REPLY
1
Entering edit mode

read this strintie`s tutorial to know how exactly run prepDE.py

ADD REPLY
0
Entering edit mode

that's what I read and that's why there were the name of the samples in the first colum, just in case you need in the future they are needed ;)

ADD REPLY
0
Entering edit mode

Delete the first colum, the list must specify just the route to each gtf

ADD REPLY
0
Entering edit mode

Hi Buffo, thanks for the answer but I get errors if I delete the first colum

Error: Text file with sample ID and path invalid (/home/RNAseq/HISAT/BAM/sorted/Files.GTF/Treated_1.gtf)
[@ws7910 Files.GTF]$
ADD REPLY
0
Entering edit mode

Hello, may I ask this means if you want both novel transcript analysis and DESeq2 you have to run Stringtie twice? One -e one not.

ADD REPLY
1
Entering edit mode
5.7 years ago

I think you've got the confusion of counts typically being for genes, and StringTie being used for transcript estimations.

However, I see a couple things that may be worth mentioning:

1) In cufflinks, you should merge assemblies from different samples. This should be more robust than any individual assembly. However, while cufflinks has a separate cuffmerge command, Stringtie has a --merge parameter to be called from the stringtie executable.

2) Assuming you have gene symbol annotations, it is possible to calculate unique counts for the gene ID (although it is very important you not use the transcript ID, which will result in many more ambiguous reads, since each isoform will be thought of as an independent gene). Also, unless you are defining a novel gene, I think this arguably makes things unnecessary complicated. For example, you could use known annotations for most genes, and use derfinder to look for regions with no overlap with known genes (although, I would usually start with a conservative set of genes like RefSeq, but you would probably want to use a larger gene set like GENCODE to look for regions with a lack of annotations). Plus, you don't have functional annotations for the novel genes (that would be an advantage to something like MiTranscriptome for lncRNAs, since the name indicates the tumor type enrichment - where you are really quantifying cufflinks annotations from a large number of samples that somebody else processed, for your smaller number of samples).

2b) If you still want to use gene analysis from transcript annotations, summing counts is an option for programs like Salmon. I don't remember if either counts or effective counts are provided for StringTie, but that is theoretically an option.

Similarly, ATpoint had the good suggestion of using tximport for a Salmon quantification.

3) You could arguably use a larger set of exons for something like DEXSeq / JunctionSeq analysis for known genes. However, I think you would still want to primarily want to focus on extensions of known genes. You can also look for novel exons for a selected number of known genes with SGSeq.

ADD COMMENT
1
Entering edit mode

Hi Charles Warden, Thank you for your extensive explanation, My understanding is that the path HISAT,StringTie and Ballgown it's the most updated version of the Tuxedo suite and are used for differential gene expression analysis as well.

All I want to do is learning a workflow to compare the gene level expression of different samples and specific genes of my interest.

ADD REPLY
1
Entering edit mode

You are welcome.

While I thought this was relevant information for the discussion (and the only "answer" instead of a "comment"), I would question whether this should be called the "accepted answer."

If it is your own answer, do you have the option to "moderate" to move a comment to an answer? I think your stringtie -eB -G transcripts.gff is a relevant part of the answer (but I would recommend either using known annotations, the merged assembly, or using a unique gene count instead of stringtie). If you don't have the "moderate" option (and you would like to do this), I would be glad to help.

ADD REPLY
5
Entering edit mode
4.7 years ago
vyazar ▴ 50

Maybe it is a little too late but here is the answer (i.e. a single prepDE.py run on your entire group of GTF files) you are looking for, which was double-checked using single prepDE.py runs on each sample used for the actual analysis and also cross-checked using tximport to see if they yield similar results - which they did.

The problem is, there are transcript IDs that are not present in the first GTF file you described in your list_gtf.txt but show up in the subsequent GTF files - a python dictionary holds these: transcripts IDs being the "key" and sample ID+corresponding count number being the "value" in key-value pairs there. This is the bug for the line 257 error thrown. This is how you fix it:

I. Fire up a text editor (i.e. Atom, nano, vim, etc.) to open the prepDE.py

II. Scroll down to the line 255 and do "exactly" what I did to fix the bug:

252
253    for i,v in t_dict.iteritems():
254 ##        print i,v
254           try:
255               geneDict.setdefault(geneIDs[i],{}) #gene_id
256               geneDict[geneIDs[i]].setdefault(s[0],0)
257               geneDict[geneIDs[i]][s[0]]+=v[s[0]]
258           except:
259               pass
260

III. Make sure the indentation is correct (this is how python compiler understands the curly brackets we use in C to define a scope).

IV. Run the script by: python 2.7 prepDE.py -i list_gtf.txt

V. Now your single prepDE.py run will yield a single gene_count_results.txt and transcript_count_results.txt that will include all the samples in your list_gtf.txt file to generate the count matrix you can use downstream of your analysis pipeline.

As I said, I triple-checked the results and this solution works out just fine. Good luck!

ADD COMMENT
0
Entering edit mode

I signed up to say thank you! This solution works out just fine. Thank you for the fix!!!!

ADD REPLY

Login before adding your answer.

Traffic: 1619 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6