Question

Regarding getting gene count tables for DESeq from StringTie using prepDE.py

7.3 years ago

pixie@bioinfo ★ 1.5k

Hello, I followed the pipeline for StringTie and prepDE.py exactly as given in http://ccb.jhu.edu/software/stringtie/index.shtml?t=manual#de, running prepDE.py on the ballgown directory it creates.

However, most of the transcript IDs given in the StringTie merged file are not present in the gene count tables; the tables mostly contain StringTie IDs (MSTRG). Would it be okay to take the IDs from the merged file and use them to replace the StringTie IDs in the count matrix?

This is my merged file:

chr01 StringTie transcript 2983 10815 1000 + 0 gene_id "MSTRG.1" transcript_id "Os01t0100100-01" ref_gene_id "Os01g0100100"
chr01 StringTie exon 2983 3268 1000 + 0 gene_id "MSTRG.1" transcript_id "Os01t0100100-01" exon_number "1"
chr01 StringTie exon 3354 3616 1000 + 0 gene_id "MSTRG.1" transcript_id "Os01t0100100-01" exon_number "2"
chr01 StringTie exon 4357 4455 1000 + 0 gene_id "MSTRG.1" transcript_id "Os01t0100100-01" exon_number "3"

This is my gene count matrix for all the samples:

MSTRG.1 41 86 143 167 304 343 46 51 170 320 44 69 167 102 129 311 310 114 97 301 305 25 62
MSTRG.10 9 6 4 3 6 31 2 4 3 6 3 2 36 2 2 17 11 2 1 5 6 2 6
MSTRG.100 8 13 10 14 14 18 5 4 8 11 0 0 0 0 2 0 6 0 0 4 2 0 0

RNA-Seq stringtie • 5.8k views

I had the same problem; however, it is solved in version 1.3.3, where the merge step is no longer necessary.

ADD REPLY • link 7.1 years ago by Buffo ★ 2.4k

Thanks for the input. I am planning to re-run using version 1.3.3.

ADD REPLY • link 7.1 years ago by pixie@bioinfo ★ 1.5k

Hi, unfortunately my issue is not resolved with version 1.3.3. I ran the program three times: once to create the GTF files, once to merge, and once for the Ballgown outputs. I then ran prepDE.py on the ballgown folder to get the gene count matrix. I still get mostly StringTie IDs. Is there a way I can map the IDs back? I would be grateful if I could email you part of my data so you could have a look.

ADD REPLY • link 7.0 years ago by pixie@bioinfo ★ 1.5k

Hello Saeed, how did you carry out the next step after step 6 ("Estimate transcript abundances and create table counts for Ballgown") and switch to DESeq? Kindly guide me; I am very new to this work. Thanks.

ADD REPLY • link 6.0 years ago by kashifahmad750 • 0

This is probably more appropriate as a new question.

ADD REPLY • link 6.0 years ago by WouterDeCoster 47k

score 0 · Answer 1 · 2017-08-28

7.3 years ago

Satyajeet Khare ★ 1.6k

You can change line 25 of prepDE.py from

RE_GENE_ID=re.compile('gene_id "([^"]+)"')

to

RE_GENE_ID=re.compile('transcript_id "([^"]+)"')

and then run prepDE.py. However, I am not sure why you would not simply generate a transcript count matrix using prepDE.py as follows:

prepDE.py -i ballgown -g gene_count_matrix.csv -t transcript_count_matrix.csv

transcript_count_matrix.csv should have the transcript IDs.
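To see what the one-line change does, here is a minimal sketch that applies both patterns to the attributes of one transcript line from the merged file in the question. Only the two regular expressions come from prepDE.py as quoted above; the variable names and the standalone script are illustrative and not part of prepDE.py.

```python
import re

# Attribute text of the transcript line from the merged GTF in the question.
attributes = (
    'gene_id "MSTRG.1" transcript_id "Os01t0100100-01" '
    'ref_gene_id "Os01g0100100"'
)

# The original pattern and the suggested replacement, as quoted above.
RE_GENE_ID = re.compile('gene_id "([^"]+)"')
RE_TRANSCRIPT_ID = re.compile('transcript_id "([^"]+)"')

# search() returns the leftmost match, so the StringTie gene_id is found
# before the ref_gene_id that appears later on the same line.
gene_id = RE_GENE_ID.search(attributes).group(1)
transcript_id = RE_TRANSCRIPT_ID.search(attributes).group(1)

print(gene_id)        # MSTRG.1
print(transcript_id)  # Os01t0100100-01
```

With the original pattern the rows are keyed by the MSTRG gene ID; with the replacement they are keyed by the transcript ID instead.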

Thank you for the reply. What I meant was that most of my gene IDs are missing from the gene count tables and are replaced with StringTie IDs (MSTRG). When I looked into the merged GTF file, I saw that the gene IDs (of the transcripts) are not present in the count table; the MSTRG IDs are given instead. I am interested only in gene-level analysis for now, hence I need the gene counts. Can I just map them from the merged GTF file?

ADD REPLY • link 7.3 years ago by pixie@bioinfo ★ 1.5k

I am a little confused. MSTRG.1 is a gene ID, and it is present in your GTF file as well as in the gene count matrix. Do you mean ref_gene_id? In that case you can replace line 25 with this one:

RE_GENE_ID=re.compile('ref_gene_id "([^"]+)"')
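If you would rather not edit prepDE.py, the same mapping can probably be done afterwards on the finished matrix. The following is a minimal sketch, assuming the merged GTF carries ref_gene_id on its transcript lines as in the excerpt above and that the count matrix is a CSV with the gene ID in the first column. The function names are hypothetical, not part of StringTie or prepDE.py.

```python
import csv
import re
from collections import defaultdict

# The lookbehind keeps this pattern from matching inside "ref_gene_id".
RE_GENE_ID = re.compile(r'(?<![A-Za-z_])gene_id "([^"]+)"')
RE_REF_GENE_ID = re.compile(r'ref_gene_id "([^"]+)"')


def build_id_map(gtf_lines):
    """Map each StringTie gene_id to the reference gene IDs seen with it."""
    refs = defaultdict(set)
    for line in gtf_lines:
        if line.startswith("#"):
            continue
        gene = RE_GENE_ID.search(line)
        ref = RE_REF_GENE_ID.search(line)
        if gene and ref:
            refs[gene.group(1)].add(ref.group(1))
    # Several reference genes under one MSTRG ID are joined with "|".
    return {gene: "|".join(sorted(ids)) for gene, ids in refs.items()}


def rename_rows(rows, id_map):
    """Swap the first column for the reference ID when one is known."""
    return [
        [id_map.get(row[0], row[0])] + row[1:] if row else row
        for row in rows
    ]


def rename_count_matrix(gtf_path, in_csv, out_csv):
    """Read the merged GTF and a count CSV, write a renamed copy of the CSV."""
    with open(gtf_path) as gtf:
        id_map = build_id_map(gtf)
    with open(in_csv, newline="") as src:
        rows = list(csv.reader(src))
    with open(out_csv, "w", newline="") as dst:
        csv.writer(dst).writerows(rename_rows(rows, id_map))
```

Calling rename_count_matrix with the paths to your merged GTF, gene_count_matrix.csv, and an output file of your choice writes a renamed copy and leaves the original untouched. Genes with no reference transcript keep their MSTRG ID, and the header row passes through unchanged. Two things are worth checking by hand: rows whose new ID contains "|" (one StringTie gene spanning several reference genes), and duplicate row names that arise when several MSTRG genes map to the same reference gene.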

ADD REPLY • link 7.3 years ago by Satyajeet Khare ★ 1.6k

score 0 · Answer 2 · 2017-12-28

7.0 years ago

Saeed ▴ 10

Hi everyone,

I followed the StringTie then DESeq2 pipeline for differential gene expression and it is working well. I was wondering, is it possible to use transcript_count_matrix.csv to do differential isoform (alternative splicing) analysis with DESeq2?
