Stringtie output files
6.5 years ago
chipolino ▴ 150

I am new to StringTie, and this question is probably very simple, but I still don't get it. I have sorted BAM files (HISAT2 output, genome v19), and here is my StringTie command (v1.3.4):

stringtie hisat2_work/hisat2/alignments.sorted.bam -o stringtie_results/transcripts.gtf -G genes.GRCh37.gtf --rf -A stringtie_results/gene_abund.tab

This produces two output files: gene abundances (gene_abund.tab) and a transcript annotation file (transcripts.gtf). For example, gene_abund.tab contains this line:

Gene ID          Gene Name  Reference  Strand  Start  End    Coverage  FPKM      TPM
ENSG00000223972  DDX11L1    1          +       11869  14412  0.180934  0.129907  0.341143

But if I search transcripts.gtf for the gene name DDX11L1 (or its gene ID), it is absent. At the same time, I can find other genes from gene_abund.tab in transcripts.gtf, for example:

line in gene_abund.tab:

ENSG00000227232 WASH7P  1   -   14363   29806   16.906973   12.345821   32.420803

corresponding line in transcripts.gtf:

StringTie   transcript  14363   29370   1000    -   .   gene_id "STRG.2"; transcript_id "STRG.2.2"; reference_id "ENST00000423562"; ref_gene_id "ENSG00000227232"; ref_gene_name "WASH7P"; cov "1.478912"; FPKM "1.061831"; TPM "2.788425";

What could be the problem here? Why are some genes from gene_abund.tab missing from my transcripts.gtf file?

RNA-Seq Stringtie GTF • 7.1k views

Hello and welcome to Biostars,

To show the commands you use and file contents, please use the code button (the one with 101 010). This makes your post much more readable.

This time I did it for you.

fin swimmer

6.5 years ago

The gene that was not included (DDX11L1) has coverage that falls below the threshold. It is virtually not expressed at all.

Take a look at the -C and -c parameters of StringTie:

-C <cov_refs.gtf> StringTie outputs a file with the given name with all transcripts in the provided reference file that are fully covered by reads (requires -G).

-c <float> Sets the minimum read coverage allowed for the predicted transcripts. A transcript with a lower coverage than this value is not shown in the output. Default: 2.5

Kevin
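
To see which genes are affected, and how low their coverage is, the two output files can be compared directly. The following is a minimal Python sketch, not part of StringTie: the function names `genes_in_gtf` and `missing_genes` are hypothetical, and it assumes that gene_abund.tab is tab-separated with the header shown in the question and that reference-matched transcripts carry a `ref_gene_id` attribute, as in the quoted transcripts.gtf line. Note that the Coverage column is per gene, while the -c description above refers to transcripts, so the values are only a rough guide.

```python
import csv
import re

# Attribute name taken from the transcripts.gtf line quoted in the question.
REF_GENE_ID = re.compile(r'ref_gene_id "([^"]+)"')


def genes_in_gtf(gtf_lines):
    """Collect every ref_gene_id value found in the given GTF lines."""
    found = set()
    for line in gtf_lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        match = REF_GENE_ID.search(line)
        if match:
            found.add(match.group(1))
    return found


def missing_genes(abund_lines, gtf_lines):
    """Return (gene_id, gene_name, coverage) for genes absent from the GTF."""
    present = genes_in_gtf(gtf_lines)
    # Assumption: tab-separated table whose header includes
    # "Gene ID", "Gene Name" and "Coverage", as shown in the question.
    reader = csv.DictReader(abund_lines, delimiter="\t")
    return [
        (row["Gene ID"], row["Gene Name"], float(row["Coverage"]))
        for row in reader
        if row["Gene ID"] not in present
    ]


# Demo using only the lines quoted in the question.
abund = [
    "Gene ID\tGene Name\tReference\tStrand\tStart\tEnd\tCoverage\tFPKM\tTPM",
    "ENSG00000223972\tDDX11L1\t1\t+\t11869\t14412\t0.180934\t0.129907\t0.341143",
    "ENSG00000227232\tWASH7P\t1\t-\t14363\t29806\t16.906973\t12.345821\t32.420803",
]
gtf = [
    'StringTie\ttranscript\t14363\t29370\t1000\t-\t.\tgene_id "STRG.2"; '
    'transcript_id "STRG.2.2"; reference_id "ENST00000423562"; '
    'ref_gene_id "ENSG00000227232"; ref_gene_name "WASH7P"; '
    'cov "1.478912"; FPKM "1.061831"; TPM "2.788425";'
]
print(missing_genes(abund, gtf))
# -> [('ENSG00000223972', 'DDX11L1', 0.180934)]
```

On real data, both functions accept open file handles in place of the lists, for example `open("stringtie_results/gene_abund.tab")` and `open("stringtie_results/transcripts.gtf")`.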


I should additionally point out that DDX11L1 is a pseudogene. So, it makes sense that it may have minimal expression if it has no promoter sequence or TSS from which transcription at a meaningful level could occur.
