Question

StringTie cannot be run because of a naming convention problem

6.9 years ago

nazaninhoseinkhan ▴ 530

Hi all, I know this question has been asked before, but I still cannot solve my problem.

I obtained both the GTF file and the corresponding FASTA file from the Pseudomonas aeruginosa database.

However, StringTie does not run properly and ends with this message: "WARNING: no reference transcripts were found for the genomic sequences where reads were mapped! Please make sure the -G annotation file uses the same naming convention for the genome sequences."

The FASTA header and the first lines of the GTF file are as follows:

gi|116048575|ref|NC_008463|pseudocap|138 [Pseudomonas aeruginosa UCBPP-PA14 chromosome, complete genome.]
TTTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGCAGGTCAATTTCCAACCAGCGATGACGTAATAGATAGATACAAGGAAGTCATTTTTCTTTTAAAGGATAG

chromosome PseudoCAP CDS 483 2027 . + 0 gene_id "PA14_00010"; transcript_id "1650836"; locus_tag "PA14_00010"; name "dnaA ,chromosomal replication initiation protein"; replicon_xref "NC_008463"
chromosome PseudoCAP CDS 2056 3159 . + 0 gene_id "PA14_00020"; transcript_id "1650838"; locus_tag "PA14_00020"; name "dnaN ,DNA polymerase III subunit beta"; replicon_xref "NC_008463"

I would appreciate any help. Thanks in advance.

Nazanin Hosseinkhan

StringTie Gene naming convention GTF • 2.2k views

updated 6.9 years ago by jean.elbers ★ 1.7k • written 6.9 years ago by nazaninhoseinkhan ▴ 530

If you are experiencing command line issues, you may be interested in DEWE (http://www.sing-group.org/dewe), a GUI to execute differential expression analyses that also allows you to use StringTie separately. Regards.

6.9 years ago by Hugo ▴ 380

score 0 · Answer 1 · 2018-02-04

You need to make sure that the GTF and FASTA files come from the same source so that their sequence names are compatible. Here, the FASTA file uses a very different name for the chromosome than the GTF file does.

FASTA file seqname/chromosome

gi|116048575|ref|NC_008463|pseudocap|138 [Pseudomonas aeruginosa UCBPP-PA14 chromosome, complete genome.]

GTF file seqname/chromosome

chromosome
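The comparison above can be reproduced on the command line by listing the sequence names each file uses. This is only a sketch: `demo.fa` and `demo.gtf` are hypothetical stand-in files with contents abbreviated from the question, and on real data you would point the same commands at your own FASTA and GTF.

```shell
# Stand-in files (hypothetical names); replace with your real FASTA and GTF.
tmp=$(mktemp -d)
printf '>gi|116048575|ref|NC_008463|pseudocap|138 [Pseudomonas aeruginosa UCBPP-PA14 chromosome, complete genome.]\nTTTAAAGAGACC\n' > "$tmp/demo.fa"
printf 'chromosome\tPseudoCAP\tCDS\t483\t2027\t.\t+\t0\tgene_id "PA14_00010"; transcript_id "1650836";\n' > "$tmp/demo.gtf"

# Sequence names in the FASTA: header text up to the first whitespace.
grep '^>' "$tmp/demo.fa" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmp/fasta_names.txt"

# Sequence names in the GTF: column 1, skipping comment lines.
grep -v '^#' "$tmp/demo.gtf" | cut -f1 | sort -u > "$tmp/gtf_names.txt"

fasta_names=$(cat "$tmp/fasta_names.txt")
gtf_names=$(cat "$tmp/gtf_names.txt")
echo "FASTA: $fasta_names"
echo "GTF:   $gtf_names"

# GTF names that do not appear (as whole lines) among the FASTA names.
missing=$(grep -Fxv -f "$tmp/fasta_names.txt" "$tmp/gtf_names.txt")
echo "In GTF but not in FASTA: $missing"

# If samtools is installed, the names in the alignment can be checked too
# (aln.bam is a placeholder):
# samtools view -H aln.bam | grep '^@SQ'
```

Any name reported on the last line is one that the annotation uses but the genome file does not, which is the situation the StringTie warning describes.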

One thing you could do is manually change the FASTA header to >chromosome (if you don't want to write a regular expression to change the GTF file's contents). Note that this assumes the FASTA and GTF files are indeed from the exact same annotation run.
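A minimal sketch of that header change, assuming the FASTA holds a single sequence so that its first line is the only header. The file names are hypothetical and the sequence is abbreviated. Keep in mind that the alignments carry the reference names of the FASTA that was used to build the aligner index, so after renaming you would most likely need to rebuild the index and realign before StringTie sees the new name.

```shell
# Stand-in single-sequence FASTA (hypothetical name, abbreviated sequence).
tmp=$(mktemp -d)
printf '>gi|116048575|ref|NC_008463|pseudocap|138 [Pseudomonas aeruginosa UCBPP-PA14 chromosome, complete genome.]\nTTTAAAGAGACC\nGGCGATTCTAGT\n' > "$tmp/demo.fa"

# Rewrite only line 1, and only if it is a header line; sequence lines are untouched.
sed '1s/^>.*/>chromosome/' "$tmp/demo.fa" > "$tmp/demo.renamed.fa"

new_header=$(head -n 1 "$tmp/demo.renamed.fa")
echo "$new_header"
```

Writing to a new file rather than editing in place keeps the original download available in case the rename needs to be undone.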

score 0 · Answer 2 · 2018-02-04

I don't have a GTF file to test this with StringTie, but this is how you would change chromosome to gi|116048575|ref|NC_008463|pseudocap|138 [Pseudomonas aeruginosa UCBPP-PA14 chromosome, complete genome.] in the first column throughout the GTF file:

awk -F'\t' -v OFS='\t' '$1 == "chromosome" {$1 = "gi|116048575|ref|NC_008463|pseudocap|138 [Pseudomonas aeruginosa UCBPP-PA14 chromosome, complete genome.]"} 1' name-of-gtf-file.gtf > name-of-new-gtf-file.gtf

I don't know whether the FASTA header was truncated in the BAM file at the first space following "138", so here is another replacement string that StringTie might require instead:

awk -F'\t' -v OFS='\t' '$1 == "chromosome" {$1 = "gi|116048575|ref|NC_008463|pseudocap|138"} 1' name-of-gtf-file.gtf > name-of-new-gtf-file.gtf