Converting gff file to gtf for htseq-count
8.7 years ago
natsterbug ▴ 10

After running TopHat2 (v2.1.0) on single-end 50 bp RNA-seq reads from S. tuberosum, I am now attempting to count the reads mapping to each feature with htseq-count. I am using the following command:

htseq-count -m intersection-nonempty --format=bam \
tophat_Kalkaska_control/tophat_K10C/accepted_hits.bam \
PGSC_DM_V403_genes_strand_filtered.gff

I receive the following error message:

Error occured when processing GFF file (line 3 of file PGSC_DM_V403_genes_strand_filtered.gff):
  Feature PGSC0003DME400103709 does not contain a 'gene_id' attribute
  [Exception type: ValueError, raised in count.py:53]

My understanding is that htseq-count expects a GTF file, in which each feature carries a gene_id attribute, rather than the GFF3 file I supplied. I would like to either convert my GFF file to GTF or modify the 9th (attributes) column of the GFF. A sample of my GFF file is below:

##gff-version   3
ST4.03ch01      Cufflinks       mRNA    152322  153489  .       -       .       ID=PGSC0003DMT400039136;Parent=PGSC0003DMG400015133;Source_id=RNASEQ26.809.0;Mapping_depth=16.192011;Class=4;name="Defensin"
ST4.03ch01      Cufflinks       exon    153389  153489  .       -       .       ID=PGSC0003DME400103709;Parent=PGSC0003DMT400039136
ST4.03ch01      Cufflinks       exon    152322  152593  .       -       .       ID=PGSC0003DME400103710;Parent=PGSC0003DMT400039136
ST4.03ch01      Cufflinks       intron  152594  153388  .       -       .       ID=PGSC0003DMI400065839;Parent=PGSC0003DMT400039136
ST4.03ch01      BestORF CDS     152418  152576  .       -       0       ID=PGSC0003DMC400026563;Parent=PGSC0003DMT400039136;name="Defensin"
ST4.03ch01      GLEAN   mRNA    160499  160663  .       -       .       ID=PGSC0003DMT400039133;Parent=PGSC0003DMG400015132;Source_id=PGSC0003DMG000019750;Class=2;name="Defensin"
ST4.03ch01      Cufflinks       mRNA    160379  161885  .       -       .       ID=PGSC0003DMT400039134;Parent=PGSC0003DMG400015132;Source_id=RNASEQ26.803.0;Mapping_depth=35.840147;Class=2;name="Defensin"
ST4.03ch01      Cufflinks       exon    161722  161885  .       -       .       ID=PGSC0003DME400103705;Parent=PGSC0003DMT400039134
ST4.03ch01      GLEAN   exon    160499  160663  .       -       .       ID=PGSC0003DME400103707;Parent=PGSC0003DMT400039133
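
For reference, the mapping that a GFF3-to-GTF conversion has to perform on records like these can be sketched in Python. This is only a minimal illustration, not how gffread itself works: it assumes every exon/CDS has a single Parent that is an mRNA, and that each mRNA's Parent is the gene ID. The function names (parse_gff3_attributes, gff3_to_gtf_lines) are made up for this example.

```python
def parse_gff3_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip().strip('"')
    return attrs


def split_columns(line):
    """Split a record into 9 columns (tab-delimited, else any whitespace)."""
    line = line.rstrip("\n")
    return line.split("\t") if "\t" in line else line.split(None, 8)


def gff3_to_gtf_lines(gff_lines, feature_types=("exon", "CDS")):
    """Return GTF lines for the chosen feature types.

    Assumption: exon/CDS -> Parent is the transcript (mRNA) ID,
    and mRNA -> Parent is the gene ID.
    """
    records = []
    transcript_to_gene = {}
    # First pass: remember which gene each transcript belongs to.
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = split_columns(line)
        if len(cols) != 9:
            continue
        attrs = parse_gff3_attributes(cols[8])
        if cols[2] == "mRNA" and "ID" in attrs and "Parent" in attrs:
            transcript_to_gene[attrs["ID"]] = attrs["Parent"]
        records.append((cols, attrs))

    # Second pass: write exon/CDS records with GTF-style attributes.
    gtf_lines = []
    for cols, attrs in records:
        if cols[2] not in feature_types:
            continue
        transcript = attrs.get("Parent")
        gene = transcript_to_gene.get(transcript)
        if gene is None:
            continue  # no known gene for this feature; skipped in this sketch
        attributes = f'transcript_id "{transcript}"; gene_id "{gene}";'
        gtf_lines.append("\t".join(cols[:8] + [attributes]))
    return gtf_lines
```

Reading the file with `open(...)` and passing the lines to gff3_to_gtf_lines would give exon and CDS records whose 9th column contains gene_id, the attribute the error message above complains about. Features such as intron are dropped here on purpose.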

Is the following the appropriate course of action?

gffread PGSC_DM_V403_genes_strand_filtered.gff -T -o PGSC_DM_V403_genes_strand_filtered.gtf

Thanks,
Natalie


I don't recall all of gffread's options, but that sounds about right. What do you get as a result?


I apologize for the extremely tardy response. Below is the output:

ST4.03ch00 GLEAN exon 63411 63498 . + . transcript_id "PGSC0003DMT400089830"; gene_id "PGSC0003DMG400039401";
ST4.03ch00 GLEAN exon 66359 66816 . + . transcript_id "PGSC0003DMT400089830"; gene_id "PGSC0003DMG400039401";
ST4.03ch00 GLEAN CDS 63411 63498 . + 0 transcript_id "PGSC0003DMT400089830"; gene_id "PGSC0003DMG400039401";
ST4.03ch00 GLEAN CDS 66359 66816 . + 2 transcript_id "PGSC0003DMT400089830"; gene_id "PGSC0003DMG400039401";
ST4.03ch00 GLEAN exon 70051 70281 . + . transcript_id "PGSC0003DMT400036367"; gene_id "PGSC0003DMG400013996";
ST4.03ch00 GLEAN exon 72021 73032 . + . transcript_id "PGSC0003DMT400036367"; gene_id "PGSC0003DMG400013996";
ST4.03ch00 GLEAN exon 73103 73227 . + . transcript_id "PGSC0003DMT400036367"; gene_id "PGSC0003DMG400013996";
ST4.03ch00 GLEAN CDS 70051 70281 . + 0 transcript_id "PGSC0003DMT400036367"; gene_id "PGSC0003DMG400013996";
ST4.03ch00 GLEAN CDS 72021 73032 . + 0 transcript_id "PGSC0003DMT400036367"; gene_id "PGSC0003DMG400013996";
ST4.03ch00 GLEAN CDS 73103 73227 . + 2 transcript_id "PGSC0003DMT400036367"; gene_id "PGSC0003DMG400013996";
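
One way to sanity-check output like this before handing it to htseq-count is to confirm that every exon record carries a gene_id, since exon and gene_id are the feature type and ID attribute htseq-count uses by default. Below is a rough Python sketch of such a check. The helper names (parse_gtf_attributes, summarize_gtf) are hypothetical, and the parsing is deliberately simple (it only understands key "value"; pairs).

```python
import re
from collections import Counter

# Matches GTF attribute pairs of the form: key "value";
ATTR_PATTERN = re.compile(r'(\S+) "([^"]*)";')


def parse_gtf_attributes(column9):
    """Turn 'gene_id "x"; transcript_id "y";' into a dict."""
    return dict(ATTR_PATTERN.findall(column9))


def summarize_gtf(gtf_lines, feature_type="exon", id_attribute="gene_id"):
    """Count features per ID and list line numbers lacking the attribute."""
    per_id = Counter()
    missing = []
    for number, line in enumerate(gtf_lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue
        line = line.rstrip("\n")
        # Real GTF is tab-delimited; fall back to whitespace for pasted text.
        cols = line.split("\t") if "\t" in line else line.split(None, 8)
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = parse_gtf_attributes(cols[8])
        if id_attribute in attrs:
            per_id[attrs[id_attribute]] += 1
        else:
            missing.append(number)
    return per_id, missing
```

If the second returned list is empty, no exon line is missing gene_id, so the specific ValueError from the original GFF should not come up again for that reason.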
