How to modify a gff3 file for HTSeq?
9.9 years ago
Gary ▴ 480

I would like to use HTSeq (htseq-count) and edgeR to analyze our alligator RNA-Seq data. The alligator GFF3 file I downloaded from GigaDB is not accepted by htseq-count; an excerpt is shown below. What I need is a gene symbol in each exon row, e.g.

scaffold-729 AUGUSTUS exon 101305 101913 . - . ID=exon67799;Parent=rna5642;Name=WNT3A

However, the exon rows contain no gene symbol; the symbol I need appears only in the gene row. Could you show me how to modify the GFF3 file so that htseq-count accepts it? Many thanks.

Gary

scaffold-729    AUGUSTUS    gene    101305    186845    1    -    .    ID=gene3770;Name=WNT3A;gene=WNT3A;Dbxref=CrocBase:AMISG003770,GeneID:395396,PhylomeDB:Phy004KWLF_ALLMI;Note=WNT3A inferred by phylogenetic tree homology from Gallus gallus EntrezGene:395396 PhylomeDB:Phy004KWLF_ALLMI
scaffold-729    AUGUSTUS    mRNA    101305    186845    .    -    .    ID=rna5642;Name=AMIST005642;transcript_id=AMIST005642;gene=WNT3A;Dbxref=CrocBase:AMIST005642,GeneID:395396,PhylomeDB:Phy004KWLF_ALLMI;Parent=gene3770;Note=WNT3A inferred by phylogenetic tree homology from Gallus gallus EntrezGene:395396 PhylomeDB:Phy004KWLF_ALLMI
scaffold-729    AUGUSTUS    CDS    101434    101913    .    -    0    ID=cd59543;Parent=rna5642
scaffold-729    AUGUSTUS    CDS    106298    106563    .    -    2    ID=cd59544;Parent=rna5642
scaffold-729    AUGUSTUS    CDS    141700    141941    .    -    1    ID=cd59545;Parent=rna5642
scaffold-729    AUGUSTUS    CDS    186490    186560    .    -    0    ID=cd59546;Parent=rna5642
scaffold-729    AUGUSTUS    exon    101305    101913    .    -    .    ID=exon67799;Parent=rna5642
scaffold-729    AUGUSTUS    exon    106298    106563    .    -    .    ID=exon67800;Parent=rna5642
scaffold-729    AUGUSTUS    exon    141700    141941    .    -    .    ID=exon67801;Parent=rna5642
scaffold-729    AUGUSTUS    exon    186490    186845    .    -    .    ID=exon67802;Parent=rna5642
scaffold-729    AUGUSTUS    intron    101914    106297    .    -    .    ID=intron53902;Parent=rna5642
scaffold-729    AUGUSTUS    intron    106564    141699    .    -    .    ID=intron53903;Parent=rna5642
scaffold-729    AUGUSTUS    intron    141942    186489    .    -    .    ID=intron53904;Parent=rna5642
gff3 next-gen RNA-Seq HTSeq • 11k views

Maybe you can try -i="Name". See the doc.


Many thanks. However, after trying -i=Name or -i='Name', htseq-count shows an error:

Error occured when processing GFF file (line 6 of file amis_RNASeqSoftware_v1.2.gff3):
Feature exon1 does not contain a Name attribute
[Exception type: ValueError, raised in count.py:53]

I guess that htseq-count can only use the Name attribute if it is on the same row as the exon feature.

Gary
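The guess above can be checked directly. What follows is a minimal sketch, not part of HTSeq: the helpers `parse_attributes` and `attribute_coverage` are hypothetical names, the sample rows are shortened copies of the rows in the question, and the columns are assumed to be tab-separated as the GFF3 format requires.

```python
from collections import defaultdict

def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def attribute_coverage(lines, attribute="Name"):
    """Per feature type, count rows (with, without) the given attribute."""
    counts = defaultdict(lambda: [0, 0])  # feature type -> [with, without]
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/directive lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9:
            continue  # not a feature row
        attrs = parse_attributes(fields[8])
        counts[fields[2]][0 if attribute in attrs else 1] += 1
    return {ftype: tuple(c) for ftype, c in counts.items()}

# Shortened rows from the question (Dbxref and Note attributes dropped).
sample = [
    "scaffold-729\tAUGUSTUS\tgene\t101305\t186845\t1\t-\t.\tID=gene3770;Name=WNT3A;gene=WNT3A",
    "scaffold-729\tAUGUSTUS\tmRNA\t101305\t186845\t.\t-\t.\tID=rna5642;Name=AMIST005642;gene=WNT3A;Parent=gene3770",
    "scaffold-729\tAUGUSTUS\texon\t101305\t101913\t.\t-\t.\tID=exon67799;Parent=rna5642",
]
coverage = attribute_coverage(sample)
print(coverage)
```

If the exon entry reports rows without the attribute, that is consistent with the "does not contain a Name attribute" error: htseq-count reads the ID attribute from the rows of the feature type it counts, which is exon by default.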


As GouthamAtla implied, the defaults are appropriate for GTF files from Ensembl. They aren't always applicable to any random GFF file (that's part of the problem with GFF as a format). When something doesn't work, reading the documentation should be your first step.


Thanks. You are right. By default, htseq-count expects a GTF file. I can run htseq-count without problems on mouse and chicken RNA-Seq data, using RefSeq or Ensembl annotation files downloaded from iGenomes. I think my problem is that I don't know how to modify the alligator GFF file to match the format that htseq-count needs, as described in its documentation.

Gary
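To make the target format concrete: by default htseq-count counts rows of type exon and groups them by the gene_id attribute, written GTF-style as `gene_id "..."`. Here is a minimal sketch of rewriting one GFF3 exon row into that shape. The function name `gff3_exon_to_gtf` is hypothetical, and the gene and transcript identifiers are passed in by hand here; in a real file they would have to be looked up through the Parent attribute.

```python
def gff3_exon_to_gtf(line, gene_id, transcript_id):
    """Rewrite the attribute column of one GFF3 exon row in GTF style.

    Columns 1-8 are kept unchanged; only column 9 is replaced.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    # GTF attributes are 'key "value";' pairs separated by spaces.
    fields[8] = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
    return "\t".join(fields)

# Exon row from the question, assumed to be tab-separated in the real file.
exon_row = "scaffold-729\tAUGUSTUS\texon\t101305\t101913\t.\t-\t.\tID=exon67799;Parent=rna5642"
converted = gff3_exon_to_gtf(exon_row, "WNT3A", "AMIST005642")
print(converted)
```

This only illustrates the row layout; it is not a full GFF3-to-GTF converter.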

9.9 years ago
michael.ante ★ 3.9k

You may have a look at this.

I tried the "gffread" approach as well as the "rtracklayer" approach. Both worked perfectly fine for me.
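For anyone who would rather stay in a script than use either tool, a plain-Python alternative is to copy the gene's Name onto each exon row by following the Parent links (exon to mRNA to gene). This is a minimal sketch and not what gffread or rtracklayer do internally. The function names are hypothetical, and it assumes tab-separated columns and a single Parent per feature (comma-separated multiple parents are not handled).

```python
def parse_attributes(column9):
    """Split a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for item in column9.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs

def add_gene_name_to_exons(lines, attribute="Name"):
    """Return the rows, with the gene's `attribute` appended to each exon row."""
    rows = [line.rstrip("\n") for line in lines]
    parent_of = {}  # feature ID -> Parent ID
    gene_name = {}  # gene ID -> value of `attribute` on the gene row
    # First pass: index the parent links and the gene names.
    for row in rows:
        fields = row.split("\t")
        if row.startswith("#") or len(fields) != 9:
            continue
        attrs = parse_attributes(fields[8])
        if "ID" in attrs and "Parent" in attrs:
            parent_of[attrs["ID"]] = attrs["Parent"]
        if fields[2] == "gene" and "ID" in attrs and attribute in attrs:
            gene_name[attrs["ID"]] = attrs[attribute]
    # Second pass: walk each exon up to its top-level ancestor.
    out = []
    for row in rows:
        fields = row.split("\t")
        if not row.startswith("#") and len(fields) == 9 and fields[2] == "exon":
            attrs = parse_attributes(fields[8])
            if attribute not in attrs:
                node, seen = attrs.get("Parent"), set()
                while node in parent_of and node not in seen:
                    seen.add(node)  # guard against cyclic Parent links
                    node = parent_of[node]
                if node in gene_name:
                    fields[8] = fields[8].rstrip(";") + f";{attribute}={gene_name[node]}"
                    row = "\t".join(fields)
        out.append(row)
    return out

# Shortened rows from the question (Dbxref and Note attributes dropped).
sample = [
    "scaffold-729\tAUGUSTUS\tgene\t101305\t186845\t1\t-\t.\tID=gene3770;Name=WNT3A;gene=WNT3A",
    "scaffold-729\tAUGUSTUS\tmRNA\t101305\t186845\t.\t-\t.\tID=rna5642;Name=AMIST005642;gene=WNT3A;Parent=gene3770",
    "scaffold-729\tAUGUSTUS\texon\t101305\t101913\t.\t-\t.\tID=exon67799;Parent=rna5642",
]
result = add_gene_name_to_exons(sample)
print(result[2])
```

The exon row then has the form asked for in the question (ending in Name=WNT3A), so pointing the htseq-count ID attribute at Name should find it on the exon rows. Exons whose gene has no Name are left unchanged and would still trigger the same error.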


The problem here is that htseq-count looks for the gene_id attribute by default when counting. In this case you can just tell htseq-count to use Name instead of gene_id.


Many thanks. However, after trying -i=Name, htseq-count shows an error:

Error occured when processing GFF file (line 6 of file amis_RNASeqSoftware_v1.2.gff3): 
Feature exon1 does not contain a Name attribute
[Exception type: ValueError, raised in count.py:53]

I guess that htseq-count can only use the Name attribute if it is on the same row as the exon feature.


Thank you so much. I believe this could be just what I need. I will try to learn gffread, even though using Unix command lines is still not easy for me. Thanks again.


Did you solve the problem?
