Htseq-count problem with gtf/gff file
0
0
Entering edit mode
3 months ago

Hi,

I have been trying to analyse RNA-seq data from a bacterial isolate for which I have the genome sequenced. I have used hisat2 for the alignment and am now trying to get count data using HTseq, however, I seem to have problems with the gtf/gff file.

I have started by using the gtf file, but I get the error:

Warning: No features of type '{args.feature_type}' found.
The alignment file has no chromosomes in common with the GFF/GTF file.

I have checked that the chromosomes (in this case they would be different contigs) are called the same in the bam and gtf file, and to me it looks like they do (see below for reference)

GTF:

.47320_2_326.1  GenBank transcript      207     524     .       -       .       transcript_id "47320_2#326.contigs_spades_00001"; gene_id "47320_2#326.contigs_spades_00001"

.47320_2_326.1  GenBank CDS     207     524     .       -       1       transcript_id "47320_2#326.contigs_spades_00001";
.
47320_2_326.1  GenBank transcript      849     1991    .       +       .       transcript_id "47320_2#326.contigs_spades_00002"; gene_id "47320_2#326.contigs_spades_00002"

BAM/SAM

.47320_2_326.1  1293343 2500979 0
.47320_2_326.2  389930  614450  0
.47320_2_326.3  278970  234496  0
.47320_2_326.4  231691  394048  0

Since this was not working, I have also tried using the gff file, but I get the following error:

Error processing GFF file (line 3049 of file 47320_2#326.contigs_spades.gff):
  not enough values to unpack (expected 9, got 1)
  [Exception type: ValueError, raised in features.py:138]

I am no expert in bioinformatics, so any help would be very much appreciated!

htseq • 836 views
ADD COMMENT
0
Entering edit mode

not enough values to unpack (expected 9, got 1)

please, show us the output of

awk -F '\t' '/^#/ {next;} (NF!=9) {print;}' /path/to/gff  | head | tr "\t" "#"
ADD REPLY
0
Entering edit mode

Hi Pierre,

Thanks for the help! I get this:

>.47320_2_326.1
CGCCTTTCCAATTATTCTCTTGGAAGTAGCCTTGTTGAACCATCTCCTCAACTATATTCC
AATAATTATCTTTAGTTTTTTGAACACCTAAAGAATAGTTGAATACAAACCTACAACATC
CAATAGTCTTGTTAATTAACTCGGCTTGTTTCTTAGTTGGGTATATTCTAAATTTATATG
CTTTCTTTCTTATCACTTGTCTCACCTCACTTTACTTGTTGATTTTGAATATAGTTTCTA
ATTTGTTCTTCTGTATTCTCTGATACAGTAGCTACAAAATAGCTTGGATTCCATAGGTTT
CCTTCCCATAGATATCTTTTAATTTCAGGATGTTTCAAGAACAGTAACCTTGCAGATACT
CCCTTTAACGCTTTAATTATATTTGGAATTTGATGTTGTGGAGAACATTCTATTAACAGG
TGTATATAGTCTTTATCAGTTTCCATTTCAGAAATTTCAAAATTATTATCATTTGCTATT
TTTAATATTATTTCTTTAAGAGATTTATCCACATCACCAAACAATACTTTTCTTCTGTAT

I have also changed the code used with the gtf file and while now I don't get that error anymore, the new issue is that pretty much all my reads fall now in the no_feature category.

__no_feature    25773225
__ambiguous     141
__too_low_aQual 0
__not_aligned   4975774
__alignment_not_unique  73371

This is the code I have used:

htseq-count -t transcript -i gene_id -s y /path/to/bam /path/to/gtf -c counts.csv

Thanks!!

ADD REPLY
0
Entering edit mode

this looks more like a FASTA file than a gff file. While those fasta lines are allowed in the GFF3 format: https://gmod.org/wiki/GFF3#GFF3_Sequence_Section they are quite uncommon.

ADD REPLY
0
Entering edit mode

HI Pierre,

Thank you! I am not too familiar with the gff file format, but I have now filtered out the FASTA data which was at the end of the fasta file, as the first ~3000 lines contained the right info (see below)

.47320_2_326.1  Prodigal:002006 CDS 207 524 .   -   0   ID=47320_2#326

I have now run the code with the new gff file, and while I don't get the error, I get now that most reads fall into the "no_feature" category, as I get with the gtf..

ADD REPLY
1
Entering edit mode

Why is there a . (period) at the beginning of each name. That is likely not a good thing.

ADD REPLY
0
Entering edit mode

It was just the way the genome was called, I think. I wasn't the person assembling and annotating the genome, so I could ask. Do you think that could be creating problems?

ADD REPLY
0
Entering edit mode

I have re-run hisat2 and htseq after removing the period at the beginning of each name, and that didn't change the output unfortunately.

ADD REPLY
0
Entering edit mode

In your command, "path/to" ... they actually end with .bam and .gtf right? it isn't just the path, but the file also at the end of it.

htseq-count -t transcript -i gene_id -s y /path/to/**.bam** /path/to/**.gtf** -c counts.csv
ADD REPLY
0
Entering edit mode

Hi! Yes, they are actual .bam and .gtf files

ADD REPLY
0
Entering edit mode
BAM/SAM

.47320_2_326.1  1293343 2500979 0
.47320_2_326.2  389930  614450  0
.47320_2_326.3  278970  234496  0
.47320_2_326.4  231691  394048  0

How did you get that output from BAM/SAM file?

ADD REPLY
0
Entering edit mode

Hi! I got that running samtools idxstats to get the chromosome names and compare them to the .gtf file.

ADD REPLY

Login before adding your answer.

Traffic: 1673 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6