Hi,
I have been trying to analyse RNA-seq data from a bacterial isolate for which I have the genome sequenced. I have used hisat2 for the alignment and am now trying to get count data using HTseq, however, I seem to have problems with the gtf/gff file.
I have started by using the gtf file, but I get the error:
Warning: No features of type '{args.feature_type}' found.
The alignment file has no chromosomes in common with the GFF/GTF file.
I have checked that the chromosomes (in this case they would be different contigs) are called the same in the bam and gtf file, and to me it looks like they do (see below for reference)
GTF:
.47320_2_326.1 GenBank transcript 207 524 . - . transcript_id "47320_2#326.contigs_spades_00001"; gene_id "47320_2#326.contigs_spades_00001"
.47320_2_326.1 GenBank CDS 207 524 . - 1 transcript_id "47320_2#326.contigs_spades_00001";
.
47320_2_326.1 GenBank transcript 849 1991 . + . transcript_id "47320_2#326.contigs_spades_00002"; gene_id "47320_2#326.contigs_spades_00002"
BAM/SAM
.47320_2_326.1 1293343 2500979 0
.47320_2_326.2 389930 614450 0
.47320_2_326.3 278970 234496 0
.47320_2_326.4 231691 394048 0
Since this was not working, I have also tried using the gff file, but I get the following error:
Error processing GFF file (line 3049 of file 47320_2#326.contigs_spades.gff):
not enough values to unpack (expected 9, got 1)
[Exception type: ValueError, raised in features.py:138]
I am no expert in bioinformatics, so any help would be very much appreciated!
please, show us the output of
Hi Pierre,
Thanks for the help! I get this:
I have also changed the code used with the gtf file and while now I don't get that error anymore, the new issue is that pretty much all my reads fall now in the no_feature category.
This is the code I have used:
Thanks!!
this looks more like a FASTA file than a gff file. While those fasta lines are allowed in the GFF3 format: https://gmod.org/wiki/GFF3#GFF3_Sequence_Section they are quite uncommon.
HI Pierre,
Thank you! I am not too familiar with the gff file format, but I have now filtered out the FASTA data which was at the end of the fasta file, as the first ~3000 lines contained the right info (see below)
I have now run the code with the new gff file, and while I don't get the error, I get now that most reads fall into the "no_feature" category, as I get with the gtf..
Why is there a
.
(period) at the beginning of each name. That is likely not a good thing.It was just the way the genome was called, I think. I wasn't the person assembling and annotating the genome, so I could ask. Do you think that could be creating problems?
I have re-run hisat2 and htseq after removing the period at the beginning of each name, and that didn't change the output unfortunately.
In your command, "path/to" ... they actually end with .bam and .gtf right? it isn't just the path, but the file also at the end of it.
Hi! Yes, they are actual .bam and .gtf files
How did you get that output from BAM/SAM file?
Hi! I got that running samtools idxstats to get the chromosome names and compare them to the .gtf file.