Hello,
I am trying to run htseq-count on bam files against a reference bacterial genome in gtf format, downloaded from Ensembl
The RNA-Seq reads were single-end fastq files, I performed BWA-MEM using Galaxy which generated the bam files. I ran flagstat and got a high percentage of mapped reads, and visualized the alignment using IGV. If I am not wrong, I believe the files are already indexed/sorted by Galaxy. I downloaded the .bam (and associated .bai files to the same directory), then used these files to perform ht-seq against my reference genome (I downloaded the gtf file, and made sure chromosome identifiers matched, and removed the header as I saw previously suggested in other answers here).
This is the first few lines of my reference genome.gtf
NC_008463.1 PseudoCAP CDS 483 2027 . + 0 gene_id "A0K_RS00005"; transcript_id "5841281"; locus_tag "A0K_RS00005"; name "chromosomal replication initiator protein DnaA"; replicon_xref "NZ_AAQW01000001"
NC_008463.1 PseudoCAP CDS 2056 3159 . + 0 gene_id "A0K_RS00010"; transcript_id "50342"; locus_tag "A0K_RS00010"; name "DNA polymerase III subunit beta"; replicon_xref "NZ_AAQW01000001"
NC_008463.1 PseudoCAP CDS 3169 4278 . + 0 gene_id "A0K_RS00015"; transcript_id "50345"; locus_tag "A0K_RS00015"; name "recombinase RecF"; replicon_xref "NZ_AAQW01000001"
This is the first few lines of one of my BWA-MEM generated .bam files
@HD VN:1.3 SO:coordinate
@SQ SN:NC_008463.1 LN:6537648
@RG ID:A30_SRR2288354.fastq SM:A30_SRR2288354.fastq PL:ILLUMINA LB:SRP062593vPA14 DS:BWA-MEM on SRP062593vPA14 DT:2018-08-13
@PG ID:bwa PN:bwa VN:0.7.17-r1188 CL:bwa mem -t 6 -v 1 -R @RG\tID:A30_SRR2288354.fastq\tSM:A30_SRR2288354.fastq\tPL:ILLUMINA\tLB:SRP062593vPA14\tDS:BWA-MEM on SRP062593vPA14\tDT:2018-08-13 localref.fa /galaxy-repl/main/files/026/475/dataset_26475523.dat
SRR2288354.45585072 2048 NC_008463.1 1 60 67H33M * 0 0 TTTAAAGAGACCGGCGATTCTAGTGAAATCGAA DDCB@@C@CDDDDDDDBDDA>:@ACC>@>:?<C NM:i:0 MD:Z:33 AS:i:33 XS:i:0 RG:Z:A30_SRR2288354.fastq SA:Z:NC_008463.1,6537582,+,67M33S,60,0;
SRR2288354.45612256 0 NC_008463.1 1 60 33S47M * 0 0 CTGGGTTGACGACTTGAGGTCGCAGTGACCCCGTTTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGCAGGTCAATT CCCFFFFFGHHGHJIJJJJFHIIJIGHIIJIIIGHHIJJJJIIIJJJHFDDDCCDDCCDEDDDDDDDDBBDDBDACCCDE NM:i:0 MD:Z:47 AS:i:47 XS:i:0 RG:Z:A30_SRR2288354.fastq SA:Z:NC_008463.1,6537616,+,33M47S,60,0;
SRR2288354.2322100 16 NC_008463.1 1 60 21S60M * 0 0 CTTGAGGTCGCAGTGACCCCGTTTAAAGAGACCGGCGATTCTAGTGAAATCGAACGGGCAGGTCAATTTCCAACCAGCGAT DDDDDDDDDDDDDDDDDDDFFFFHHHHHHIIJJJJHJIIJJJJJJJJJJJJJJJJJJJJIIIIJJIJJHHHHHFFFFFBBB NM:i:0 MD:Z:60 AS:i:60 XS:i:0 RG:Z:A30_SRR2288354.fastq
The MAPQ scores of all the reads are 60. The FLAGs alternate between 0, 16, 2048 and 2064. The POS increases as you move down the file, however there are multiple reads that have the same POS value.
Below is my attempt at performed htseq-count, and the error I received
(ProjectLearn) Maxwells-iMac:Alexa maxwelllab$ htseq-count -m union -s no -r pos -t CDS -i gene_id --nonunique all -f bam /Users/Alexa/Alexa/SRP062593_sra/SRP062593_htseq\ /SRP062593_A30_2.bam /Users/Alexa/Alexa/SRP062593_sra/SRP062593_htseq\ /PA14_reference_NC_2.gtf > SRP062593_A30_2.counts
Error occured when processing GFF file (line 8700 of file /Users/Alexa/Alexa/SRP062593_sra/SRP062593_htseq /PA14_reference_NC_2.gtf):
'utf-8' codec can't decode byte 0xa0 in position 1259: invalid start byte
[Exception type: UnicodeDecodeError, raised in codecs.py:321]
I am very new to bioinformatics and would appreciate any advice as to how to move forward. I am not sure why it reads "error when processing GFF file" as I input a GTF file (which I read is recommended for htseq-count). Is there something I am missing to define the reference genome as GTF? Also, in regards to the UnicodeDecode error, I know the default is utf-8, should I be using something else to run htseq-count? Is there additional filtering steps of the bam files that I am missing?
Thank you!
Alexa
There's probably a special character on line 8700 of the GTF file, have a look around there.
I used linecache to print that line (8700), as well as the line just before to see if it is reading the file correctly however I only get a " symbol
When I count the lines in the file, it actually only goes up to 8699 then gives me the same error I saw with htseq-count
No one has tested anything in python 3.7, try using 3.6 or another non-bleeding edge version.
I am using python 3.6