I am getting strange errors from htseq-count when analyzing bam files created by STAR. They are sorted by coordinate (STAR's default output). I am analysing about 18 bam files, and I get the error for about 40% of them.
I get different errors depending on whether I read the bam in directly or pipe it in via samtools view. Here are my examples:
htseq-count -t exon -s no -m union -r pos -f bam mapped_reads/sample1.bam genome.gtf > htseq_count/sample1.count
27414 GFF lines processed.
Error occured when processing SAM input (record #17507 in file mapped_reads/sample1.bam): 'ascii' codec can't decode byte 0xa7 in position 2: ordinal not in range(128) [Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]
Here is line 17507 from sample1.bam:
FCC4TF9ACXX:8:1308:20606:31534# 403 PRELSG_01_v1 293 1 75M = 215 -153 TGTGACAGTAATCCAACTTGGTACGAGAGGATTAGTTGGTTCAGACAATTGGTACAGCAATTGGTTGACAAACCA bb`bb_cccbccb_]T^Z]Z`\Z]eceec`bVgb^`cbd_]eebefffebaaecec]ffaZffhdddgbbechge NH:i:3 HI:i:2 AS:i:144 nM:i:0 MD:Z:75
I can't see anything strange in that line.
Here is the error I get when I pipe in via samtools:
samtools view mapped_reads/sample1.bam | htseq-count -t exon -s no -m union -r pos - genome.gtf > htseq_count/sample1.count
27414 GFF lines processed.
Error occured when processing SAM input (line 17478): 'utf-8' codec can't decode byte 0xa7 in position 8038: invalid start byte [Exception type: UnicodeDecodeError, raised in codecs.py:321]
And here is line 17478:
FCC4TF9ACXX:8:1314:2076:30660# 339 PRELSG_01_v1 287 0 75M = 286 -76 TAGTATTGTGACAGTAATCCAACTTGGTACGAGAGGATTAGTTGGTTCAGACAATTGGTACAGCAATTGGTTGAC aaa`_Ta_]Yaaaaaa``_Z__ZGUZZGTFb^\cdd_Y\VVeeeba\MNXNXI[hfeaXOXSa[e[cca^[efae NH:i:8 HI:i:7 AS:i:143 nM:i:0 MD:Z:75
Can anyone figure this out?
P.S. I realize STAR will output gene counts, but it assumes strandedness, which my data does not have, unfortunately.
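For what it's worth, the exception itself just means a byte outside the ASCII range turned up where htseq expects plain ASCII. A minimal reproduction (the byte string here is made up, not taken from my bam):

```python
# Illustration only: a single byte >= 0x80, such as the 0xa7 reported
# above, cannot be decoded as ASCII, which is exactly the
# UnicodeDecodeError that htseq-count raises.
try:
    b"AS\xa7i:144".decode("ascii")  # hypothetical fragment with a stray 0xa7
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xa7 in position 2: ordinal not in range(128)
```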
EDIT: After converting all my fastq files to ASCII-33 encoding using reformat.sh, I reran the pipeline. Many of the new bam files still give me errors. I will post an example here.
Error occured when processing SAM input (record #166582 in file mapped_reads/sample1.bam): 'ascii' codec can't decode byte 0x82 in position 2: ordinal not in range(128) [Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]
Here are the first 2 records of that bam file:
FCC4MT3ACXX:7:1205:3160:39383# 419 PRELSG_01_v1 2 1 5S70M = 85 158 ATGGGTATCGATCCTTTATATTTGCAAAATTACGGAATTTATTGCTTACTGTGCATATAGAGGTGTCTGAAAAGT G:??E:CCDGH;FGGG?BGF?**9<?EHDBFHIGA=GC=CECC)73=A@;.;;@C@CCDCCD@(5553:@CDC## NH:i:3 HI:i:3 AS:i:141
nM:i:1 MD:Z:38T31
FCC4MT3ACXX:7:2106:14022:53098# 419 PRELSG_01_v1 2 1 5S70M = 17 90 ATGGGTATCGATCCTTTATATTTGCAAAATTACGGAATTTATTTCTTACTGTGCATATAGAGGTGTCTGAAAAGT IIIJJEGGJJJJJJJJJGIJIJIJJGIJJJJJJJJIJJIJJJHHHHHHHFFDFFEFCEEEEEE;?ACCDC@CCC: NH:i:3 HI:i:3 AS:i:143
nM:i:0 MD:Z:70
I can't see any illegal characters...
EDIT: Here is record #166582
FCC4MT3ACXX:7:1115:11607:25097# 339 PRELSG_05_v1 617015 1 66M9S = 616991 -90 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCACAGGACGA DDDDD@DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBDDDDDDDDDDDFHJJJJJIJJIIGJJJIHJJJ NH:i:4 HI:i:2 AS:i:139 nM:i:0 MD:Z:66
EDIT: As a new user, I am rate limited to 5 posts per hour, which is why I'm replying to your comment here. It is a Plasmodium relictum genome from PlasmoDB. And I can't explain these oddities in the bam file. Especially since most of the samples work fine in htseq-count.
It was an old MiSeq platform, so I'll do as you say and let you know.
Please see my edit :-)
I need this record, please:
My mistake. Got the position and record mixed up. Here ya go:
What was your command line? And which software version?
htseq version 0.11.0
Try removing unmapped reads and non-primary alignments:
samtools view -f 4 -F 256 input.bam

I get an empty result with that. I don't think STAR produces bam files with unmapped reads using the default parameters.
Oops!
I did that. The resulting bam file is about a third smaller, but htseq-count produced the same "ordinal not in range" error. Here is the corresponding record in the new bam file:
This is weird
You have a huge space between
AS:i:143
and nM:i:1
Plus the 'n' in 'nM' should be uppercase
What is your data ? What is
PRELSG_05_v1
?

It is a Plasmodium relictum genome from PlasmoDB. And I can't explain these oddities in the bam file, especially since most of the samples work fine in htseq-count. When I view the bam file in less (piped in from samtools view), all the whitespace characters are tabs, but not all tabs render at the same width. You can see this for yourself by searching for a tab character in less (but don't search for \t, as that will just do a literal search for that string).
It is odd that the QUAL field has a lower value character in it. Can you show me the specifications for that field?
EDIT: I checked the bam files for which htseq-count works. They have the variable-length tabs and nM in the QUAL fields.
P.S. As a new user with low reputation, I am limited to 5 posts per 6 hours, which is why I didn't reply to your last message until now. If I hit that limit again, I'll try help you out by editing my main question.
Did you get htseq-count output results for some of the bams in your experiment?
What do you mean by
has a lower value character
? Put an example, please.

How many bams are you struggling with? And how many succeeded the htseq-count process?
I meant lower case, not lower value..
I have 18 bam files, 10 of them work without a problem. Just some info about the nature of the source data: The paired-end reads in the FASTQ files are trimmed.
Be aware!
This is Quality:
These are TAGs:
And neither Quality nor TAGs can have lower-case characters (except the
i
for TAGs). This is why the
nM:i:1
is very strange to me.
Bam files with
nM:i:1
(in lower-case characters) succeeded with htseq-count?

Ah yes, my mistake. I misread the description of the SAM fields.
Yup, those bam files worked in htseq-count and htseq-qa (qa can only accept sam files, so I converted them first). Here's an example line from a bam file that worked:
Did you move these bad bams from Windows to Linux or vice versa?
You could try to focus on one "bad" bam.
For each error you get, extract the alignment, copy it into a text file, and re-run until your bam works or until you have enough alignments to compare them closely (5 to 10 alignments).
Look at the number of tabs, spaces, spaces that are not spaces (non-breaking space, ASCII 160), upper case, lower case...
Try to upload this alignment text file with 5 to 10 alignments and put the link here, please.
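If it helps, here is a small Python sketch of that byte-level check; the demo file and the record in it are made up:

```python
import tempfile

# Sketch: report every non-ASCII byte (value > 127) in a file,
# as (line number, byte position, byte value).
def find_non_ascii(path):
    hits = []
    with open(path, "rb") as fh:
        for lineno, line in enumerate(fh, 1):
            for pos, byte in enumerate(line):
                if byte > 127:
                    hits.append((lineno, pos, byte))
    return hits

# Demo on a tiny made-up file containing a stray 0xa7 byte.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"clean line\n" + b"AS:i:144\xa7nM:i:0\n")

for lineno, pos, byte in find_non_ascii(tmp.name):
    print(f"line {lineno}, position {pos}: byte {hex(byte)}")
```

Run it on an extracted SAM text file and it will point you straight at any byte that cannot be ASCII, including the non-breaking space.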
The bam files were only ever worked on in Linux, using Linux tools.
I wrote a python script that printed the line number of the first line that produces an error. https://pastebin.com/TEki6jhe This produces exactly the same error as htseq-count if I don't catch the exception. When I examined the offending line, I couldn't detect the problem by eye.
But then I used tr to remove ALL non-ASCII characters:
This new file is 54 bytes smaller than the original, and is parsed by my script and htseq-count without error. Although this doesn't explain the cause of the error, it does seem to fix it. What do you think? If you agree, I'll add this as the answer.
P.S. I got the tr trick from https://alvinalexander.com/linux-unix/how-to-use-unix-tr-command-filter-remove-extended-ascii-characters-files
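In case it helps anyone later, here is a Python equivalent of that tr cleanup. This is a sketch, not my actual script: it streams the SAM text as raw bytes and drops everything above 0x7f, returning the number of bytes removed so you can sanity-check against the file-size difference.

```python
# Sketch: strip every byte outside the ASCII range (> 0x7f) from a SAM
# text stream -- the same idea as the tr command above, done in Python.
# Returns how many bytes were removed.
def strip_non_ascii(in_path, out_path):
    removed = 0
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for line in src:
            cleaned = bytes(b for b in line if b < 128)
            removed += len(line) - len(cleaned)
            dst.write(cleaned)
    return removed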
If you can't see the non ascii characters by eye, it's propably the non-breaking line space as ascii 160. Try your
tr
command and only remove the character 160, to see if you got multiple non ascii character or just this oneWhat is the octal encoding for ASCII 160 in octal? Cause my tr command requires octal specifications. I assumed 160 is decimal, and according to http://ascii-table.com/calculator.php , 160 dec encodes to 240 octal.
I deleted all of those characters using
but the resulting file was identical. I think there must be other invisible ASCII characters, or I converted from 160 to 240 incorrectly.
I think that with this command :
You removed what is not in 1-176
Yes, but keep in mind those are octal ascii codes, not decimal ascii codes. Removing only 160 (which is 240 octal) did not make any difference to the file.
Try this on a single sam result that failed and compare the two lines
That didn't reveal anything. I think the invisible character is lost when I extract the lines. But then I did a diff of the original and cleaned files:
This reveals the rather odd <82> near the end of the affected lines. Sometimes it even repeats. You can get the whole thing here : https://pastebin.com/dwkr4THJ . In my text editor and browser, it is represented as unprintable character: �
Very strange. I'd love to know why STAR created these.
Once they're "cleaned" I get much better compression ratios with my bam files.
P.S. Yes, I actually typed a '<' then '82' and then '>'. That's not a strange html rendering issue.