I am getting strange errors from htseq-count when analyzing bam files created by STAR. They are sorted by coordinate (STAR's default output). I am analysing about 18 bam files, and I get the error for about 40% of them.
I get different errors depending on whether I read the bam in directly or pipe it in via samtools view. Here are my examples:
htseq-count -t exon -s no -m union -r pos -f bam mapped_reads/sample1.bam genome.gtf > htseq_count/sample1.count
27414 GFF lines processed.
Error occured when processing SAM input (record #17507 in file mapped_reads/sample1.bam): 'ascii' codec can't decode byte 0xa7 in position 2: ordinal not in range(128) [Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]
Here is line 17507 from sample1.bam:
FCC4TF9ACXX:8:1308:20606:31534# 403 PRELSG_01_v1 293 1 75M = 215 -153 TGTGACAGTAATCCAACTTGGTACGAGAGGATTAGTTGGTTCAGACAATTGGTACAGCAATTGGTTGACAAACCA bb`bb_cccbccb_]T^Z]Z`\Z]eceec`bVgb^`cbd_]eebefffebaaecec]ffaZffhdddgbbechge NH:i:3 HI:i:2 AS:i:144 nM:i:0 MD:Z:75
I can't see anything strange in that line.
Here is the error I get when I pipe in via samtools:
samtools view mapped_reads/sample1.bam | htseq-count -t exon -s no -m union -r pos - genome.gtf > htseq_count/sample1.count
27414 GFF lines processed.
Error occured when processing SAM input (line 17478): 'utf-8' codec can't decode byte 0xa7 in position 8038: invalid start byte [Exception type: UnicodeDecodeError, raised in codecs.py:321]
And here is line 17478:
FCC4TF9ACXX:8:1314:2076:30660# 339 PRELSG_01_v1 287 0 75M = 286 -76 TAGTATTGTGACAGTAATCCAACTTGGTACGAGAGGATTAGTTGGTTCAGACAATTGGTACAGCAATTGGTTGAC aaa`_Ta_]Yaaaaaa``_Z__ZGUZZGTFb^\cdd_Y\VVeeeba\MNXNXI[hfeaXOXSa[e[cca^[efae NH:i:8 HI:i:7 AS:i:143 nM:i:0 MD:Z:75
Can anyone figure this out?
P.S. I realize STAR will output gene counts, but it assumes strandedness, which my data does not have, unfortunately.
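For what it's worth, the exception itself just means a byte outside the ASCII range turned up where htseq expects plain ASCII. A minimal reproduction (the byte string here is made up, not taken from my bam):

```python
# Illustration only: a single byte >= 0x80, such as the 0xa7 reported
# above, cannot be decoded as ASCII, which is exactly the
# UnicodeDecodeError that htseq-count raises.
try:
    b"AS\xa7i:144".decode("ascii")  # hypothetical fragment with a stray 0xa7
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xa7 in position 2: ordinal not in range(128)
```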
EDIT: After converting all my fastq files to ASCII-33 encoding using reformat.sh, I reran the pipeline. Many of the new bam files still give me errors. I will post an example here.
Error occured when processing SAM input (record #166582 in file mapped_reads/sample1.bam): 'ascii' codec can't decode byte 0x82 in position 2: ordinal not in range(128) [Exception type: UnicodeDecodeError, raised in libcutils.pyx:134]
Here are the first 2 records of that bam file:
FCC4MT3ACXX:7:1205:3160:39383# 419 PRELSG_01_v1 2 1 5S70M = 85 158 ATGGGTATCGATCCTTTATATTTGCAAAATTACGGAATTTATTGCTTACTGTGCATATAGAGGTGTCTGAAAAGT G:??E:CCDGH;FGGG?BGF?**9<?EHDBFHIGA=GC=CECC)73=A@;.;;@C@CCDCCD@(5553:@CDC## NH:i:3 HI:i:3 AS:i:141
nM:i:1 MD:Z:38T31
FCC4MT3ACXX:7:2106:14022:53098# 419 PRELSG_01_v1 2 1 5S70M = 17 90 ATGGGTATCGATCCTTTATATTTGCAAAATTACGGAATTTATTTCTTACTGTGCATATAGAGGTGTCTGAAAAGT IIIJJEGGJJJJJJJJJGIJIJIJJGIJJJJJJJJIJJIJJJHHHHHHHFFDFFEFCEEEEEE;?ACCDC@CCC: NH:i:3 HI:i:3 AS:i:143
nM:i:0 MD:Z:70
I can't see any illegal characters...
EDIT: Here is record #166582
FCC4MT3ACXX:7:1115:11607:25097# 339 PRELSG_05_v1 617015 1 66M9S = 616991 -90 TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCACAGGACGA DDDDD@DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDBDDDDDDDDDDDFHJJJJJIJJIIGJJJIHJJJ NH:i:4 HI:i:2 AS:i:139 nM:i:0 MD:Z:66
EDIT: As a new user, I am rate limited to 5 posts per hour, which is why I'm replying to your comment here. It is a Plasmodium relictum genome from PlasmoDB. And I can't explain these oddities in the bam file. Especially since most of the samples work fine in htseq-count.
It was an old MiSeq platform, so I'll do as you say and let you know.
Please see my edit :-)
I need this record, please:
My mistake. Got the position and record mixed up. Here ya go:
What was your command line? And which software version?
htseq version 0.11.0
Try removing unmapped reads and non-primary alignments:
samtools view -f 4 -F 256 input.bam

I get an empty result with that. I don't think STAR produces bam files with unmapped reads using the default parameters.
Oops!
I did that. The resulting bam file is about a third smaller, but htseq-count produced the same "ordinal not in range" error. Here is the corresponding record in the new bam file:
This is weird
You have a huge space between
AS:i:143
and nM:i:1
Plus the 'n' in 'nM' should be uppercase
What is your data ? What is
PRELSG_05_v1
?

It is a Plasmodium relictum genome from PlasmoDB. And I can't explain these oddities in the bam file, especially since most of the samples work fine in htseq-count. When I view the bam file in less (piped in from samtools view), all the whitespace characters are tabs, but not all tabs render at the same width. You can see this for yourself by searching for a tab character in less (but don't search for \t, as that will just do a literal search for that string).
It is odd that the QUAL field has a lower value character in it. Can you show me the specifications for that field?
EDIT: I checked the bam files for which htseq-count works. They have the variable-length tabs and nM in the QUAL fields.
P.S. As a new user with low reputation, I am limited to 5 posts per 6 hours, which is why I didn't reply to your last message until now. If I hit that limit again, I'll try help you out by editing my main question.
Did you get htseq-count output results for some of the bams in your experiment?
What do you mean by
has a lower value character
? Put an example, please.

How many bams are you struggling with? And how many succeeded the htseq-count process?
I meant lower case, not lower value..
I have 18 bam files, 10 of them work without a problem. Just some info about the nature of the source data: The paired-end reads in the FASTQ files are trimmed.
Be aware!
This is Quality:
These are TAGs:
And neither Quality nor TAGs can have lower-case characters (except the
i
for TAGs). This is why the
nM:i:1
is very strange to me.
Bam files with
nM:i:1
(in lower-case characters) succeeded with htseq-count?

Ah yes, my mistake. I misread the description of the SAM fields.
Yup, those bam files worked in htseq-count and htseq-qa (qa can only accept sam files, so I converted them first). Here's an example line from a bam file that worked:
Did you move these bad bams from Windows to Linux or vice versa?
You could try to focus on one "bad" bam.
For each error you get, extract the alignment, copy it into a text file, and re-run until your bam works or until you have enough alignments to compare them closely (5 to 10 alignments).
Look at the number of tabs, spaces, spaces that are not spaces (non-breaking space, ASCII 160), upper case, lower case...
Try to upload this alignment text file with 5 to 10 alignments and put the link here, please.
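If it helps, here is a small Python sketch of that byte-level check; the demo file and the record in it are made up:

```python
import tempfile

# Sketch: report every non-ASCII byte (value > 127) in a file,
# as (line number, byte position, byte value).
def find_non_ascii(path):
    hits = []
    with open(path, "rb") as fh:
        for lineno, line in enumerate(fh, 1):
            for pos, byte in enumerate(line):
                if byte > 127:
                    hits.append((lineno, pos, byte))
    return hits

# Demo on a tiny made-up file containing a stray 0xa7 byte.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"clean line\n" + b"AS:i:144\xa7nM:i:0\n")

for lineno, pos, byte in find_non_ascii(tmp.name):
    print(f"line {lineno}, position {pos}: byte {hex(byte)}")
```

Run it on an extracted SAM text file and it will point you straight at any byte that cannot be ASCII, including the non-breaking space.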
The bam files were only ever worked on in Linux, using Linux tools.
I wrote a python script that printed the line number of the first line that produces an error. https://pastebin.com/TEki6jhe This produces exactly the same error as htseq-count if I don't catch the exception. When I examined the offending line, I couldn't detect the problem by eye.
But then I used tr to remove ALL non-ASCII characters:
This new file is 54 bytes smaller than the original, and is parsed by my script and htseq-count without error. Although this doesn't explain the cause of the error, it does seem to fix it. What do you think? If you agree, I'll add this as the answer.
P.S. I got the tr trick from https://alvinalexander.com/linux-unix/how-to-use-unix-tr-command-filter-remove-extended-ascii-characters-files
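In case it helps anyone later, here is a Python equivalent of that tr cleanup. This is a sketch, not my actual script: it streams the SAM text as raw bytes and drops everything above 0x7f, returning the number of bytes removed so you can sanity-check against the file-size difference.

```python
# Sketch: strip every byte outside the ASCII range (> 0x7f) from a SAM
# text stream -- the same idea as the tr command above, done in Python.
# Returns how many bytes were removed.
def strip_non_ascii(in_path, out_path):
    removed = 0
    with open(in_path, "rb") as src, open(out_path, "wb") as dst:
        for line in src:
            cleaned = bytes(b for b in line if b < 128)
            removed += len(line) - len(cleaned)
            dst.write(cleaned)
    return removed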
If you can't see the non ascii characters by eye, it's propably the non-breaking line space as ascii 160. Try your
tr
command and only remove the character 160, to see if you got multiple non ascii character or just this oneWhat is the octal encoding for ASCII 160 in octal? Cause my tr command requires octal specifications. I assumed 160 is decimal, and according to http://ascii-table.com/calculator.php , 160 dec encodes to 240 octal.
I deleted all of those characters using
but the resulting file was identical. I think there must be other invisible ASCII characters, or I converted from 160 to 240 incorrectly.
I think that with this command :
You removed what is not in 1-176
Yes, but keep in mind those are octal ascii codes, not decimal ascii codes. Removing only 160 (which is 240 octal) did not make any difference to the file.
Try this on a single sam result that failed and compare the two lines
That didn't reveal anything. I think the invisible character is lost when I extract the lines. But then I did a diff of the original and cleaned files:
This reveals the rather odd <82> near the end of the affected lines. Sometimes it even repeats. You can get the whole thing here : https://pastebin.com/dwkr4THJ . In my text editor and browser, it is represented as unprintable character: �
Very strange. I'd love to know why STAR created these.
Once they're "cleaned" I get much better compression ratios with my bam files.
P.S. Yes, I actually typed a '<' then '82' and then '>'. That's not a strange html rendering issue.