Reading vcf file using python gives UnicodeDecodeError
5.8 years ago
Medhat 9.8k

python version Python 3.5.4

Trying to process vcf file as shown in the code below:

with open(myfile, 'r') as datain:
    for myline in datain:
        if myline.startswith("#"):
            continue  # skip header lines
        # do something with the record line

After processing a couple of hundred lines, it raises this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 8077: invalid start byte

Any idea how to fix this error?

python vcf • 5.7k views

This SE answer is related, but it deals with a Python 2 encoding problem. Python 3 behaviour is fairly different in this area.

As you mentioned, it is already decoded.

@Kevin Blighe If I use code like myline = myline.encode('utf-8').strip(), it gives this error:

TypeError: startswith first arg must be bytes or a tuple of bytes, not str

I could then use str(), but that would not fix the main issue.
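
The TypeError above arises because, once a line is a bytes object, the prefix passed to startswith has to be bytes as well. A minimal sketch, using an invented sample line rather than anything from the file in question:

```python
# The sample line is made up for illustration.
text_line = "#CHROM\tPOS\tID\n"
bytes_line = text_line.encode("utf-8").strip()

# str.startswith takes a str prefix; bytes.startswith takes a bytes prefix.
is_header_text = text_line.startswith("#")
is_header_bytes = bytes_line.startswith(b"#")

try:
    bytes_line.startswith("#")  # str prefix on a bytes object
    mixed_types_ok = True
except TypeError:
    mixed_types_ok = False

print(is_header_text, is_header_bytes, mixed_types_ok)
```

This only removes the TypeError; it does not explain where the undecodable byte comes from.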

Are you using a European locale setting by any chance? What does the locale command print? You may need to force the locale to be UTF-8. See this: https://unix.stackexchange.com/questions/87745/what-does-lc-all-c-do
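
To see what the locale means for Python itself, something like the following sketch may help. It assumes that open() in text mode falls back to the locale's preferred encoding when no encoding is passed, which is what the Python 3 documentation describes; the helper name is made up here.

```python
import locale
import sys


def default_text_encoding():
    """Return the encoding the locale reports as preferred.

    Assumption: open() uses this value when no encoding argument is given.
    """
    return locale.getpreferredencoding(False)


print("preferred encoding:", default_text_encoding())
print("filesystem encoding:", sys.getfilesystemencoding())
```

If this already prints UTF-8, changing the locale is unlikely to make the error go away.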

The error message shows that Python is using UTF-8 to decode. The question is what encoding the input actually uses.
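
One way to approach that question is to read the file as raw bytes and report which lines fail to decode. This is only a sketch; find_undecodable_lines is a hypothetical helper, not part of any library.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Yield (line_number, offset_in_line, raw_line) for lines that fail to decode."""
    with open(path, "rb") as handle:
        for line_number, raw_line in enumerate(handle, start=1):
            try:
                raw_line.decode(encoding)
            except UnicodeDecodeError as exc:
                # exc.start is the offset of the first bad byte within this line
                yield line_number, exc.start, raw_line


# Hypothetical usage:
# for number, offset, raw in find_undecodable_lines("someresult.vcf"):
#     print(number, offset, raw[max(0, offset - 20):offset + 20])
```

Looking at the bytes around the reported offset usually shows whether the problem is a stray character in a text field or a larger run of binary garbage.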

5.8 years ago
e.benn ▴ 110

Python 3.0+ decodes files opened in text mode by default.

https://docs.python.org/3/library/functions.html#open

"the contents of the file are returned as str, the bytes having been first decoded using a platform-dependent encoding or using the specified encoding if given."

Apparently your file is not encoded in UTF-8. My guess is that it is either gzipped or in some other encoding. What happens if you run file myfile.vcf? Does it report gzip compressed data? Does it report the encoding?
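
If the file program is not available, the gzip part of that check can be approximated from Python by looking at the first two bytes. A minimal sketch, where looks_gzipped is a made-up helper name:

```python
GZIP_MAGIC = b"\x1f\x8b"  # the two-byte signature at the start of gzip data


def looks_gzipped(path):
    """Return True if the file starts with the gzip signature."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


# Hypothetical usage:
# print(looks_gzipped("myfile.vcf"))
```

Files written by bgzip start with the same two bytes, so this check does not distinguish plain gzip from BGZF.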

Try:

import gzip

# read as binary (~ascii) data, like Python 2; each line is a bytes object
with open(myfile, 'rb') as datain:
    ...

# specify the actual encoding
with open(myfile, 'r', encoding="ascii") as datain:
    ...

# the VCF is gzipped; 'rt' decodes to str (the default mode is binary)
with gzip.open(myfile, 'rt') as datain:
    ...

Actually, it is not a gzipped file. If I read it as binary ('rb'), I need to convert each line back to a string.
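
Converting back to a string can be done line by line while tolerating bad bytes. The sketch below assumes it is acceptable to substitute the Unicode replacement character (U+FFFD) for anything undecodable; iter_vcf_records is a hypothetical helper.

```python
def iter_vcf_records(path, encoding="utf-8", errors="replace"):
    """Yield non-header lines as str; undecodable bytes become U+FFFD."""
    with open(path, "rb") as handle:
        for raw_line in handle:
            if raw_line.startswith(b"#"):
                continue  # skip header lines
            yield raw_line.decode(encoding, errors=errors).rstrip("\n")


# Hypothetical usage:
# for record in iter_vcf_records("someresult.vcf"):
#     fields = record.split("\t")
```

Note that this hides the bad bytes rather than explaining them, so it is best used while investigating, not as a final fix.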

The VCF spec just states 'text' encoding (https://samtools.github.io/hts-specs/VCFv4.2.pdf). Most likely you have ASCII. Specify ascii as the encoding parameter to open. There is some chance you don't have ASCII if the file has been moved between platforms. What does the file program say?

The file program reports data. Also, the loop does not fail on the first line; it reads a couple of hundred lines and then raises that error.

When using with open(myfile, 'r', encoding="ascii") as datain: the error changes to:

UnicodeDecodeError: 'ascii' codec can't decode byte 0x9d in position 8077: ordinal not in range(128)
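
Both messages are consistent with a single stray byte: 0x9d is above the ASCII range, and in UTF-8 it can only appear as a continuation byte, never at the start of a character. A small sketch comparing codecs on that one byte (can_decode is a made-up helper):

```python
bad = b"\x9d"


def can_decode(data, encoding):
    """Return True if data decodes under the given encoding without error."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


results = {enc: can_decode(bad, enc) for enc in ("ascii", "utf-8", "latin-1")}
print(results)  # {'ascii': False, 'utf-8': False, 'latin-1': True}
```

latin-1 accepts every byte value, so success there does not show that the file really is latin-1. It only makes the file readable so the offending line can be inspected.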

5.8 years ago
Medhat 9.8k

So I managed to work around the problem, but did not find its cause.

The original file was the result of running bcftools merge A.vcf B.vcf C.vcf > someresult.vcf, and that gave me the error above.

But with vcf-merge A.vcf B.vcf C.vcf > someresult.vcf the error went away.

This does not really answer my question, but it fixed the issue.

Always specify the output format with bcftools; it screws you over otherwise.

I tried your suggestion using:

-O z -o result.vcf.gz

But I still get:

(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9d in position 8116: invalid start byte

Does reading the individual files work OK?

Yes. I will just stay with the file from vcftools.

When you specify to bcftools that the output is to be gzipped, you get the same error. This again suggests to me that the original problem was a gzipped VCF file. What is the result of running file myfile.vcf?
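
One quick check on the gzip hypothesis is to see where UTF-8 decoding of gzip data first fails. The sketch below compresses an invented one-line header in memory; it says nothing definite about the file in this thread, but it shows what the failure normally looks like.

```python
import gzip

# Invented payload; any gzip stream starts with the same two signature bytes.
payload = gzip.compress(b"##fileformat=VCFv4.2\n")

try:
    payload.decode("utf-8")
    failure = None
except UnicodeDecodeError as exc:
    # (offset of the first undecodable byte, value of that byte)
    failure = (exc.start, payload[exc.start])

print(failure)  # (1, 139), i.e. byte 0x8b at position 1
```

Gzip data decoded as text typically fails at position 1 on byte 0x8b, so an error on byte 0x9d thousands of bytes in may point to something other than plain gzip compression.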

The old file was reported as data, as I mentioned in a comment on your answer (C: Reading vcf file using python gives UnicodeDecodeError).

Apologies, I missed that. On my system a .vcf file is reported as:

Variant Call Format (VCF) version 4.1, ASCII text

A .vcf.gz file is reported as:

gzip compressed data, extra field

It sounds like the bcftools command may be producing a corrupt file; a bug report might be useful. Did you check the exit code of the bcftools command? What happens if you run bcftools view on the merged file? Which bcftools version are you using?
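
Checking the exit code can be scripted. A minimal sketch, assuming bcftools is on the PATH for the commented-out usage; run_checked is a hypothetical helper and the file name is the one used earlier in the thread.

```python
import subprocess
import sys


def run_checked(command):
    """Run a command and return (exit_code, stderr_text) without raising."""
    completed = subprocess.run(
        command,
        stdout=subprocess.DEVNULL,  # discard normal output; only status matters here
        stderr=subprocess.PIPE,
    )
    return completed.returncode, completed.stderr.decode("utf-8", errors="replace")


# Hypothetical usage (requires bcftools to be installed):
# code, err = run_checked(["bcftools", "view", "-h", "someresult.vcf"])
# print(code, err)
```

A non-zero exit code or any text on stderr would be worth including in a bug report.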

I do not have access right now, but what you are suggesting is really interesting. I will try it and keep you updated. (There were no errors when running bcftools.)

Hopefully you can get to the bottom of the issue. When I encounter a data problem like this I get scared - if such a well used program as bcftools has an issue with the input it could mean the data is corrupt - maybe the alternative tool vcf-merge is failing in a different way that seems to work, but is actually giving you garbage data.

There are always edge cases that make even well-tested tools like bcftools fail. An example is using bcftools norm on an indel-heavy VCF - I ran into a scenario where it made an entry with identical chr, pos, ref and alt entries, with just the DP/DQ/something-of-that-sort different.
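
Duplicates of that kind can be detected with a short script. This is a sketch that assumes standard tab-separated VCF columns (CHROM, POS, ID, REF, ALT, ...); duplicate_sites is a made-up helper name.

```python
from collections import Counter


def duplicate_sites(lines):
    """Return {(chrom, pos, ref, alt): count} for sites seen more than once."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip headers and blank lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], fields[1], fields[3], fields[4]
        counts[(chrom, pos, ref, alt)] += 1
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical usage:
# with open("normalized.vcf") as handle:
#     print(duplicate_sites(handle))
```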
