Question

Ensembl encoding problem

0


3.4 years ago

giammafer ▴ 20

Hi everybody, I have a problem with two files downloaded from the Ensembl FTP site:

Homo_sapiens.GRCh38.pep.all.fa.gz - http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/pep/

Homo_sapiens.GRCh38.104.chr.gtf.gz - http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/

I need to open them on a Windows system. I tried Word and WordPad, but the encoding does not seem to be recognized: Word offers a list of possible encodings when I try to open the files, but none of them renders the files in a readable form.


I also tried to open them with a Python script, but I always get the same error:

def file_head(file_name, number_of_lines, encode="utf8"):
    # Print the first `number_of_lines` lines of a text file.
    with open(file_name, 'r', encoding=encode) as file_hand:
        for i, line in enumerate(file_hand):
            if i >= number_of_lines:
                break
            print(line, end='')

# ------------ MAIN --------------

filename = 'myfasta.fasta'
file_head(filename, 50)

The error message is always like this:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

I think these Ensembl files are widely used by researchers, but I could not find a working solution on the web. I do not know what I am doing wrong.

Thank you in advance for your help.

Ensembl encoding • 1.6k views

ADD COMMENT • link updated 20 months ago by Ram 44k • written 3.4 years ago by giammafer ▴ 20

1


3.4 years ago

Kevin Blighe 88k

Have you used gunzip (or another decompression tool) to decompress these after having downloaded them? Can you show all commands after you downloaded the original files?
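A quick way to tell whether a file is still compressed: every gzip stream starts with the two bytes 0x1f 0x8b, and the 0x8b "in position 1" from the error message above is the second of those. Here is a minimal sketch of that check; the name `is_gzipped` is just an illustrative helper, not part of any library.

```python
GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    # Open in binary mode so no text decoding is attempted.
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC
```

If this returns True for a file you believe you have already decompressed, it still has at least one gzip layer on it.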

I can open one of these files on Windows 10 after decompressing it with 7-Zip.

Kevin

ADD COMMENT • link 3.4 years ago by Kevin Blighe 88k

0


Thank you Kevin for the suggestion.

I was using WinRAR, but I installed 7-Zip and it works.

However, I think the main problem was the double compression caused by the Firefox download.

ADD REPLY • link 3.4 years ago by giammafer ▴ 20

0


3.4 years ago

Mensur Dlakic ★ 28k

These files are gzipped, which is a form of compression. You need a program such as gunzip to unpack them; after unpacking they lose the .gz extension and become ordinary text files that can be opened in Word or WordPad.
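If the goal is only to look at the first lines from Python, the file does not have to be unpacked on disk first. The standard-library `gzip` module can open it directly in text mode. This is a minimal sketch, assuming the file has a single gzip layer; `gz_head` is a hypothetical helper name.

```python
import gzip
from itertools import islice


def gz_head(path, number_of_lines, encoding="utf-8"):
    """Return the first `number_of_lines` lines of a gzipped text file."""
    # "rt" decompresses on the fly and decodes the bytes to str.
    with gzip.open(path, "rt", encoding=encoding) as handle:
        return [line.rstrip("\n") for line in islice(handle, number_of_lines)]
```

For example, `gz_head("Homo_sapiens.GRCh38.pep.all.fa.gz", 50)` should give the first 50 lines of the peptide FASTA, provided the download was not compressed a second time.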

ADD COMMENT • link 3.4 years ago by Mensur Dlakic ★ 28k

score 2 · Accepted Answer · 2021-07-09

2


3.4 years ago

Emily 24k

Did you use Firefox to download? Firefox, for some reason, double-zips the already zipped files, which means that when you unzip them once, you just get binary files again. We recommend using another browser or a command-line tool (e.g. wget) to download.
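If you already have a double-zipped copy and would rather not download it again, the extra layer can be peeled off in Python. This is only a sketch: `strip_gzip_layers` and its `max_layers` cap are illustrative names, it treats any data starting with the gzip magic bytes as one more layer, and it holds the whole file in memory, which may be too much for the larger GTF file.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def strip_gzip_layers(data, max_layers=5):
    """Decompress `data` repeatedly while it still looks gzipped.

    Returns the fully decompressed bytes and the number of layers removed.
    """
    layers = 0
    while data[:2] == GZIP_MAGIC and layers < max_layers:
        data = gzip.decompress(data)
        layers += 1
    return data, layers
```

Read the download with `open(path, "rb").read()`, pass the bytes in, and write the result back out in binary mode; a layer count of 2 would confirm the double compression.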

ADD COMMENT • link 3.4 years ago by Emily 24k

0


Yes Emily you are right.

I used Chrome instead of Firefox, unzipped with WinRAR, and it works. Now I can open the files in Word and with the Python script.


I was not aware of the double zip behaviour of Firefox.

Thank you very much.

ADD REPLY • link 3.4 years ago by giammafer ▴ 20

0


It does this to files from other sites too, not just ours, so be aware.

ADD REPLY • link 3.4 years ago by Emily 24k