Hi everybody, I have a problem with two files downloaded from the Ensembl FTP site:
Homo_sapiens.GRCh38.pep.all.fa.gz - http://ftp.ensembl.org/pub/release-104/fasta/homo_sapiens/pep/
Homo_sapiens.GRCh38.104.chr.gtf.gz - http://ftp.ensembl.org/pub/release-104/gtf/homo_sapiens/
I need to open them on a Windows system. I tried Word and WordPad, but the encoding does not seem to be recognized. Word offers a list of possible encodings when I try to open the files, but none of them renders the files in a readable form.
I also tried to open them with a Python script, but I always get the same error:
def file_head(file_name, number_of_lines, encode="utf8"):
    # Print the first number_of_lines lines of a text file
    file_hand = open(file_name, 'r', encoding=encode)
    for i, line in enumerate(file_hand):
        if i >= number_of_lines:
            break
        print(line)
    file_hand.close()

# ------------ MAIN --------------
filename = 'myfasta.fasta'
file_head(filename, 50)
The error message is always the following:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte
I think these Ensembl files are widely used by researchers, but I could not find a valid solution on the web. I do not know what I am doing wrong.
Thank you in advance for your help.
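A note for readers who hit the same error: the byte 0x8b at position 1 is the second byte of the gzip magic number (0x1f 0x8b), so the message suggests the file is still gzip-compressed rather than plain text. Below is a minimal sketch using Python's standard gzip module. The helper is_gzipped and this reworked file_head are illustrative assumptions, not part of the original script.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzipped(path):
    """Return True if the file starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def file_head(path, number_of_lines, encoding="utf8"):
    """Return the first number_of_lines lines, decompressing if needed."""
    # Pick gzip.open for compressed files, the built-in open otherwise.
    opener = gzip.open if is_gzipped(path) else open
    lines = []
    with opener(path, "rt", encoding=encoding) as handle:
        for i, line in enumerate(handle):
            if i >= number_of_lines:
                break
            lines.append(line.rstrip("\n"))
    return lines
```

Calling file_head('myfasta.fasta', 50) would then return the first 50 lines whether or not the file is still compressed once. A file that was compressed twice would still come back as binary data and fail to decode.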
Yes Emily, you are right.
I downloaded the files with Chrome instead of Firefox, unzipped them with WinRAR, and it works. Now I can open them in Word and with the Python script.
I was not aware of the double zip behaviour of Firefox.
Thank you very much.
It's not just our files that Firefox does this to, so be aware.
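If a download may have been compressed more than once, as with the double-zip behaviour mentioned above, one option is to keep removing gzip layers until the data no longer starts with the gzip magic number (0x1f 0x8b). This is a rough sketch using only the standard library. The function name open_nested_gzip and the max_layers limit are assumptions made for illustration.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_nested_gzip(path, max_layers=5):
    """Return (binary stream, number of gzip layers removed).

    The stream yields the fully decompressed bytes. The caller is
    responsible for closing it.
    """
    stream = open(path, "rb")
    layers = 0
    # peek() looks at the next bytes without consuming them, so each
    # new GzipFile still reads its layer from the very beginning.
    while layers < max_layers and stream.peek(2)[:2] == GZIP_MAGIC:
        stream = gzip.GzipFile(fileobj=stream, mode="rb")
        layers += 1
    return stream, layers
```

To read text lines, the returned stream can be wrapped with io.TextIOWrapper(stream, encoding="utf8"). A layer count of 2 would confirm that the file was gzipped twice.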