Hello everyone,
This should be very easy and I know it, but I am stuck with it and I cannot pinpoint my mistake.
I wanted a boolean Python function to check whether a given file is in FASTA format, without manually checking the extension myself (.fa, .fasta, etc.). I found this solution, which suited me. When looking for the files it needs, my Python script now uses this "is_fasta" function.
My problem is that it works for some files but not for others... When it fails, I get an error like one of these when trying to read the FASTA file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf3 in position 551: invalid continuation byte
#or
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 23: invalid start byte
So I understand there might be something wrong with the encoding of the file. I usually check it using the file command, but for both the files that work and the files that do not, I get "ASCII text", and when asking for more information with file -i, it just prints "regular file". So I don't see anything about utf-8 or the like. And my understanding of file formats pretty much stops here.
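As a cross-check that does not rely on the file command, here is a minimal sketch that reads a file as raw bytes and reports where UTF-8 decoding first fails. It uses only the standard library; the function name find_undecodable is hypothetical and not part of Biopython or any other package.

```python
def find_undecodable(path, encoding="utf-8"):
    """Return (position, byte value) of the first undecodable byte, or None.

    Hypothetical helper for illustration only.
    """
    # Read raw bytes so that no decoding happens implicitly.
    with open(path, "rb") as handle:
        data = handle.read()
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the offset of the offending byte, the same
        # "position" number shown in the error message.
        return err.start, data[err.start]
    return None
```

Calling it on each file the script picks up should show which file triggers the error and at which offset, which can then be compared with the position given in the traceback.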
I am working in a conda environment I created with several tools; the Python version inside is 3.6.10. I added Biopython with the regular conda command, from the conda-forge channel.
Does anyone have any advice about this issue? Or should I just go back to my original idea of simply checking the file extension?
Thank you and have a nice day,
This may be related to the Unix LOCALE you are using. https://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte
Hmmm, indeed a good lead. I forgot to say I am on a Mac environment (and totally new to it). It seems my LANG variable is empty... I will try to see if playing around with that idea helps solve the issue.
Ah, sadly this was not the issue. I use a custom bunch of settings for bash (zsh), and I set the locale properly by following the steps here: https://github.com/ohmyzsh/ohmyzsh/issues/7558 . But yeah, even with my LANG fixed, it is still not working and still showing encoding errors :/
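Since fixing the locale did not help, one option is to make the check itself independent of the locale. Below is a minimal sketch of an "is_fasta" check that passes the encoding explicitly to open() and treats a file that cannot be decoded as "not FASTA". It is not the linked solution and does not use Biopython: it only assumes that a FASTA file's first non-blank line starts with ">".

```python
def is_fasta(path, encoding="utf-8"):
    """Return True if the first non-blank line of the file starts with '>'.

    Sketch only: this is an assumed, simplified definition of "FASTA",
    not the check performed by Biopython.
    """
    try:
        # Explicit encoding, so the result does not depend on LANG/locale.
        with open(path, "r", encoding=encoding) as handle:
            for line in handle:
                if line.strip():
                    return line.startswith(">")
    except UnicodeDecodeError:
        # Bytes that cannot be decoded: treat the file as non-FASTA
        # instead of letting the error propagate.
        return False
    # Empty file, or only blank lines.
    return False
```

Note that this only looks at the beginning of the file, so a stray byte much further in could still raise an error later, when the whole file is read.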