File upload and character sets / encoding

When handling file uploads from a browser, keep in mind that you don't know what is coming, at least as far as the character set is concerned. The browser simply gives you no hint that says: here is a UTF-8 encoded Unicode text file. Nor does it warn you that the document it is sending was created on a Windows machine using the windows-1252 character set.

Why don't browsers do this? The answer is rather simple: they don't have a clue either. The file is read from disk as raw bytes, and most file systems don't store metadata about character set or encoding.
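
To make the problem concrete, here is a small Python sketch of what happens when the receiving side guesses. The sample text is made up for illustration; the point is only that the same bytes decode without any error under several character sets, so nothing in the bytes themselves tells you which one was meant.

```python
# The bytes as they would sit on disk: a text saved as UTF-8.
raw = "café €5".encode("utf-8")

# Three receivers, three different assumptions about the same bytes.
as_utf8 = raw.decode("utf-8")      # the text the author intended
as_cp1252 = raw.decode("cp1252")   # mojibake, but no error is raised
as_latin1 = raw.decode("latin-1")  # different mojibake, still no error

for label, text in (("utf-8", as_utf8), ("cp1252", as_cp1252), ("latin-1", as_latin1)):
    print(f"{label:8} {text!r}")
```

Only the first result is what the author wrote; the other two are silently wrong.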

How do we deal with that correctly? There is only one valid option: the person uploading the file must tell us what it is. If you have a form with a file upload, put a drop-down next to it with a list of character sets and let the user indicate what they are sending. If a system is sending in files via REST, make sure you know what it is sending, or give it a parameter to indicate the character set used.
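
On the receiving side, that could look roughly like the Python sketch below. The function name `decode_upload`, the `ALLOWED_CHARSETS` list and the error messages are hypothetical and not tied to any particular web framework; the idea is simply to accept only the character sets offered in the drop-down and to decode strictly, so that a wrong declaration fails loudly where the bytes allow it.

```python
import codecs

# Hypothetical list of charsets offered in the drop-down (an assumption).
ALLOWED_CHARSETS = ("utf-8", "windows-1252", "iso-8859-1", "utf-16")


def decode_upload(data: bytes, declared_charset: str) -> str:
    """Decode uploaded bytes using the charset the sender declared."""
    try:
        codec = codecs.lookup(declared_charset)
    except LookupError:
        raise ValueError(f"unknown charset: {declared_charset!r}")

    # Compare normalised codec names, so "UTF8" and "utf-8" are the same thing.
    allowed = {codecs.lookup(name).name for name in ALLOWED_CHARSETS}
    if codec.name not in allowed:
        raise ValueError(f"charset not accepted: {declared_charset!r}")

    try:
        # Strict decoding: invalid byte sequences raise instead of being replaced.
        return data.decode(codec.name, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"data is not valid {declared_charset}") from exc
```

Keep in mind that strict decoding only catches some mistakes. Single-byte character sets such as iso-8859-1 accept any byte sequence, so a wrong declaration there still produces text, just the wrong text.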

Having the sender declare the character set is the only solution that is 100% guaranteed. If you want to try something more advanced, look at IBM's ICU project (Java / C / C++). It has functionality that detects the charset or encoding of character data in an unknown format, but the results cannot be guaranteed to always be correct.
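
To give a feel for why detection stays a guess, here is a deliberately simple heuristic in Python. This is not ICU and does not use its API; it is a much cruder stand-in that only checks for a byte order mark and for valid UTF-8. The function name `guess_charset`, the fallback to cp1252 and the `confident` flag are all assumptions made for this sketch.

```python
import codecs
from typing import Tuple

# Byte order marks, longest first: the UTF-32 LE mark starts with the UTF-16 LE mark.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_charset(data: bytes, fallback: str = "cp1252") -> Tuple[str, bool]:
    """Return (charset, confident). A heuristic guess, never a guarantee."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, True
    try:
        # Text in a legacy single-byte charset rarely happens to be valid UTF-8.
        data.decode("utf-8")
        return "utf-8", True
    except UnicodeDecodeError:
        # Not UTF-8: fall back to an assumed default and flag it as a guess.
        return fallback, False
```

A caller could use the `confident` flag to decide whether to ask the user for confirmation. Even a `True` here can be wrong, which is exactly the limitation described above.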