BOMless LE and BE input is probably a lost cause.
Optimally, UTF-16 (which is ubiquitous in the Windows world) and UTF-32 should be fully supported, probably by converting to mostly-UTF-8 and using `&#xXXXX;` or `&#DDDDD;` XML escapes where necessary.
Suboptimally, UTF-16 and UTF-32 should be converted to UTF-8 where cleanly possible and a warning printed where impossible.
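
For illustration, here is a minimal sketch (in Python, and purely an assumption about how the conversion could work, not how the tool actually behaves) of the "mostly-UTF-8 plus XML escapes" idea: decode UTF-16, keep everything that is valid Unicode, and escape anything that is not (e.g. an unpaired surrogate) as an `&#xXXXX;` reference while printing a warning. It assumes the input carries a BOM.

    # Sketch only: turn UTF-16 input into mostly-UTF-8 output, falling back to
    # an &#xXXXX; XML escape (plus a warning) for anything that is not valid
    # Unicode, e.g. an unpaired surrogate left over from a damaged document.
    import sys

    def utf16_to_mostly_utf8(data: bytes) -> bytes:
        # 'surrogatepass' lets unpaired surrogates through the decoder so we
        # can escape them ourselves instead of aborting on the whole file.
        # Assumes a BOM (or native byte order); BOMless LE/BE stays hard.
        text = data.decode("utf-16", errors="surrogatepass")
        out = []
        for ch in text:
            cp = ord(ch)
            if 0xD800 <= cp <= 0xDFFF:  # unpaired surrogate: not valid UTF-8
                print(f"warning: escaping unpaired surrogate U+{cp:04X}",
                      file=sys.stderr)
                out.append(f"&#x{cp:X};".encode("ascii"))
            else:
                out.append(ch.encode("utf-8"))
        return b"".join(out)

    # e.g. sys.stdout.buffer.write(utf16_to_mostly_utf8(sys.stdin.buffer.read()))
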
----
Reading the Wikipedia pages about [[!wikipedia UTF-8]] and [[!wikipedia UTF-16]], it appears that all valid Unicode characters are representable in UTF-8, UTF-16 and UTF-32, and the only errors possible in UTF-16/32 -> UTF-8 translation are those caused by encoding errors in the original document.
Of course, it's entirely possible that not all browsers support UTF-8 correctly, and we might need to support the option of encoding into [[!wikipedia CESU-8]] instead, which has the side effect of allowing UTF-16 or UTF-32 encoding errors to be transcribed into the output byte stream rather than pedantically removing those bytes.
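
For what it's worth, CESU-8 is simple enough to sketch. The following is just an illustration in Python (not anything a particular browser requires): a supplementary character ends up as two 3-byte surrogate sequences instead of one 4-byte UTF-8 sequence, which is also why a lone surrogate from a broken UTF-16 stream can be carried through instead of dropped.

    # Sketch of CESU-8 encoding: BMP characters are encoded as in UTF-8, but
    # anything above U+FFFF is first split into a UTF-16 surrogate pair and
    # each half is encoded as its own 3-byte sequence.  Lone surrogates (i.e.
    # UTF-16 encoding errors) can be passed through the same way.

    def _three_bytes(cp: int) -> bytes:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

    def cesu8_encode(text: str) -> bytes:
        out = bytearray()
        for ch in text:
            cp = ord(ch)
            if cp < 0x80:
                out.append(cp)
            elif cp < 0x800:
                out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
            elif cp <= 0xFFFF:
                out += _three_bytes(cp)     # includes lone surrogates
            else:
                cp -= 0x10000               # split into a surrogate pair
                out += _three_bytes(0xD800 | (cp >> 10))
                out += _three_bytes(0xDC00 | (cp & 0x3FF))
        return bytes(out)

    # U+1F600 is 4 bytes in UTF-8 but 6 bytes (two escaped surrogates) in CESU-8:
    assert "\U0001F600".encode("utf-8") == b"\xf0\x9f\x98\x80"
    assert cesu8_encode("\U0001F600") == b"\xed\xa0\xbd\xed\xb8\x80"
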
An interesting question is how to determine the character set of an arbitrary new file added to the repository, unless the repository itself handles character encoding, in which case we can just ask the repository to hand us a UTF-8 encoded version of the file.
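
For illustration, a rough sketch of such detection (an ad-hoc heuristic, not something any repository backend actually provides): sniff the BOM first, then try a strict UTF-8 decode, and fall back to a legacy 8-bit guess otherwise; BOM-less UTF-16/32 remains the lost cause noted above.

    # Rough heuristic only: BOM sniffing, then a strict UTF-8 trial decode,
    # then a last-resort legacy-8-bit guess.  The names returned are Python
    # codec names, used here purely for illustration.

    def guess_encoding(data: bytes) -> str:
        boms = [
            (b"\x00\x00\xfe\xff", "utf-32-be"),
            (b"\xff\xfe\x00\x00", "utf-32-le"),  # must be tested before UTF-16-LE
            (b"\xfe\xff",         "utf-16-be"),
            (b"\xff\xfe",         "utf-16-le"),
            (b"\xef\xbb\xbf",     "utf-8-sig"),
        ]
        for bom, name in boms:
            if data.startswith(bom):
                return name
        try:
            data.decode("utf-8")
            return "utf-8"
        except UnicodeDecodeError:
            return "latin-1"  # a guess; BOM-less UTF-16/32 is indistinguishable here
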
-- [[Martin Rudat|http://www.toraboka.com/~mrudat]]