Wide characters should probably be supported, or, at the very least, warned about.

Test case:

    mkdir -p ikiwiki-utf-test/raw ikiwiki-utf-test/rendered
    for page in txt mdwn; do
      echo hello > ikiwiki-utf-test/raw/$page.$page
      for text in 8 16 16BE 16LE 32 32BE 32LE; do
        iconv -t UTF-$text ikiwiki-utf-test/raw/$page.$page > ikiwiki-utf-test/raw/$page-utf$text.$page
      done
    done
    ikiwiki --verbose --plugin txt --plugin mdwn ikiwiki-utf-test/raw/ ikiwiki-utf-test/rendered/
    www-browser ikiwiki-utf-test/rendered/ || x-www-browser ikiwiki-utf-test/rendered/
    # rm -r ikiwiki-utf-test/ # some browsers rather stupidly daemonize themselves, so this operation can't easily be safely automated

BOMless LE and BE input is probably a lost cause.
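To make the BOM point concrete, here is a minimal shell sketch of BOM sniffing. `sniff_bom` is a made-up helper name, not anything ikiwiki provides. It only looks at the first four bytes, so files without a BOM (such as the explicit BE/LE variants in the test case, which iconv normally writes without one) come back as `unknown`.

```shell
#!/bin/sh
# sniff_bom FILE: print the encoding implied by a leading byte-order mark,
# or "unknown" if there is none. Hypothetical helper, not part of ikiwiki.
sniff_bom() {
    # First four bytes as lowercase hex without separators, e.g. "fffe0000".
    head_hex=$(od -An -tx1 -N4 "$1" | tr -d ' \t\n')
    case "$head_hex" in
        0000feff) echo UTF-32BE ;;
        # Tested before UTF-16LE because the UTF-32LE BOM starts with the
        # same two bytes. A UTF-16LE file whose first character is U+0000
        # would be misread here; that ambiguity is inherent to BOM sniffing.
        fffe0000) echo UTF-32LE ;;
        efbbbf*)  echo UTF-8 ;;
        feff*)    echo UTF-16BE ;;
        fffe*)    echo UTF-16LE ;;
        *)        echo unknown ;;
    esac
}

# Example: report what can be detected in the test case directory.
for f in ikiwiki-utf-test/raw/*; do
    [ -f "$f" ] || continue
    printf '%s: %s\n' "$f" "$(sniff_bom "$f")"
done
```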
Optimally, UTF-16 (which is ubiquitous in the Windows world) and UTF-32 should be fully supported, probably by converting to mostly-UTF-8 and using `&#xXXXX;` or `&#DDDDD;` XML escapes where necessary.

Suboptimally, UTF-16 and UTF-32 should be converted to UTF-8 where cleanly possible and a warning printed where impossible.

----

According to the Wikipedia pages about [[!wikipedia UTF-8]] and [[!wikipedia UTF-16]], all valid Unicode characters are representable in UTF-8, UTF-16 and UTF-32, so the only errors possible in a UTF-16/32 -> UTF-8 translation come from encoding errors in the original document.
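As a rough illustration of the convert-or-warn behaviour, assuming `iconv` is available and the source encoding is already known: `to_utf8` below is a hypothetical helper, not an ikiwiki function. It relies on `iconv` exiting non-zero on malformed input, such as an unpaired surrogate in UTF-16.

```shell
#!/bin/sh
# to_utf8 ENCODING INFILE OUTFILE: hypothetical helper, not an ikiwiki API.
# Converts INFILE from ENCODING to UTF-8. If iconv rejects the input, the
# partial output is removed, a warning goes to stderr, and the function
# returns non-zero so the caller can decide what to do with the page.
to_utf8() {
    if iconv -f "$1" -t UTF-8 "$2" > "$3" 2>/dev/null; then
        return 0
    fi
    rm -f "$3"
    echo "warning: $2 is not valid $1, not converted" >&2
    return 1
}

# Usage (file names are placeholders):
#   to_utf8 UTF-16LE page.mdwn page.utf8.mdwn || echo "left as-is"
```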
Of course, it's entirely possible that not all browsers support UTF-8 correctly, and we might need to support the option of encoding into [[!wikipedia CESU-8]] instead, which has the side-effect of allowing the transcription of UTF-16 or UTF-32 encoding errors into the output byte-stream, rather than pedantically removing those bytes.

An interesting question is how to determine the character set of an arbitrary new file added to the repository. If the repository itself handles character encoding, we can just ask it to hand us a UTF-8 encoded version of the file.
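One partial answer, sketched here under the assumption that `iconv` is on hand: a file that decodes cleanly as UTF-8 and contains no NUL bytes is very likely UTF-8 or plain ASCII. The NUL check matters because BOMless UTF-16 of mostly-ASCII text is technically valid UTF-8 with NULs between the letters. `is_probably_utf8` is an invented name, and a failure only says "not UTF-8", not what the encoding actually is.

```shell
#!/bin/sh
# is_probably_utf8 FILE: hypothetical heuristic, not an ikiwiki function.
# Succeeds if FILE has no NUL bytes and iconv accepts it as UTF-8.
is_probably_utf8() {
    # Stripping NULs must leave the file unchanged; otherwise it contains
    # NUL bytes and is more likely UTF-16/32 than UTF-8.
    tr -d '\000' < "$1" | cmp -s - "$1" || return 1
    # iconv exits non-zero on malformed or truncated UTF-8 sequences.
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Usage (file name is a placeholder):
#   is_probably_utf8 page.mdwn || echo "page.mdwn: encoding needs a closer look" >&2
```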
-- [[Martin Rudat|http://www.toraboka.com/~mrudat]]