ikiwiki/doc/bugs/UTF-16_and_UTF-32_are_unhan...

Wide-character encodings such as UTF-16 and UTF-32 should probably be supported, or, at the very least, warned about.

Test case:

    mkdir -p ikiwiki-utf-test/raw ikiwiki-utf-test/rendered
    for page in txt mdwn; do
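      echo hello > "ikiwiki-utf-test/raw/$page.$page"
      # One copy per encoding; iconv writes a BOM for UTF-16/UTF-32 but not for the explicit BE/LE variants.
      for text in 8 16 16BE 16LE 32 32BE 32LE; do
        iconv -t "UTF-$text" "ikiwiki-utf-test/raw/$page.$page" > "ikiwiki-utf-test/raw/$page-utf$text.$page"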
      done
    done
    ikiwiki --verbose --plugin txt --plugin mdwn ikiwiki-utf-test/raw/ ikiwiki-utf-test/rendered/
    www-browser ikiwiki-utf-test/rendered/ || x-www-browser ikiwiki-utf-test/rendered/
    # rm -r ikiwiki-utf-test/ # some browsers rather stupidly daemonize themselves, so this operation can't easily be safely automated

BOM-less LE and BE input is probably a lost cause: without a byte order mark there is no reliable way to tell the encodings apart.
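
As a rough illustration of why the BOM matters, here is a minimal sketch of BOM sniffing in shell. The function name `detect_bom` is hypothetical and not part of ikiwiki, and the sketch assumes `head -c` and `od` are available. Input without a BOM falls through to `unknown`, which is exactly the case described above.

```shell
#!/bin/sh
# Hypothetical helper (not an ikiwiki function): guess an encoding from the
# byte order mark, if any. Prints an iconv-style name, or "unknown".
detect_bom() {
  # Hex-dump the first four bytes as one lowercase string, e.g. "fffe6800".
  bytes=$(head -c 4 "$1" | od -An -tx1 | tr -d ' \n')
  case $bytes in
    0000feff) echo UTF-32BE ;;
    fffe0000) echo UTF-32LE ;;  # checked before UTF-16LE, which shares the FF FE prefix
    efbbbf*)  echo UTF-8 ;;
    feff*)    echo UTF-16BE ;;
    fffe*)    echo UTF-16LE ;;
    *)        echo unknown ;;   # no BOM: the encoding cannot be inferred this way
  esac
}
```

Run against the test case, `detect_bom ikiwiki-utf-test/raw/txt-utf16.txt` should report a UTF-16 variant, while the `16LE` and `16BE` files should print `unknown`. A UTF-16LE file whose first character after the BOM is U+0000 would be misreported as UTF-32LE; that ambiguity is inherent in the BOM scheme.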

Optimally, UTF-16 (which is ubiquitous in the Windows world) and UTF-32 should be fully supported, probably by converting to mostly-UTF-8 and using `&#xXXXX;` or `&#DDDDD;` XML escapes where necessary.

Suboptimally, UTF-16 and UTF-32 should be converted to UTF-8 where cleanly possible and a warning printed where impossible.
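
The "convert where cleanly possible, warn where impossible" behaviour could look something like the sketch below. It is only an assumption about how a pre-processing step might work, not how ikiwiki does anything today; `to_utf8` and its argument order are made up for illustration. It relies on `iconv` exiting non-zero when it cannot decode its input.

```shell
#!/bin/sh
# Hypothetical helper: convert a file to UTF-8, or warn and leave no output
# behind if iconv reports an error.
# Usage: to_utf8 SOURCE_ENCODING INPUT OUTPUT
to_utf8() {
  # Convert into a temporary file so a failed run never leaves partial output.
  if iconv -f "$1" -t UTF-8 "$2" > "$3.tmp" 2>/dev/null; then
    mv "$3.tmp" "$3"
  else
    rm -f "$3.tmp"
    echo "warning: could not convert $2 from $1 to UTF-8, skipping" >&2
    return 1
  fi
}
```

Passing `UTF-16` or `UTF-32` (rather than an explicit `LE`/`BE` name) as the source encoding lets `iconv` consume the BOM instead of copying it into the output.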

----
According to the Wikipedia pages on [[!wikipedia UTF-8]] and [[!wikipedia UTF-16]], all valid Unicode characters are representable in UTF-8, UTF-16 and UTF-32, so the only errors possible in a UTF-16/32 -> UTF-8 translation come from encoding errors in the original document.

Of course, it's entirely possible that not all browsers support UTF-8 correctly, so we might need an option to encode into [[!wikipedia CESU-8]] instead. That has the side effect of allowing UTF-16 or UTF-32 encoding errors to be transcribed into the output byte stream, rather than pedantically removing those bytes.

An interesting question is how to determine the character set of an arbitrary new file added to the repository. If the repository itself handles character encoding, we can just ask it to hand us a UTF-8 encoded version of the file.
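
Where the repository offers no help, one cheap heuristic is to test whether a file already looks like UTF-8 before trying anything else. The sketch below is an assumption about what such a check could be, with a hypothetical function name `looks_like_utf8`. It rejects files containing NUL bytes, because BOM-less UTF-16 or UTF-32 encodings of plain ASCII text are technically valid UTF-8 full of U+0000.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file is plausibly UTF-8 text.
looks_like_utf8() {
  # NUL bytes are legal UTF-8 but in practice point at UTF-16/32 input.
  # Stripping NULs and comparing with the original detects their presence.
  tr -d '\000' < "$1" | cmp -s - "$1" || return 1
  # Let iconv validate the byte sequence; it exits non-zero on malformed UTF-8.
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

This only separates "probably UTF-8" from "something else"; it cannot say which legacy encoding a rejected file is in.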

-- [[Martin Rudat|http://www.toraboka.com/~mrudat]]

Bug: UTF-16 and UTF-32 are unhandled: New 2010-10-05 03:09:42 +02:00

add comment. 2011-11-04 11:31:35 +01:00