2008-09-23 10:58:35 +02:00
it appears that characters in the title that are unicode letters are spared the __ filename encoding and are instead saved in their utf8 encoding. (correct me if i'm wrong; i didn't find the code that does this.) -- see below for examples.
git: Fix handling of utf-8 filenames in recentchanges.
Seems that the problem is that once the \nnn coming from git is converted
to a single character, decode_utf8 decides that this is a standalone
character, and not part of a multibyte utf-8 sequence, and so does nothing.
I tried playing with the utf-8 flag, but that didn't work. Instead, use
decode("utf8"), which doesn't have the same qualms, and successfully
decodes the octets into a utf-8 character.
Rant:
Think for a minute about the fact that any and every program that parses git-log,
or git-show, etc output to figure out what files were in a commit needs to
contain this snippet of code, to convert from git-log's wacky output to a
regular character set:
    if ($file =~ m/^"(.*)"$/) {
        ($file=$1) =~ s/\\([0-7]{1,3})/chr(oct($1))/eg;
    }
(And it's only that "simple" if you don't care about filenames with
embedded \n or \t or other control characters.)
Does that strike anyone else as putting the parsing and conversion in the
wrong place (ie, in gitweb, ikiwiki, etc, etc)? Doesn't anyone who actually
uses git with utf-8 filenames get a bit pissed off at seeing \xxx\xxx
instead of the utf-8 in git-commit and other output?
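
The unquoting step from the commit message above, extended to the control-character escapes it mentions, might look roughly like the following. This is only a sketch: `unquote_git_path` and `%C_ESCAPES` are hypothetical names, not ikiwiki's actual code, and it assumes git quotes paths C-style (`\n`, `\t`, `\"`, `\\`, and `\nnn` octal bytes) and that the resulting octets are UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed set of single-letter escapes that git's C-style quoting can emit.
my %C_ESCAPES = (
    a => "\a", b => "\b", f => "\f", n => "\n", r => "\r",
    t => "\t", v => "\x0b", '"' => '"', '\\' => '\\',
);

# Hypothetical helper: turn one path as printed by git-log into a
# Perl character string.
sub unquote_git_path {
    my ($file) = @_;
    if ($file =~ m/^"(.*)"$/) {
        # One pass over all escapes, so that an escaped backslash
        # followed by digits is not mistaken for an octal escape.
        # Each \nnn becomes a single octet, not yet a character.
        ($file = $1) =~ s/\\([0-7]{1,3}|[abfnrtv"\\])/
            exists $C_ESCAPES{$1} ? $C_ESCAPES{$1} : chr(oct($1))/egx;
    }
    # Interpret the octets as UTF-8, as the commit does with decode("utf8").
    return decode("utf8", $file);
}

my $name = unquote_git_path('"uml\\303\\244ute.mdwn"');
```

Here `$name` ends up as a character string in which the two octets `\303\244` have become the single character U+00E4; unquoted (plain ASCII) paths pass through unchanged.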
2008-09-26 00:26:42 +02:00
> Filenames can have any alphanumerics in them without the __ escaping.
> Your locale determines whether various unicode characters are considered
2009-10-02 15:08:22 +02:00
> alphanumeric. In other words, it just looks at the \[[:alpha:]] character
2008-09-26 00:26:42 +02:00
> class, whatever your locale defines it to be. --[[Joey]]
2008-09-23 10:58:35 +02:00
this is not a problem per se, but (at least with the git backend) the recent changes page misinterprets the file name character set (it seems to read the filenames as latin1), so it both displays wrong titles and creates broken links.
the problem can be shown with an auto-setup'd ikiwiki without cgi when manually creating utf8 encoded filenames and running ikiwiki with `LANG=en_GB.UTF-8`.
2008-09-26 00:26:42 +02:00
> Encoding issue, I figured out a fix. [[done]] --[[Joey]]
2008-09-23 10:58:35 +02:00
2008-09-26 15:05:01 +02:00
>> the link text works now, but the link goes to
>> `ikiwiki.cgi?page=uml%C3%A4ute&do=recentchanges_link`, which fails with
>> "missing page". it seems that bestlink can't handle utf8 encoded texts. (the
>> same happens, by the way, when using meta-redir to a page with high bytes in
>> the name.)
2008-09-26 18:44:09 +02:00
>>
2008-09-26 21:40:01 +02:00
>>> The problem is that all cgi inputs have to be explicitly decoded to
>>> utf-8, which I've now done for `recentchanges_link`.
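
As an illustration of what "explicitly decoded" means for a value like `uml%C3%A4ute`, here is a minimal sketch. `decode_cgi_param` is a hypothetical helper, not the function ikiwiki uses; it assumes the raw, still percent-encoded query value is available and that the client sent UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: percent-decode a raw query value into octets,
# then interpret those octets as UTF-8 to get a character string.
sub decode_cgi_param {
    my ($raw) = @_;
    (my $octets = $raw) =~ tr/+/ /;                   # '+' encodes a space
    $octets =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;   # %XX -> one octet
    return decode("utf8", $octets);
}

my $page = decode_cgi_param('uml%C3%A4ute');
```

Without the final `decode`, `$page` would hold eight octets rather than seven characters, and a lookup against page names stored as character strings would not match.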
2008-09-28 11:47:20 +02:00
>>>> thanks a lot, i think that closed the bug.
2008-09-26 21:40:01 +02:00
>>>
>>> I cannot, however, reproduce a problem with meta redir. Here it
>>> generated the following html, which redirected the browser ok:
>>> <meta http-equiv="refresh" content="0; URL=./../â/" />
2008-09-28 11:47:20 +02:00
>>>> sorry, my fault -- it was the blank which needed to be replaced by an
>>>> underscore, not the high byte character
2008-09-26 21:40:01 +02:00
>>
2008-09-26 18:44:09 +02:00
>> update: i've had a look at the git options; you could run git with '-z' (NUL
>> termination) in the `git_commit_info` function; this would require some
>> changes in `parse_diff_tree`, but otherwise completely eliminate the
>> problems with git escaping.
2008-09-26 15:05:01 +02:00
>>
2008-09-26 21:40:01 +02:00
>>> If you would like to develop a patch to that effect, I'd be glad to
>>> drop the current nasty code.
2008-09-28 11:47:20 +02:00
>>>> i'll have a look, but i'm afraid that's above my current perl skills.
>>>> --[[chrysn]]
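
For reference, a rough sketch of what parsing `-z` output could look like. It assumes the input holds only raw diff records as `git diff-tree -r -z` prints them (metadata, NUL, path, NUL, with a second path for renames and copies) and that paths are UTF-8. `parse_raw_z` is a hypothetical name and is not a replacement for ikiwiki's `parse_diff_tree`.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: split NUL-terminated raw diff records into a
# list of { status, file } hashes. No unquoting is needed because
# -z output carries the path octets verbatim.
sub parse_raw_z {
    my ($raw) = @_;
    my @fields = split /\0/, $raw;
    my @changes;
    while (@fields) {
        my $meta = shift @fields;
        # ":<srcmode> <dstmode> <srcsha> <dstsha> <status>[score]"
        next unless $meta =~ m/^:\d+ \d+ [0-9a-f]+ [0-9a-f]+ ([A-Z])\d*$/;
        my $status = $1;
        my $path = shift @fields;
        # Renames and copies carry source and destination; keep the latter.
        $path = shift @fields if $status eq 'R' || $status eq 'C';
        next unless defined $path;
        push @changes, { status => $status, file => decode("utf8", $path) };
    }
    return @changes;
}

# Sample input built by hand; the hashes are placeholders.
my $raw = ":100644 100644 " . ("a" x 40) . " " . ("b" x 40)
        . " M\0uml\303\244ute.mdwn\0"
        . ":000000 100644 " . ("0" x 40) . " " . ("c" x 40)
        . " A\0with space.mdwn\0";
my @changes = parse_raw_z($raw);
```

Because the path arrives as raw octets between NULs, filenames with spaces, quotes, or high bytes need no special casing beyond the final `decode`.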