ikiwiki/doc/plugins/contrib/i18nheadinganchors/discussion.mdwn

I would not be comfortable with merging this into headinganchors and enabling it by
default for two main reasons:

* it adds a new dependency on [[!cpan Text::Unidecode]]
* Text::Unidecode specifically documents its transliteration as not being stable
  across versions

There are several "slugify" libraries available other than Text::Unidecode.
It isn't clear to me which one is the best. Pandoc also documents
[an algorithm for generating slugs](http://pandoc.org/MANUAL.html#extension-auto_identifiers),
and it would be nice if our fallback implementation (with i18n disabled) was compatible
with Pandoc's, at least for English text.

However! In HTML5, IDs are allowed to contain anything except _space characters_
(space, newline, tab, CR, FF), so we could consider just passing non-ASCII
through the algorithm untouched. This [example link to a Russian
anchor name](#пример) (the output of putting "example" into English-to-Russian
Google Translate) hopefully works? (Use a small browser window to make it
clearer where it goes)
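
As a rough illustration of that rule (a sketch only, in Python rather than Perl; the function name is made up for this example): an ID is acceptable in HTML5 if it is non-empty and contains none of the space characters listed above.

```python
# The HTML5 "space characters": space, tab, newline (LF), form feed, CR.
HTML5_SPACE_CHARACTERS = " \t\n\f\r"


def is_valid_html5_id(value):
    """Return True if value is non-empty and has no HTML5 space characters.

    Hypothetical helper for illustration; it is not part of ikiwiki.
    """
    return bool(value) and not any(ch in HTML5_SPACE_CHARACTERS for ch in value)


print(is_valid_html5_id("пример"))     # non-ASCII is fine
print(is_valid_html5_id("two words"))  # contains a space, so not valid
```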

> Can we assume Ikiwiki generates HTML5 all the time? I thought that was still a 
> setting off by default... --[[anarcat]]

>> ikiwiki always generates HTML5, since 3.20150107. The `html5` option has
>> been repurposed to control whether we generate new-in-HTML5 semantic
>> markup like `<section>` and `<nav>` (`html5` enabled), or HTML4 equivalents
>> like `<div>` with a class (`html5` disabled). The default is still off,
>> although I should probably either toggle it to on or remove the option
>> altogether in the next release. --s

So perhaps we could try this Unicode-aware version of what Pandoc documents:

* Remove footnote links if any (this might have to be heuristic, or we could
  skip this step for a first implementation)
* Take only the plain text, no markup (passing the heading through HTML::Parser
  and collecting only the text nodes would be the fully-correct version of this,
  or we could fake it with regexes and be at least mostly correct)
* Strip punctuation, using some Unicode-aware definition of what is punctuation:
  perhaps `s/[^-\w_. ]//gu;` (delete anything that is not a (Unicode-aware) word
  character, hyphen-minus, underscore, dot or space)
* Replace spaces with hyphen-minus
* Force to lower-case with `lc`
* Strip leading digits and punctuation
* If the string is empty, use `section`
* If we already generated a matching identifier, append `-1`, `-2`, etc. until we find
  an unused identifier

(Or to provide better uniqueness, we could parse the document looking for any existing
ID, then append `-1`, `-2` to each generated ID until there is no collision.)

This would give us, for example, `## Visiting 北京` → `id="visiting-北京"`
(whereas Text::Unidecode would instead transliterate, resulting in
`id="visiting-bei-jing"`).
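
The steps above could be sketched roughly as follows. This is a Python illustration rather than plugin code, and it makes several assumptions: the input is already plain text (footnote and markup removal are skipped), `slugify` and `used` are names invented for the example, and "strip leading digits and punctuation" is read as "drop everything before the first letter".

```python
import re


def slugify(heading_text, used):
    """Return a Unicode-preserving identifier for a plain-text heading.

    `used` is the set of identifiers already taken on the page; the
    identifier that is returned gets added to it.
    """
    # Strip punctuation: keep (Unicode-aware) word characters,
    # hyphen-minus, dot and space. \w already covers underscore.
    slug = re.sub(r"[^-\w. ]", "", heading_text)
    # Replace spaces with hyphen-minus, then force to lower case.
    slug = slug.replace(" ", "-").lower()
    # Strip leading digits and punctuation (including underscore).
    slug = re.sub(r"^[\W\d_]+", "", slug)
    # Fall back to a fixed name if nothing is left.
    if not slug:
        slug = "section"
    # Append -1, -2, ... until the identifier is unused.
    candidate = slug
    counter = 0
    while candidate in used:
        counter += 1
        candidate = f"{slug}-{counter}"
    used.add(candidate)
    return candidate


used = set()
print(slugify("Visiting 北京", used))
print(slugify("Visiting 北京", used))
```

Seeding `used` with any IDs already present in the document would give the stronger uniqueness guarantee mentioned in the parenthetical above.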

To use these IDs in fragments, I would be inclined to rely on browsers
supporting [IRIs](https://tools.ietf.org/html/rfc3987): `<a href="#visiting-北京">`.
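
For comparison, this is what the same fragment looks like if it is converted to a plain ASCII URI by percent-encoding its UTF-8 bytes, as RFC 3987 describes for mapping IRIs to URIs. This is a Python sketch for illustration only; relying on IRIs means the page itself would not need to do this.

```python
from urllib.parse import quote, unquote

fragment = "visiting-北京"

# Percent-encode the UTF-8 bytes of every non-ASCII character.
encoded = quote(fragment)
print("#" + encoded)

# Decoding gets the original fragment back.
print(unquote(encoded) == fragment)
```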

--[[smcv]]

> I guess this makes sense. I just wonder how well this is actually supported in all
> browsers. I looked around and suspect this will work in more recent browsers, but,
> as an example, https://caniuse.com/ doesn't have that feature listed in their 
> tables. :) -- [[anarcat]]

>> That might well indicate that all major browsers have always supported it so
>> there is no need to check. I don't see any particular reason why a browser vendor
>> would not want to accept arbitrary non-whitespace as a valid anchor.
>>
>> In practice, minor or old browsers are probably insecure anyway, so I don't care
>> too much about supporting them perfectly... --s

----

Documentation says:

> _Also note that all heading attributes are overriden with the ID tag. If this
> is not desirable, we'd need to fire up a full HTML::Parser or do some more
> regex magic to preserve the attributes other than id= which we want to keep._

I think this is a bug, particularly if you are using Pandoc's
[header attributes](http://pandoc.org/MANUAL.html#extension-header_attributes)
or similar.

> It's not a bug, it's a limitation. :) But sure, it's a thing. It's an issue in
> headinganchors as well of course. -- [[anarcat]]

>> No, current/historical headinganchors has a different bug: it ignores headings
>> that have any attributes, and does not generate anchors for them. That gives it
>> degraded functionality, but no information loss. I think that's less bad. --s

I think we should try to use an existing ID before generating our own, with the
generation step as a fallback, just like Pandoc does. If a htmlize layer like
Text::MultiMarkdown or Pandoc is generating worse IDs than this plugin, the
right solution to that is to send a bug report / feature request to
make its IDs as good as this plugin's, or turn off ID generation in the
htmlize layer, or stop using Text::MultiMarkdown.
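
A rough sketch of that "existing ID first, generation as a fallback" behaviour, which also keeps every other heading attribute intact. This is Python for illustration, not plugin code: `add_missing_ids` and `make_id` are hypothetical names, and the regexes are the "mostly correct" kind (an `id=` inside another attribute's value would fool the check), so a real implementation would probably want HTML::Parser.

```python
import html
import re

# An <h1>..<h6> element: level, optional attributes, body.
HEADING = re.compile(r"<h([1-6])(\s[^>]*)?>(.*?)</h\1>", re.I | re.S)
# Crude check for an id attribute among the heading's attributes.
HAS_ID = re.compile(r"""\sid\s*=\s*["']?[^"'\s>]""", re.I)


def add_missing_ids(page_html, make_id):
    """Add id="..." to headings without one; leave all others untouched.

    `make_id` is any callable that maps heading text to an identifier.
    """
    def fix(match):
        level = match.group(1)
        attrs = match.group(2) or ""
        body = match.group(3)
        if HAS_ID.search(attrs):
            # An ID from the htmlize layer or the author wins.
            return match.group(0)
        # Crude markup stripping to get the heading's plain text.
        text = re.sub(r"<[^>]+>", "", body)
        new_id = html.escape(make_id(text), quote=True)
        # Prepend the id and keep the existing attributes as they were.
        return f'<h{level} id="{new_id}"{attrs}>{body}</h{level}>'

    return HEADING.sub(fix, page_html)
```

With this shape, a heading such as `<h2 class="x">` gains an `id` but keeps its `class`, and a heading that already has an `id` is returned unchanged.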

--[[smcv]]

> Agreed. However, the situation I was in was that multimarkdown *and* the 
> headinganchors plugins had issues I had to fix. So it was better and easier
> for me to just override whatever attributes were there for testing and 
> fixing this in the short term... -- [[anarcat]]

----

<pre>Some long scrollable text
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
<span id="пример">Example fragment ID in Russian should point here</span>
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.
.</pre>

> This works for me on ` Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0` on Debian stretch, FWIW. --[[anarcat]]