151 lines
5.4 KiB
Markdown
151 lines
5.4 KiB
Markdown
I would not be comfortable with merging this into headinganchors and enabling it by
|
|
default for two main reasons:
|
|
|
|
* it adds a new dependency on [[!cpan Text::Unidecode]]
|
|
* Text::Unidecode specifically documents its transliteration as not being stable
|
|
across versions
|
|
|
|
There are several "slugify" libraries available other than Text::Unidecode.
|
|
It isn't clear to me which one is the best. Pandoc also documents
|
|
[an algorithm for generating slugs](http://pandoc.org/MANUAL.html#extension-auto_identifiers),
|
|
and it would be nice if our fallback implementation (with i18n disabled) was compatible
|
|
with Pandoc's, at least for English text.
|
|
|
|
However! In HTML5, IDs are allowed to contain anything except _space characters_
|
|
(space, newline, tab, CR, FF), so we could consider just passing non-ASCII
|
|
through the algorithm untouched. This [example link to a Russian
|
|
anchor name](#пример) (the output of putting "example" into English-to-Russian
|
|
Google Translate) hopefully works? (Use a small browser window to make it
|
|
clearer where it goes)
|
|
|
|
> Can we assume Ikiwiki generates HTML5 all the time? I thought that was still a
|
|
> setting off by default... --[[anarcat]]
|
|
|
|
>> ikiwiki always generates HTML5, since 3.20150107. The `html5` option has
|
|
>> been repurposed to control whether we generate new-in-HTML5 semantic
|
|
>> markup like `<section>` and `<nav>` (`html5` enabled), or HTML4 equivalents
|
|
>> like `<div>` with a class (`html5` disabled). The default is still off,
|
|
>> although I should probably either toggle it to on or remove the option
|
|
>> altogether in the next release. --s
|
|
|
|
So perhaps we could try this Unicode-aware version of what Pandoc documents:
|
|
|
|
* Remove footnote links if any (this might have to be heuristic, or we could
|
|
skip this step for a first implementation)
|
|
* Take only the plain text, no markup (passing the heading through HTML::Parser
|
|
and collecting only the text nodes would be the fully-correct version of this,
|
|
or we could fake it with regexes and be at least mostly correct)
|
|
* Strip punctuation, using some Unicode-aware definition of what is punctuation:
|
|
perhaps `s/[^-\w_. ]//gu;` (delete anything that is not a (Unicode-aware) word
|
|
character, hyphen-minus, underscore, dot or space)
|
|
* Replace spaces with hyphen-minus
|
|
* Force to lower-case with `lc`
|
|
* Strip leading digits and punctuation
|
|
* If the string is empty, use `section`
|
|
* If we already generated a matching identifier, append `-1`, `-2`, etc. until we find
|
|
an unused identifier
|
|
|
|
(Or to provide better uniqueness, we could parse the document looking for any existing
|
|
ID, then append `-1`, `-2` to each generated ID until there is no collision.)
|
|
|
|
This would give us, for example, `## Visiting 北京` → `id="visiting-北京"`
|
|
(whereas Text::Unidecode would instead transliterate, resulting in
|
|
`id="visiting-bei-jing"`).
|
|
|
|
To use these IDs in fragments, I would be inclined to rely on browsers
|
|
supporting [IRIs](https://tools.ietf.org/html/rfc3987): `<a href="#visiting-北京">`.
|
|
|
|
--[[smcv]]
|
|
|
|
> I guess this makes sense. I just wonder how well this is actually supported in all
|
|
> browsers.. I looked around and suspect this will work in more recent browsers, but,
|
|
> as an example, https://caniuse.com/ doesn't have that feature listed in their
|
|
> tables. :) -- [[anarcat]]
|
|
|
|
>> That might well indicate that all major browsers have always supported it so
|
|
>> there is no need to check. I don't see any particular reason why a browser vendor
|
|
>> would not want to accept arbitrary non-whitespace as a valid anchor.
|
|
>>
|
|
>> In practice, minor or old browsers are probably insecure anyway, so I don't care
|
|
>> too much about supporting them perfectly... --s
|
|
|
|
----
|
|
|
|
Documentation says:
|
|
|
|
> _Also note that all heading attributes are overriden with the ID tag. If this
|
|
> is not desirable, we'd need to fire up a full HTML::Parser or do some more
|
|
> regex magic to preserve the attributes other than id= which we want to keep._
|
|
|
|
I think this is a bug, particularly if you are using Pandoc's
|
|
[header attributes](http://pandoc.org/MANUAL.html#extension-header_attributes)
|
|
or similar.
|
|
|
|
> It's not a bug, it's a limitation. :) But sure, it's a thing. It's an issue in
|
|
> headinganchors as well of course. -- [[anarcat]]
|
|
|
|
>> No, current/historical headinganchors has a different bug: it ignores headings
|
|
>> that have any attributes, and does not generate anchors for them. That gives it
|
|
>> degraded functionality, but no information loss. I think that's less bad. --s
|
|
|
|
I think we should try to use an existing ID before generating our own, with the
|
|
generation step as a fallback, just like Pandoc does. If a htmlize layer like
|
|
Text::MultiMarkdown or Pandoc is generating worse IDs than this plugin, the
|
|
the right solution to that is to send a bug report / feature request to
|
|
make its IDs as good as this plugin's, or turn off ID generation in the
|
|
htmlize layer, or stop using Text::MultiMarkdown.
|
|
|
|
--[[smcv]]
|
|
|
|
> Agreed. However, the situation I was in was that multimarkdown *and* the
|
|
> headinganchors plugins had issues I had to fix. So it was better and easier
|
|
> for me to just override whatever attributes were there for testing and
|
|
> fixing this in the short term... -- [[anarcat]]
|
|
|
|
----
|
|
|
|
<pre>Some long scrollable text
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
<span id="пример">Example fragment ID in Russian should point here</span>
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.
|
|
.</pre>
|
|
|
|
> This works for me on ` Mozilla/5.0 (X11; Linux x86_64; rv:50.0) Gecko/20100101 Firefox/50.0` on Debian stretch, FWIW. --[[anarcat]]
|