Thought of a cleaner way to accumulate all influences in
pagespec_match_list, using the pagespec_match result object as an
accumulator.
(This also accumulates all influences from failed matches, rather than just
one failed match. I'm not sure if the old method was correct.)
Benchmarking refresh of a a wiki with 25 thousand pages showed
file_pruned() using most of the time. But, when refreshing, ikiwiki already
knows about nearly all the files. So we can skip calling file_pruned() for
those it knows about. While tricky to do, this sped up a refresh (that
otherwise does no work) by 10-50%.
If a pagespec fails to match, I had been throwing the influences away, but
that is not right. Consider `backlink(foo)`, where foo does not exist.
It still needs to be added as an influence, because if it is created, it
will influence the pagespec to match.
But with that fix, `link(bar)` had as influences all pages, whether they
link to bar or not. Which is not necessary, because modifiying a page to
add a link to bar will directly cause the pagespec to match.
So, in match_link (and all the match_* functions for page metadata),
only return an influence if the match succeeds.
match_backlink had been implemented as the inverse of match_link, but that
is no longer completly true. While match_link does not return an influence
on failure, match_backlink does.
match_created_before/after also return the influence on failure, this way
if created_after(foo) currently fails because foo does not exist, it will
still update the page with the pagespec if foo is created.
The hash will be used used to record a set of pages that influenced the
result of a pagespec match.
The influences are merged together when boolean and/or are encountered
in a pagespec. That means using a non-short-circuiting OR operator. And
so I use & and | when translating pagespecs, since those bitwise operators
can be overloaded. ("and" and "or" cannot, apparently).
Involved some code refactoring so that same code that detects
link changes for backlinks updating can be used for link dependency
checking. The nice thing is that link dep checking is thus
comopletly free!
When adding a contentless dependency, the pagespec also needs to be one
that does not look at any page content information.
As a first approximation of that, only allow glob-based pagespecs in
contentless dependencies. While there are probably a few other types of
pagespecs that can match contentless, this will work for most of them.
Dependency types are represented by bits in the values of the %depends
and %depends_simple hashes.
Change the dependslist array saved to the index to a depends hash.
depends_simple is also converted from an array to a hash.
Note that the depends field used to be a string, and we still
have compat code to handle upgrades from that, as well as from the arrays.
I didn't use ikiwiki-transition because I don't want ikiwiki to break if
users forget to run it; also we're going to recommend a full rebuild on
upgrade to this version to get the improved dependency handling. So
this compat code can be removed or moved to ikiwiki-transition later.
Here I was bitten by perl's aliasing of foreach variables
to the loop array contents, and match_link accidentially changed
the contents of %links.
In Jon's testcase, a tag added an absolute link, which was
made relative by the above bug, and then the link was added
again in preprocess, and turned into a duplicate.
I weakended the regexp, so this matches ipv6 addresses too. It does not
ensure that the address is valid, but that should not matter here.
Note that addresses ending in "::" are not matched, so eg, the unspecified
address will not match -- but should never appear here anyway.
It's not "exact" since case munging has to be done, and I think
"simple" captures the optimisation better.</pedant>
With apologies to smcv, who probably has to rebuild his wiki now.
Let E be the number of dependencies per page of the form "A depends on B and
nothing else", let D be the number of other dependencies per page,
let P be the total number of pages, and let C be the number of changed
pages in a refresh.
This patch should speed up a refresh from O(E*C*P + D*C*P) to
O(C + E*P + D*C*P), assuming that hash lookups are O(1).
In practice, plugins like inline and map produce a lot of these very simple
dependencies, and my album plugin's combination of inline with a large
number of pages causes it to suffer particularly badly.
In testing on a wiki with about 7000 objects (3500 full pages, 3500
images), a full rebuild continued to take about 5:30, and a refresh
after touching about 350 pages and 350 images reduced from 5:30 to 1:30.
As with my previous optimizations, this change will result in downgrades not
working correctly until the wiki is rebuilt.
Now that dependencies are a list of pagespecs with an implicit "or"
operation, there's no need to try to merge pagespecs under normal use.
ikiwiki-transition contains the only use of the function, so move
it there rather than deleting it entirely (it's used to concatenate all
admins' lists of locked pages).
On a large wiki you can spend a lot of time reading through large lists
of dependencies to see whether files need to be rebuilt (album, with its
one-page-per-photo arrangement, suffers particularly badly from this).
The dependency list is currently a single pagespec, but it's not used like
a normal pagespec - in practice, it's a list of pagespecs joined with the
"or" operator.
Accordingly, change it to be stored as a list of pagespecs. On a wiki
with many tagged photo albums, this reduces the time to refresh after
`touch tags/*.mdwn` from about 31 to 25 seconds.
Getting the benefit of this change on an existing wiki requires a rebuild.
By adding this setting, we get both more configurability, and a minor
optimisation too, since gettext does not need to be called continually
to get the Discussion value.
On various sites I have two IkiWiki instances running from the same
repository: one accessible via http and only accepting openid logins,
and one accessible via authenticated https and only accepting httpauth.
The https version should still pretty-print OpenIDs seen in git history,
even though it does not itself accept OpenID logins.
The test suite was emitting a lot of ugly gettext warnings;
setting LC_ALL didn't solve the problem for all locale setups
(since ikiwiki remaps it to LANG, and ikiwiki didn't know about
the C locale).
People also seem generally annoyed by the messages when
Locale::Gettext is not installed, and I suspect will be
generally happier if it just silently doesn't localize.
The optimisation came about when I noticed that the gettext
sub was doing rather a lot of work each call just to see
if localisation is needed. We can avoid that work by caching,
and the best thing to cache is a version of the gettext sub
that does exactly the right thing.
This was slightly complicated by the locale setting,
which might need to override the original locale (or lack
thereof) after gettext has been called. So it needs to invalidate
the cache in that case. It used to do it via a global variable,
which I am happy to have also gotten rid of.
Do not allow an unterminated """ string to be treated as a series of bare
words. Fixes runaway regexp recursion/backtracking in strange situations.
(See 1d57a21c98 for test case.)
And avoid a whole class of potential security problems (though
none that I know of actually existing..), by avoiding
performing any string interpolation on user-supplied data when translating
pagespecs.
This is sorta an optimisation, and sorta a bug fix. In one
test case I have available, it can speed a page build up from 3
minutes to 3 seconds.
The root of the problem is that $links{$page} contains arrays of
links, rather than hashes of links. And when a link is found,
it is just pushed onto the array, without checking for dups.
Now, the array is emptied before scanning a page, so there
should not be a lot of opportunity for lots of duplicate links
to pile up in it. But, in some cases, they can, and if there
are hundreds of duplicate links in the array, then scanning it
for matching links, as match_link and some other code does,
becomes much more expensive than it needs to be.
Perhaps the real right fix would be to change the data structure
to a hash. But, the list of links is never accessed like that,
you always want to iterate through it.
I also looked at deduping the list in saveindex, but that does
a lot of unnecessary work, and doesn't completly solve the problem.
So, finally, I decided to add an add_link function that handles deduping,
and make ikiwiki-transition remove the old dup links.
This reverts commit 2f96c49bd1.
I forgot about internal pages. We don't want * matching them!
I left the optimisation in pagecount, where it used to live.
Internal pages probably don't matter when they're just being
counted.
* pagespec_match_list: New API function, matches pages in a list
and throws an error if the pagespec is bad.
* inline, brokenlinks, calendar, linkmap, map, orphans, pagecount,
pagestate, postsparkline: Display a handy error message if the pagespec
is erronious.
* Add IkiWiki::ErrorReason objects, and modify pagespecs to return
them in cases where they fail to match due to a configuration or syntax
error.
* inline: Display a handy error message if the inline cannot display any
pages due to such an error.
This is perhaps somewhat incomplete, as other users of pagespecs do not
display the error, and will eventually need similar modifications to inline.
I should probably factor out a pagespec_match_all function and make it throw
ErrorReasons.
The problem was introduced by the recent noextension patches.
Object autovivification caused junk to get into %htmlize,
and all keys of that showed up as page types.
Because getopt::long is used in passthrough mode, if a known
option like --wikiname that needs a parameter is specified w/o
the parameter, it will not be processed, and passed on through.
So in this case the "unknown option" message is innaccurate.
Make it slightly better by noting that the problem can be a missing
parameter.
This modification was initially done in editpage, in commit
a3726968bc, but was then lost while merging
upstream/master branch.
Signed-off-by: intrigeri <intrigeri@boum.org>
It no longer makes sense to keep these functions in editpage, because
serveral plugins now exist that use them, and users may want to disable
editpage, while leaving those plugins enabled.
Most notably, comments uses both functions, and it's entirely appropriate
to disable editpage but still want to have comments enabled.
Less likely, attachments, rename, and remove all use check_canedit -- but
it would be unusual indeed to want to use these w/o editpage.
It used to replace unknown functions with "0" when translating a pagespec.
Instead, replace it with a FailReason object. This way, the pagespec will
still evaluate as before (possibly successfully if other terminals exist),
but a human-readable error will be shown if the result is displayed.
Also, an empty pagespec used to be replaced with "0", to avoid a eval
error. Also use a FailReason here.
It seems to be a failing of i18n in unix that the translation stops at the
commands and the parameters to them, and ikiwiki is no exception with its
currently untranslated directives. So the little bit that's translated sticks
out like a sore thumb. It also breaks building of wikis if a different locale
happens to be set.
I suppose the best thing to do is either give up on the localisation of this
part completly, or make it recognise English in addition to the locale. I've
tenatively chosen the latter.
(Also accept 1 and 0 as input.)
inline has a format hook that is an optimisation hack. Until this hook
runs, the inlined content is not present on the page. This can prevent
other format hooks, that process that content, from acting on inlined
content. In bug ##509710, we discovered this happened commonly for the
embed plugin, but it could in theory happen for many other plugins (color,
cutpaste, etc) that use format to fill in special html after sanitization.
The ordering was essentially random (hash key order). That's kinda a good
thing, because hooks should be independent of other hooks and able to run
in any order. But for things like inline, that just doesn't work.
To fix the immediate problem, let's make hooks able to be registered as
running "first". There was already the ability to make them run "last".
Now, this simple first/middle/last ordering is obviously not going to work
if a lot of things need to run first, or last, since then we'll be back to
being unable to specify ordering inside those sets. But before worrying about
that too much, and considering dependency ordering, etc, observe how few
plugins use last ordering: Exactly one needs it. And, so far, exactly one
needs first ordering. So for now, KISS.
Another implementation note: I could have sorted the plugins with
first/last/middle as the primary key, and plugin name secondary, to get a
guaranteed stable order. Instead, I chose to preserve hash order. Two
opposing things pulled me toward that decision:
1. Since has order is randomish, it will ensure that no accidental
ordering assumptions are made.
2. Assume for a minute that ordering matters a lot more than expected.
Drastically changing the order a particular configuration uses could
result in a lot of subtle bugs cropping up. (I hope this assumption is
false, partly due to #1, but can't rule it out.)
A malformed pagespec will cause $@ to be set when translated, but if
it is used a second time, the memoization will defeat that check. Better to
check for the result not being defined.
This avoids constructing urls like "./../foo/".
The leading "../" avoids any colon confusion already.
I noticed in my logs that certain badly written web spiders (hello again,
Yahoo!) fail to follow urls like ikiwiki was constructing to the right
place (instead ending up at "./foo/")
Since ikiwiki uses open :utf8, perl assumes that files contain valid utf-8.
If it turns out to be malformed it may later crash while processing strings
read from them, with 'Malformed UTF-8 character (fatal)'.
As at least a quick fix, use utf8::valid as soon as data is read, and if
it's not valid, call encode_utf8 on the string, thus clearing the utf-8
flag. This may cause follow-on encoding problems, but will avoid this
crash, and the input file was broken anyway, so GIGO is a reasonable
response. (I looked at calling decode_utf8 after, but it seemed to cause
more trouble than it was worth. BTW, use open ':encoding(utf8)' avaoids
this problem, but the corrupted data later causes Storable to crash when
writing the index.)
This is a quick fix, clearly imperfect:
- It might be better to explicitly call decode_utf8 when reading files,
rather than using the IO layer.
- Data read other than by readfile() can still sneak in bad utf-8. While
ikiwiki does very little file input not using it, stdin for the CGI
would be one way.
This is necessary so that things that fork to the background,
like pinger, and inline ping, don't block other cgis from running.
Note that websetup also calls unlockwiki, before refreshing / rebuilding
the wiki. It makes perfect sense for that not to block other cgis.
* Stop busy-waiting in lockwiki, as this could delay ikiwiki from waking up
for up to one second. The bailout code is no longer needed.
* Remove support for unused optional wait parameter from lockwiki.
This fixes a problem exposed by the recent change to tags
(a2839de936). That recorded tag links as
absolute by including a leading slash in the link. The same could also be
done with an absolute wikilink.
In either case, link() would not match such links, unless the leading slash
was included in the link to match. But that's not right, because pagespecs
match absolute by default. So strip the leading slash.
Note that to keep any existing `link(/foo)` pagespecs working after this
change, the leading slash is removed from there, too.