Benchmarking refresh of a a wiki with 25 thousand pages showed
file_pruned() using most of the time. But, when refreshing, ikiwiki already
knows about nearly all the files. So we can skip calling file_pruned() for
those it knows about. While tricky to do, this sped up a refresh (that
otherwise does no work) by 10-50%.
This is very common, and the code has to test each type differently, since
the list of candidates to test, as well as the test, will vary per type.
Much happier with this code now.
This new method for determining when links on pages
have changed should be more efficient, since it avoids
double calculation of the bestlinks.
It also allows collecting data about the old links, before
the scan pass, so the data is accurate when pages move around
and bestlinks change.
Also got some code back to a saner indent level.
Involved some code refactoring so that same code that detects
link changes for backlinks updating can be used for link dependency
checking. The nice thing is that link dep checking is thus
comopletly free!
Preliminary support, anyway.
If a dependency only includes DEPEND_EXISTS, then only changes that
involved adding or deleting a page can trigger it.
This is complicated by internal pages, since the code did not previously
differentiate between add, delete, and change of internal pages.
Now it tracks change separately from add+delete, so DEPEND_EXISTS pagespecs
that actually match internal pages (which will probably be quite rare in
practice) should work.
As soon as a change happens, we know we will need to rescan all
dependencies from the start, so bail out of the current scan partway to
avoid doing redundant work.
Only problem with this is that ikiwiki sometimes ends up printing out
dependencies that, while correct, are not obvious. Before:
building B, which depends on A
building C, which depends on A
building D, which depends on A
After:
building B, which depends on A
building C, which depends on B
building D, which depends on C
It's not "exact" since case munging has to be done, and I think
"simple" captures the optimisation better.</pedant>
With apologies to smcv, who probably has to rebuild his wiki now.
Let E be the number of dependencies per page of the form "A depends on B and
nothing else", let D be the number of other dependencies per page,
let P be the total number of pages, and let C be the number of changed
pages in a refresh.
This patch should speed up a refresh from O(E*C*P + D*C*P) to
O(C + E*P + D*C*P), assuming that hash lookups are O(1).
In practice, plugins like inline and map produce a lot of these very simple
dependencies, and my album plugin's combination of inline with a large
number of pages causes it to suffer particularly badly.
In testing on a wiki with about 7000 objects (3500 full pages, 3500
images), a full rebuild continued to take about 5:30, and a refresh
after touching about 350 pages and 350 images reduced from 5:30 to 1:30.
As with my previous optimizations, this change will result in downgrades not
working correctly until the wiki is rebuilt.
This should be more efficient than pagespec_match_list since it short-circuits
after the first match is found.
The other problem with using pagespec_match_list here is it may throw an
error if a bad or failing pagespec somehow got into the dependencies.
On a large wiki you can spend a lot of time reading through large lists
of dependencies to see whether files need to be rebuilt (album, with its
one-page-per-photo arrangement, suffers particularly badly from this).
The dependency list is currently a single pagespec, but it's not used like
a normal pagespec - in practice, it's a list of pagespecs joined with the
"or" operator.
Accordingly, change it to be stored as a list of pagespecs. On a wiki
with many tagged photo albums, this reduces the time to refresh after
`touch tags/*.mdwn` from about 31 to 25 seconds.
Getting the benefit of this change on an existing wiki requires a rebuild.
During backlink calulation, all links are examined and broken links can
be detected for free, so store a list of broken links and have brokenlinks
use it.
Exposing the %brokenlinks structure is a bit ugly, but the speedup seems
worth it: Around 1 second for wikis the size of the doc wiki that use
brokenlinks.
By adding this setting, we get both more configurability, and a minor
optimisation too, since gettext does not need to be called continually
to get the Discussion value.