Hi all,
We have noticed that several use cases related to document indexing/analysis could benefit from a common framework:
- mentions identification (already has a dedicated solution, but it suffers from various shortcomings discussed below)
- document reference indexing
- attachment reference indexing (WIP)
- possibly, dead-links analysis
Note that the first application will be attachment reference indexing; the rest will need to be migrated over time.
Migrating the document reference indexing will have an interesting benefit. This time-consuming task is currently performed synchronously when a document is saved, which slows down saving; moving it to the queue would remove that cost. Warning: in exchange, we lose the guarantee that recent references to a document are taken into account during refactoring. This needs to be discussed before actually performing the migration.
The requirements are the following:
We want to keep a persistent queue of asynchronous tasks, and to have a low-priority thread consuming the queued tasks.
A task is a combination of a reference to a given version of a document and the kind of analysis to perform (e.g., mentions, links).
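To make this requirement more concrete, here is a minimal in-memory sketch of a task and of a queue drained by a low-priority thread. All the names (AnalysisTask, AnalysisTaskQueue, the thread name) are hypothetical and only for illustration; persistence and retry handling are left out on purpose.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

// Hypothetical task shape: the document version to analyze and the kind of analysis.
record AnalysisTask(String documentReference, String documentVersion, String kind) {}

// Hypothetical in-memory queue drained by one low-priority daemon thread.
final class AnalysisTaskQueue {
    private final BlockingQueue<AnalysisTask> queue = new LinkedBlockingQueue<>();
    private final Thread consumer;

    AnalysisTaskQueue(Consumer<AnalysisTask> analyzer) {
        this.consumer = new Thread(() -> {
            try {
                while (!Thread.currentThread().isInterrupted()) {
                    AnalysisTask task = queue.take(); // blocks until a task is queued
                    try {
                        analyzer.accept(task);
                    } catch (RuntimeException e) {
                        // Sketch only: a real implementation would record the failed attempt.
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // stop() was called
            }
        }, "document-analysis-consumer");
        this.consumer.setPriority(Thread.MIN_PRIORITY);
        this.consumer.setDaemon(true);
    }

    void start() {
        consumer.start();
    }

    void add(AnalysisTask task) {
        queue.add(task);
    }

    int size() {
        return queue.size();
    }

    void stop() {
        consumer.interrupt();
    }
}
```

A caller would create the queue with the analysis callback, call start() once, and then add(new AnalysisTask("xwiki:Space.document", "1.1", "mentions")) from the save listener.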
Use cases:
- Queuing tasks at upgrade time (for instance, to perform a first indexing of all the attachment references)
- Queuing tasks when a document is saved
- Queuing tasks when a document is removed
- Being able to monitor the queue progress:
  - using JMX
  - with a UI in the administration, similar to what we currently provide for Solr document indexing (for this I propose a new Document Analysis sub-section below the Content section in the administration)
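For the JMX part, a rough sketch of what could be exposed is given below, using only the standard javax.management API. The bean name, the attribute names, and the counters are assumptions made for illustration, not an agreed design.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.JMException;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical holder class; every name in it is illustrative only.
final class TaskQueueMonitoring {
    // JMX requires the management interface itself to be public.
    public interface TaskQueueStatsMXBean {
        long getQueueSize();

        long getProcessedCount();

        long getFailedCount();
    }

    static final class TaskQueueStats implements TaskQueueStatsMXBean {
        private final AtomicLong queued = new AtomicLong();
        private final AtomicLong processed = new AtomicLong();
        private final AtomicLong failed = new AtomicLong();

        void taskQueued() {
            queued.incrementAndGet();
        }

        // Assumption of this sketch: a failed task leaves the queue instead of being retried.
        void taskDone(boolean success) {
            queued.decrementAndGet();
            (success ? processed : failed).incrementAndGet();
        }

        @Override
        public long getQueueSize() {
            return queued.get();
        }

        @Override
        public long getProcessedCount() {
            return processed.get();
        }

        @Override
        public long getFailedCount() {
            return failed.get();
        }
    }

    // Registers the stats on the platform MBean server under the given name.
    static ObjectName register(TaskQueueStats stats, String name) {
        try {
            ObjectName objectName = new ObjectName(name);
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            server.registerMBean(stats, objectName);
            return objectName;
        } catch (JMException e) {
            throw new IllegalStateException("Could not register " + name, e);
        }
    }
}
```

With something like register(stats, "org.example.index:type=TaskQueueStats") (a made-up object name), the three attributes become readable from any JMX console, and the administration UI could read the same counters.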
We can also distinguish two kinds of tasks:
- strict: to be executed on each (major?) version of a document
- “best effort”: to be eventually executed on the latest version of the document (it’s ok to skip intermediate versions)
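One possible way to implement the difference between the two kinds is sketched below: strict tasks are all kept, while a pending best-effort task is replaced when a newer version of the same document is queued for the same kind of analysis. The names (TaskMode, PendingTask, PendingTasks) are hypothetical, and the sketch assumes tasks are added in version order.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names, for illustration only.
enum TaskMode { STRICT, BEST_EFFORT }

record PendingTask(String documentReference, String documentVersion, String kind, TaskMode mode) {}

final class PendingTasks {
    // Strict tasks: one entry per queued version, nothing is dropped.
    private final List<PendingTask> strict = new ArrayList<>();

    // Best-effort tasks: at most one pending entry per (document, kind).
    private final Map<String, PendingTask> bestEffort = new LinkedHashMap<>();

    void add(PendingTask task) {
        if (task.mode() == TaskMode.STRICT) {
            strict.add(task);
        } else {
            // A later version overwrites the pending one, so intermediate versions are skipped.
            bestEffort.put(task.documentReference() + "|" + task.kind(), task);
        }
    }

    // Returns the strict tasks first, then the surviving best-effort tasks.
    List<PendingTask> snapshot() {
        List<PendingTask> all = new ArrayList<>(strict);
        all.addAll(bestEffort.values());
        return all;
    }
}
```

For instance, queuing versions 1.1, 1.2 and 2.1 of the same document for a best-effort kind leaves a single pending task for 2.1, whereas the same three versions queued for a strict kind leave three tasks.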
Regarding the general architecture:
- the UI-related part will be added to xwiki-platform-index-ui
- the Java code will be added to a new xwiki-platform-index-api
Persistence:
For the mentions, we have experimented with persisting the queue in an H2 MVStore database. Over time, we realized this approach suffers from one important shortcoming: when H2 is upgraded to a new major version, it is difficult to migrate the existing store (we currently remove the database and create a new one).
I propose to use a new table in the wiki database instead. This table would persist the tasks for the current wiki.
Note that in memory, a single queue holds the tasks of the whole farm.
Example of table structure:
- timestamp: needed to order the tasks when re-building the queue at startup
- document_reference: the full reference (including the wiki id) of the document to analyze (e.g., xwiki:Space.document)
- document_version: the version of the document to analyze
- kind: the hint of the task to perform on the document (e.g., mentions, links)
- attempts: the number of failed tries when performing the task
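As an illustration of how these columns could be used, the sketch below models one row and rebuilds the single farm-wide queue at startup by merging the rows loaded from each wiki and ordering them by timestamp. TaskRow and QueueRebuilder are hypothetical names, and the actual loading from the database is left out.

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Comparator;
import java.util.Deque;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical in-memory view of one row of the proposed table.
record TaskRow(long timestamp, String documentReference, String documentVersion, String kind, int attempts) {
    // Returns a copy of the row with one more failed try recorded.
    TaskRow withFailedAttempt() {
        return new TaskRow(timestamp, documentReference, documentVersion, kind, attempts + 1);
    }
}

final class QueueRebuilder {
    // Merges the rows of every wiki into one queue, oldest task first.
    static Deque<TaskRow> rebuild(Collection<List<TaskRow>> rowsPerWiki) {
        return rowsPerWiki.stream()
            .flatMap(List::stream)
            .sorted(Comparator.comparingLong(TaskRow::timestamp))
            .collect(Collectors.toCollection(ArrayDeque::new));
    }
}
```

Since document_reference includes the wiki id, the in-memory queue can still tell which wiki's table a task must be removed from once it has been processed.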
WDYT?