Asynchronous Document Analysis Task

Hi all,

We have noticed that several use cases related to document indexing/analysis could benefit from a common framework:

  • mentions identification (already has a dedicated solution, but suffers from various shortcomings discussed below)
  • document reference indexing
  • attachment reference indexing (WIP)
  • possibly, dead-links analysis

Note that the first application will be for attachment reference indexing and the rest will need to be migrated over time.

Migrating the document reference indexing will have an interesting benefit: currently this time-consuming task is performed synchronously when a document is saved, which makes saving a document slower (warning: with the asynchronous approach we lose the guarantee that recent references to a document will be taken into account during refactoring; to be discussed before actually performing this migration).

The requirements are the following:
We want to be able to keep a persistent queue of asynchronous tasks, and to have a low-priority thread consuming the queued tasks.
A task is a combination of the reference to a given version of a document, and the kind of analysis to perform (e.g., mentions, links).
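
A minimal sketch of what that could look like (all names are hypothetical, this is not an actual API):

    import java.util.concurrent.BlockingDeque;
    import java.util.concurrent.LinkedBlockingDeque;

    // A task: the reference to a given version of a document, plus the kind of
    // analysis to perform.
    final class AnalysisTask
    {
        final String documentReference; // e.g., "xwiki:Space.Document"
        final String documentVersion;   // e.g., "2.1"
        final String kind;              // e.g., "mentions", "links"

        AnalysisTask(String documentReference, String documentVersion, String kind)
        {
            this.documentReference = documentReference;
            this.documentVersion = documentVersion;
            this.kind = kind;
        }
    }

    // A single low-priority daemon thread consuming the queued tasks.
    final class AnalysisTaskConsumer implements Runnable
    {
        private final BlockingDeque<AnalysisTask> queue = new LinkedBlockingDeque<>();

        void start()
        {
            Thread thread = new Thread(this, "document-analysis");
            thread.setPriority(Thread.MIN_PRIORITY);
            thread.setDaemon(true);
            thread.start();
        }

        void offer(AnalysisTask task)
        {
            this.queue.offerLast(task);
        }

        @Override
        public void run()
        {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    AnalysisTask task = this.queue.takeFirst();
                    // Resolve the analyzer matching task.kind and execute it here.
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }
    }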

Use cases:

  • Queuing tasks at upgrade time (for instance, to perform a first indexing of all the attachment references)
  • Queuing tasks when a document is saved
  • Queuing tasks when a document is removed
  • Being able to monitor the queue progress:
    • using JMX (see the MBean sketch after this list)
    • with a UI in the administration, similar to what we currently provide for Solr document indexing (for this I propose a new Document Analysis sub-section below the Content section in the administration)
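
For the JMX part, here is a minimal sketch of what the MBean contract could be (all names are hypothetical):

    import java.lang.management.ManagementFactory;

    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    // Standard MBean contract: for the default JMX naming convention to apply,
    // the implementation class must be named DocumentAnalysisQueue.
    interface DocumentAnalysisQueueMBean
    {
        long getQueueSize();

        long getCompletedTaskCount();

        long getFailedTaskCount();
    }

    final class JMXRegistration
    {
        static void register(DocumentAnalysisQueueMBean mbean) throws Exception
        {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            server.registerMBean(mbean, new ObjectName("org.xwiki:name=documentAnalysisQueue"));
        }
    }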

We can also distinguish two kinds of tasks:

  • strict: to be executed on each (major?) version of a document
  • “best effort”: to be eventually executed on the latest version of the document (it’s OK to skip intermediate versions; see the sketch after this list)
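
To illustrate the difference, a sketch of a possible queuing rule (reusing the hypothetical AnalysisTask from the sketch above):

    import java.util.Deque;
    import java.util.Objects;

    enum TaskKind
    {
        // Executed on each (major?) version of a document.
        STRICT,
        // Only the latest queued version of the document matters.
        BEST_EFFORT
    }

    final class QueuingPolicy
    {
        static void queue(Deque<AnalysisTask> queue, AnalysisTask task, TaskKind taskKind)
        {
            if (taskKind == TaskKind.BEST_EFFORT) {
                // Drop pending tasks for older versions of the same document and
                // analysis kind: only the latest version will be analyzed.
                queue.removeIf(pending -> Objects.equals(pending.documentReference, task.documentReference)
                    && Objects.equals(pending.kind, task.kind));
            }
            queue.addLast(task);
        }
    }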

Regarding the general architecture:

  • the UI related part will be added to xwiki-platform-index-ui
  • the Java code will be added to a new xwiki-platform-index-api

Persistence:
For the mentions, we experimented with persisting the queue in an H2 MVStore database. Over time, we realized this approach suffers from one important shortcoming: when H2 upgrades to a major version, it is difficult to migrate the data to the new version (we currently remove the database and create a new one).
I propose to use a new table in the wiki database instead. This table would persist the tasks for the current wiki.
Note that in-memory, a single queue exists with the tasks of the whole farm.

Example of table structure (a mapping sketch follows the list):

  • timestamp: needed to order the tasks when re-building the queue at startup
  • document_reference: the full reference (including the wiki id) of the document to analyze (e.g., xwiki:Space.document)
  • document_version: the version of the document to analyze
  • kind: the hint of the task to perform on the document (e.g., mentions, links)
  • attempts: the number of failed tries when performing the task

WDYT?

The idea is interesting, I have some questions though:

  1. Since it’s async, how do you handle possible conflicts?
  2. Do you plan to flag synchronously the fact that a document is inconsistent?
  3. Why save the attempts number in the DB? Do you plan to have a retry mechanism?

For 1. I’m thinking about two possible situations (talking about an attachment move in my example):

  • the document is updated before the job is executed to remove the attachment reference: how should the job react? Just skip it? Warn?
  • the job updates the document while someone is editing it: I guess it’s automatically covered by our mechanism to prevent conflicts, so it should be OK

For 2. the UC I’m thinking of is the export of documents: you’ve moved an attachment, but you update the ref asynchronously, so some docs are not consistent anymore. If someone tries to export a doc using the attachment at the same time, they won’t get the expected result IMO. So maybe it would be worth it to flag the doc immediately as being inconsistent, so that we can handle this kind of UC: either by disabling some actions in the UI until it’s consistent again, or by forcing the execution of the job when it’s required.

For 3. I’m wondering why it’s needed?

That’s a good point. The approach I had in mind was to keep it “best effort” but that might be an issue on a very busy wiki where the queue stays filled for a long time.

Another solution is to do something like IntelliJ does and block some actions (or show a warning) when some kind of task is present in the queue (for instance, blocking attachment moves until all the attachment indexing tasks are done).

Finally, I’m not fully sure how robust this third approach is, but we could keep a timestamped log of the refactoring actions (e.g., attachment X has been renamed to Y at timestamp Z), use that information when performing the analysis, and also perform refactoring at the same time (not sure I’m clear, and it sounds complex). Also, that implies persisting that log, and raises the question of when and how to remove outdated entries, otherwise it will quickly become space-consuming (unless we use some heavyweight framework like Kafka…).

The thing is, we don’t know which documents are impacted by the attachment move (and hence need refactoring) until the asynchronous analysis is done. The only reliable solution I see is to simply block the move action until all the attachment analysis tasks are done, which is not great.

Yes, it shouldn’t happen often, but we have put it in place for the mentions and I think it’s worth keeping. For random reasons, some part of an otherwise correct code can fail, and I think it’s interesting to retry a limited number of times (but not too many, in case the analysis code actually has a bug; for the mentions the limit is hard-coded to 10).
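
As a sketch, the retry rule could be as simple as this (the limit of 10 is the one hard-coded for the mentions; the rest is hypothetical):

    final class RetryPolicy
    {
        // Hard-coded to 10 for the mentions today; could become configurable.
        private static final int MAX_ATTEMPTS = 10;

        /**
         * @param attempts the number of failed tries persisted with the task
         * @return true when the task should be re-queued, false when we give up
         *         (the analysis code probably has a bug)
         */
        static boolean shouldRetry(int attempts)
        {
            return attempts < MAX_ATTEMPTS;
        }
    }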

For some context, what we minimally need is a persistent asynchronous queue of attachment analysis tasks for all the documents of a farm during the migration process to 14.1RC1+.
We could also keep the attachment analysis synchronous on save, with the performance cost that this implies.
I’ll try to get an estimate of the time spent in com.xpn.xwiki.store.XWikiHibernateStore#saveLinks to decide if going fully asynchronous is worth the effort.
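
For reference, a throwaway helper like this should be enough for the measurement (a sketch; the exact saveLinks signature may differ, and all names here are made up):

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    final class CallTiming
    {
        private static final Logger LOGGER = LoggerFactory.getLogger(CallTiming.class);

        @FunctionalInterface
        interface ThrowingCall
        {
            void call() throws Exception;
        }

        static void time(String label, ThrowingCall call) throws Exception
        {
            long start = System.nanoTime();
            call.call();
            LOGGER.info("[{}] took [{}] ms", label, (System.nanoTime() - start) / 1_000_000);
        }
    }

    // Usage at the call site, e.g.:
    //   CallTiming.time("saveLinks", () -> saveLinks(doc, context, true));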

Sounds like we would need a new configuration option to be able to switch off the async mechanism if needed, or to be able to choose the features for which it is switched on/off.
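
Something like the following could do it (a sketch; the xwiki.properties key names are made up for illustration):

    import java.util.List;

    import javax.inject.Inject;
    import javax.inject.Named;

    import org.xwiki.configuration.ConfigurationSource;

    class DocumentAnalysisConfiguration
    {
        @Inject
        @Named("xwikiproperties")
        private ConfigurationSource configuration;

        // Hypothetical keys:
        //   index.analysis.async=true|false (global switch)
        //   index.analysis.asyncDisabledKinds=mentions,links (per-feature opt-out)
        boolean isAsync(String kind)
        {
            boolean async = this.configuration.getProperty("index.analysis.async", true);
            List<String> disabled =
                this.configuration.getProperty("index.analysis.asyncDisabledKinds", List.<String>of());
            return async && !disabled.contains(kind);
        }
    }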

Note that after discussing with @tmortagne, I think that, by default, we can consider links/attachments indexing and mentions analysis to be non-critical: it is acceptable for some links to be missed during refactoring in case of lag in the processing queue.

+1 to have the option to allow admins to make a kind of task “critical” and hence blocking.

Link indexing in general, since it should be done by the same code; the only difference is the type of the entity reference. It would not make any sense to parse once for document links and again for attachment links.

I don’t understand. How is this different from a page move, where we use a background asynchronous job and the UI shows a nice progress bar?

I think the use cases are different: jobs provide progress information, live logs, and a question/answer system, which are not needed for tasks. OTOH, tasks are persisted and will be executed even if the server is restarted before some of them have run, which is not the case for jobs (unless they implement their own persistence mechanism).

What I don’t understand is why you need to block the move action. Maybe you also need to explain what you mean by blocking the move action.

From the point of view of the user, the move action is asynchronous, performed in a background job. AFAIU we can have two situations:

  • the backlinks table is up to date when the move operation is triggered → the backlinks can be updated right away
  • the backlinks table is not up to date → the move job has to wait before updating the links

But in both cases, from the user’s point of view, the move job does some work in the background, as shown in the progress log. So how exactly is the move job blocked?

What you propose works too, I agree. In this case, if some link indexing tasks are not finished yet during the move, we’d need to display some time estimate to the user; otherwise, they might wait an undetermined time before the job is actually done.