A more generic and multiwiki links storage

Hi everyone,

We are storing links found in documents and objects in a table to help with two things:

  • show them as information about the page
  • but more importantly, find the links to modify when renaming a page

The main limitation of the current storage for the issue I’m working on is that finding backlinks from other wikis is very slow, as we have to query every wiki.

But I would also like to take care of the following problems if I need to come up with a new link store:

  • the support for page links is a bit fuzzy (a document reference is arbitrarily chosen to represent it, but it might be the wrong one)
  • the table only supports documents as link sources and only documents and attachments as link targets; it would be interesting to support more types (especially for the source)
  • it does not have any information about the target locale (it’s possible to indicate a specific locale with the page reference syntax, for example)

I started working on several possible ideas, and you can see the details on Generic multiwiki links (Proposal.Genericmultiwikilinks) - XWiki.

In short, I came up with two different proposals:

  • a new, more generic database table stored in both the source wiki and the target wiki
  • a Solr core

In terms of backward compatibility, whatever choice we make for the new storage, I think the best option is to have a legacy platform extension which keeps populating the xwikilinks table the same way it currently does.

WDYT?

My preference goes to Solr right now for the following reasons:

  • way easier to support a lot more types of sources and targets, and even for the same types of links the storage is much simpler (mostly because we don’t have any limitation on the size of the references we index in it)
  • cross wiki by definition, so less duplication

But it does come with an important limitation too: it is impossible to write a single database query which combines link-related metadata with other things. So if someone has an existing use case like this one in mind, it would be interesting to discuss it and see whether it’s still doable with a Solr core.

What about storing the link information in the existing Solr-based search index, allowing queries like “all pages linking to A.B with content containing ‘XWiki’”? This could mitigate part of the problem of combining links with other kinds of data while also avoiding the need to create a new Solr core. We would need to trigger a re-indexing of the whole wiki after upgrading, though (and possibly modify the Solr core, I haven’t checked how it works). Introducing a more generic re-indexing mechanism might be a good idea anyway.
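To make that kind of combined query a bit more concrete, here is a minimal sketch of the request parameters it could translate to. `q` and `fq` are standard Solr parameters, but the `link_targets` field name, the `xwiki:A.B` reference format and both helper functions are assumptions for illustration only, and the free-text part relies on whatever default search field the core defines.

```python
def escape_phrase(value):
    """Escape backslashes and double quotes so the value can sit inside a quoted Solr phrase."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def backlink_search_params(target, text):
    """Build parameters for 'entries linking to `target` whose content matches `text`'.

    `link_targets` is an assumed field name, not one the current search core is known to have.
    """
    return {
        # Free-text part, matched against the core's default search field.
        "q": '"' + escape_phrase(text) + '"',
        # Filter part, restricting results to entries that link to the target.
        "fq": 'link_targets:"' + escape_phrase(target) + '"',
    }


params = backlink_search_params("xwiki:A.B", "XWiki")
print(params)
```

Putting the link condition in a filter query keeps it out of relevance scoring, which seems appropriate since linking to a page is a yes/no criterion.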

Independent of that, I’m +1 for Solr as I think this is exactly the kind of information we should be storing in Solr.

I don’t see any advantage of keeping this in the database unless there are very specific queries you cannot do in Solr, so -0 for the database table option.

I honestly didn’t even think about that since I was focusing on designing a generic link index, but if we decide to limit the supported source types to model entities (which makes sense anyway, as I cannot really think of any other use case), I guess it can work in theory. We would need to decide what we do with the current generic task indexing system, as it’s not really needed anymore if links are indexed as part of the general Solr search core indexing.

Adding that to the design page.

+1

I was actually going to react to:

by asking if it wouldn’t be possible to include more information in our Solr-based search index to allow performing the same kind of queries we can do in our SQL DB. I’m thinking for example about the xobject data: not sure if it is stored in the index.

But that could obviously be another improvement to be done later.

The search core contains an entry for each data model entity (document, object, property) so I doubt we need to add more.

+1 for using Solr.

As for using the existing Solr search core, it’s not clear to me what <index> means in link_<index>_target. Do you mean that for a document entry we would have:

  • link_1_target for the first link found in the document content / metadata (xobjects)
  • link_2_target for the second link found
  • etc.

How will we use this dynamic field in Solr queries?

If link_targets includes the parents then does it mean that a link from X.Y to A.B.C will appear listed in the back-links of A.B and A? i.e. the user will see that there is a link from X.Y to A.B and A?

Thanks,
Marius

It’s still a preliminary design, but yes, that’s the idea, since a given source can contain several links. The point of those fields is to be retrieved, not really to be used in a query.
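As a rough illustration of that numbering, assuming the index simply follows the order in which links are found and starts at 1, the fields of one source entry could be built like this. The `link_fields` helper, the `id` key and the reference strings are hypothetical; only the `link_<index>_target` naming comes from the design.

```python
def link_fields(targets):
    """Map each link target to its own numbered field, in discovery order."""
    fields = {}
    for index, target in enumerate(targets, start=1):
        fields["link_%d_target" % index] = target
    return fields


# Hypothetical entry for a document X.Y that contains two links.
entry = {"id": "xwiki:X.Y"}
entry.update(link_fields(["xwiki:A.B.C", "xwiki:Main.WebHome"]))
print(entry)
```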

The idea of that field is to find all entities that include links to a given entity at any granularity (you use it to search for all sources having a link to a given document you just moved, to a wiki you just deleted, etc.).
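A minimal sketch of that “any granularity” lookup, assuming references always carry a wiki prefix and ignoring escaped dots and colons in names. The functions and the in-memory `index` dictionary are stand-ins for illustration, not the actual indexing code.

```python
def target_with_ancestors(reference):
    """Return the target followed by each of its ancestors, up to the wiki.

    Assumes a 'wiki:Space.Page' style reference with no escaped separators.
    """
    wiki, _, path = reference.partition(":")
    parts = path.split(".")
    targets = [wiki + ":" + ".".join(parts[:i]) for i in range(len(parts), 0, -1)]
    targets.append(wiki)
    return targets


def backlinks(index, entity):
    """Find every source whose link_targets values contain the given entity."""
    return sorted(source for source, targets in index.items() if entity in targets)


# Stand-in for the link_targets field of two source entries.
index = {
    "xwiki:X.Y": target_with_ancestors("xwiki:A.B.C"),
    "xwiki:Other.Page": target_with_ancestors("xwiki:A.D"),
}
print(backlinks(index, "xwiki:A.B"))  # sources affected if A.B is moved
print(backlinks(index, "xwiki"))      # sources affected if the wiki is deleted
```

With this layout a single exact-match lookup is enough after a move or a deletion, at the cost of storing each target several times.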

I just slightly modified the schema for the Solr search core based proposal (since it seems to be the one we are headed to) to simplify it (no more dedicated locale metadata) with the following constraint: each type of link has to find a way to fully serialize what it needs to express in a single String.
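To show what that constraint could look like in practice, here is a sketch of a target that folds its type and optional locale into one String and can be parsed back. The `type:reference;locale=xx` format, the `LinkTarget` class and both functions are made up for this example and are not the format used in the proposal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LinkTarget:
    entity_type: str
    reference: str
    locale: Optional[str] = None


def serialize(target):
    """Fold type, reference and optional locale into a single string."""
    text = target.entity_type + ":" + target.reference
    if target.locale:
        text += ";locale=" + target.locale
    return text


def parse(text):
    """Inverse of serialize(); assumes the reference never contains ';locale='."""
    body, separator, locale = text.partition(";locale=")
    entity_type, _, reference = body.partition(":")
    return LinkTarget(entity_type, reference, locale if separator else None)


french_page = LinkTarget("document", "xwiki:A.B", "fr")
print(serialize(french_page))
print(parse(serialize(french_page)) == french_page)
```

A real format would also need an escaping rule for the separator characters, which this sketch leaves out.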

I believe we can always find a way to do that, and it’s a good thing anyway for various other use cases.

Furthermore, I also detailed this proposal a bit more, with examples.