Confluence Import and Nested Pages Migration for deep hierarchies/long names

We are attempting to import multiple large Confluence spaces into XWiki. Their hierarchies go several levels deep and many of the pages have very long names. Because of this we cannot execute the Nested Pages Migration for these spaces, as the resulting paths would exceed 255 characters. It would of course be possible to manually rename or reorganize the pages, but for hundreds of pages this gets very time-consuming, and I assume that other companies migrating their production spaces will run into the same issue. I therefore think it might be preferable to adapt the Confluence importer and/or the Nested Pages Migrator to truncate the page names (but not the page titles) automatically, especially since XWiki SAS will be providing the simple importer UI.

I have attempted to solve this by adding an option to the Confluence importer that truncates long page names. However, I am not sure this is the correct approach. With this code the page names are changed while the page titles are kept, but backlinks to the truncated pages break, since the links use the page title (this could probably be handled when resolving links). Unfortunately, the Nested Pages Migrator undoes the truncation for all parent pages: only the pages at the bottom of the hierarchy keep their truncated names, while all pages above them are renamed back to their original titles. I do not know whether I have overlooked some place where I should have specified the truncated name of the parent page, or whether the Nested Pages Migrator uses the page titles during migration. I tried looking into the migrator code but was not able to find the place where I would have to make changes.
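The change essentially boils down to something like this (a simplified sketch of the idea, not the actual commit; the class and method names are illustrative):

```java
// Simplified sketch of the truncation option (illustrative, not the actual
// commit): only the technical page name is shortened, the title is untouched.
public final class PageNameTruncation
{
    private PageNameTruncation()
    {
    }

    /** Cut the page name down to maxLength characters, keeping the title as is. */
    public static String truncateName(String pageName, int maxLength)
    {
        if (pageName.length() <= maxLength) {
            return pageName;
        }
        // Naive cut: this is exactly why duplicates and title-based backlinks
        // become a problem, as described above.
        return pageName.substring(0, maxLength).trim();
    }
}
```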

Is adding the functionality for automatically truncating page names an option for the devs or XWiki SAS (since they stated they would work on the importer and the Nested Pages Migrator)? Or is there a better way to make imports with long path names work? Asking potential users to rename and reorganize much of their data does not feel like a good option to me.

Thanks for your help!

BTW, is there a reason the most recent changes to the NPM haven’t been released? When using the version from GitHub I noticed a couple of things had changed (nothing solving the stated problem), but the last real commit was in 2017.

EDIT: The versions used, for clarification:
XWiki: 13.0
Confluence importer: 9.5.1 (from extension manager) and 9.5.2-SNAPSHOT (as in above commit)
Nested Pages migrator: 0.7 (from extension manager) and 0.8-SNAPSHOT (from the official repo)

Hello Vertganti,

given the different data models of XWiki and Confluence, providing an option to import shorter names from Confluence is definitely one way to go.

I have recently looked at this problem, and to me a valid option seems to be to use the Confluence page id as the page name instead of its title (while keeping, of course, the title from Confluence as the title in XWiki, so that it displays correctly in the page tree and everywhere else). This way the page names would be shorter, and hopefully short enough to then cleanly run the Nested Pages Migrator and reproduce the same page tree as in Confluence.
This would have an advantage over truncation because it would ensure the uniqueness of names and also make it easier to resolve links (which, as you noticed, can be an issue).
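To sketch the idea (purely illustrative, not importer code; the class and method names are made up):

```java
import java.util.HashMap;
import java.util.Map;

// Purely illustrative sketch (not importer code): the Confluence page id
// becomes the technical XWiki page name, the Confluence title is kept as the
// document title, and a lookup table helps resolve title-based links.
public final class IdBasedNaming
{
    // title -> id-based page name, filled while iterating the imported pages
    private final Map<String, String> nameByTitle = new HashMap<>();

    /** Register one imported page and return its short, unique page name. */
    public String register(long confluencePageId, String confluenceTitle)
    {
        String pageName = String.valueOf(confluencePageId);
        this.nameByTitle.put(confluenceTitle, pageName);
        return pageName;
    }

    /** Resolve a link that still points at a title to the id-based name. */
    public String resolveLink(String linkedTitle)
    {
        return this.nameByTitle.getOrDefault(linkedTitle, linkedTitle);
    }
}
```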

The way I see it, this would be an option of the Confluence migrator that users could choose when importing hierarchies for which they know they want to restore parents and for which the risk is high, but it would not be mandatory for the import.

Does this look like an acceptable solution to you, from the point of view of data storage, page names and page URLs in XWiki?
Note that this doesn’t mean I have the fix for it, but it would be a potentially “easier” fix, and I am interested in whether this would be acceptable for you as a user.

Thanks,
Anca

Just to clarify, this is also because we don’t have a universal measure of how short would be short enough: since, as far as I know, we don’t resolve full hierarchies at import time, we cannot know at import time what margin a truncation would have to leave to be sufficient (page ids may not be enough either, actually, but we have a better chance).

Hi Vertganti,

I’m in the middle of the same issue right now with a client on a large space.
I have also been struggling to decide what the best solution would be.
At this point I’ve implemented a separate tool that shortens page names, guarantees their unicity and does the shortening in a nice way. Currently it splits the title into words and keeps as many words as fit to stay under 30 characters.
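In rough form, the shortening has this shape (a simplified sketch, not the actual tool):

```java
import java.util.HashSet;
import java.util.Set;

// Simplified sketch of word-based shortening with unicity (not the actual
// tool): keep whole words while staying under the limit, then append a
// counter when two shortened names would collide.
public final class NameShortener
{
    private static final int MAX_LENGTH = 30;

    private final Set<String> used = new HashSet<>();

    public String shorten(String title)
    {
        StringBuilder result = new StringBuilder();
        for (String word : title.split("\\s+")) {
            int separator = result.length() == 0 ? 0 : 1;
            if (result.length() + separator + word.length() > MAX_LENGTH) {
                break;
            }
            if (separator == 1) {
                result.append(' ');
            }
            result.append(word);
        }
        if (result.length() == 0) {
            // A single word longer than the limit: hard cut.
            result.append(title, 0, Math.min(title.length(), MAX_LENGTH));
        }
        // Guarantee unicity by suffixing a counter on collisions (note the
        // suffix may push the name slightly over the limit).
        String base = result.toString();
        String candidate = base;
        for (int i = 2; !this.used.add(candidate); i++) {
            candidate = base + ' ' + i;
        }
        return candidate;
    }
}
```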

It’s not 100% successful yet, for a few reasons; one of them is that even at 30 characters per name, with a deep hierarchy there is no guarantee the full path won’t end up over 255.

To add to the problem, being under 255 does not guarantee that the Nested Pages Migrator will work, as there can be failures caused by the notifications module, which adds the wiki prefix to the name and also has a 255-character limit.
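To make that concrete: the value that has to fit in the column is the full serialized reference, wiki prefix included. A hypothetical pre-check (not actual NPM or notifications code) would be:

```java
// Hypothetical pre-check (not actual NPM or notifications code): the value
// that must fit in the 255-character column is the full serialized reference,
// e.g. "wiki:Space1.Space2.Page", so the wiki prefix counts too.
public final class ReferenceLengthCheck
{
    private static final int DB_LIMIT = 255;

    private ReferenceLengthCheck()
    {
    }

    public static boolean fitsWithWikiPrefix(String wiki, String... spacesAndPage)
    {
        int length = wiki.length() + 1; // the "wiki:" prefix
        for (String part : spacesAndPage) {
            length += part.length() + 1; // each part plus a "." separator
        }
        length--; // no separator after the last part
        return length <= DB_LIMIT;
    }
}
```

Something like `fitsWithWikiPrefix("xwiki", "Space1", "Space2", "Page")` could then flag problematic branches before running the migrator.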

I’ve seen the following solutions:

1/ Implement the smart truncation in the Confluence importer (using an option to specify the max length)

  • but we need to check for possible duplicates
  • we need to make sure all the places where a page name is needed implement it
    On the project we are currently working on, it was a bit risky to try this.

2/ Add an option to use document IDs instead of the titles for the page name

  • this has the issue of not allowing for “readable” page URLs

3/ Continue using a post-processing renaming module which allows changing the max length and rerunning if needed

In addition, we should look at extending the 255-character limit, as this issue shows that importing a large space around 10 levels deep brings you close to the 255 limit even with page IDs instead of page names.

Now, ideally the Confluence import should import directly into the page hierarchy, without needing the nested pages migration, AND solve the page length issue automatically, based on the capacity of XWiki (the 255-character page name limit, or higher ones).
To do this we would need an algorithm that measures, for each document, the number of levels in its hierarchy and then decides how much to restrict the page name length (the deeper the hierarchy underneath, the shorter the page name should be), while still keeping it under 30 or 40 characters in all cases, as that is what it takes to keep a reasonable URL even with 10 levels.
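As a sketch of what such an algorithm could compute (all constants here are assumptions for illustration, not actual XWiki values):

```java
// Sketch of a depth-aware name length budget (illustrative): the deeper the
// subtree below a page, the shorter its name must be, capped at a readable
// maximum. All constants are assumptions, not actual XWiki values.
public final class DepthAwareBudget
{
    private static final int PATH_LIMIT = 255;  // limit on the full reference
    private static final int READABLE_MAX = 30; // keep URLs readable
    private static final int OVERHEAD = 20;     // wiki prefix, separators, ...

    private DepthAwareBudget()
    {
    }

    /**
     * @param levelsAbove number of ancestors already placed above this page
     * @param levelsBelow number of hierarchy levels still to come underneath
     * @return the maximum length allowed for this page's name
     */
    public static int maxNameLength(int levelsAbove, int levelsBelow)
    {
        // Split the remaining budget evenly over this page and its subtree.
        int budget = (PATH_LIMIT - OVERHEAD) / (levelsAbove + levelsBelow + 1);
        return Math.max(1, Math.min(READABLE_MAX, budget));
    }
}
```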

At this point I can share the tool I’ve been coding to reduce page names. If you contact me on matrix (ldubost:matrix.xwiki.com), I can send you a XAR to test it on your side.

Ludovic

Thanks lucaa and ludovic for the answers!

I noticed this as well: I got SQL errors during migration plan execution even when the NPM no longer showed errors after calculating the plan. The NPM version from GitHub offers functionality to alter page titles in the plan and to save migration projects, which made it possible to migrate a space with only a few too-long names after some reruns. (However, slashes that the importer had removed from page names reappeared after migration. This causes problems with our reverse proxy setup, but that is a different issue.)

So for the solution I tried (1/) there is the mentioned problem that the NPM recreates the full names for any pages not at the bottom of the hierarchy. Would this be circumvented by solution 2/ (meaning it could be handled in the Confluence importer)? Or is this an issue of the NPM? In any case, solution 2/ seems better than my truncation approach (1/), as different pages with the same name would be properly distinguished, although it completely eliminates URL readability.

I assume the workflow for your solution (3/) is: 1. Confluence importer; 2. renaming module; 3. NPM. How does your post-processing module fare when followed by the nested pages migration? Does it keep the short names, and how does it achieve this? I am asking so I can understand a bit more about what is going on in the background and what I might be missing in my approach.

Sounds great, I will reach out on matrix soon.

I’m guessing the limit of 255 stems from file system/OS path limits? Or does this have a different reason?

Thanks again!

Not really, this is just the size of the varchar that was chosen a long time ago because it was the maximum size an indexed value could have on MySQL.