How should we repair job status filenames

Hi everyone,

in 16.10.0RC1, as part of XCOMMONS-2387, the storage path of job statuses has been changed to use a combination of base64 encoding and URL encoding, whereas before we only had URL encoding. No migration has been performed as part of this change, so if you upgraded from before 16.10.0RC1, you’ll have the following issues now:

  1. Job statuses created before 16.10.0RC1 cannot be read anymore (see XCOMMONS-3275).
  2. Some job statuses now exist both under the old and the new location if a job with the same ID has been executed both before and after upgrading (this is common for extension jobs).

In addition to that, the new implementation isn’t bug-free, see XCOMMONS-3276.

Further, I found that the repair feature that is meant for changes in the job storage path structure that we have so far neither properly deals with jobs that exist both at the old and the new location nor with job logs that are stored outside the job status (so it’s kind of good we didn’t trigger it). @tmortagne further pointed out that loading all job status - as the repair feature does - might be slow.

Proposal 1: Reverting to the naming

To me, the base64-based encoding provides no real advantages over the URL encoding but has the significant disadvantage that it makes it very hard for a human to locate a job log in the file system. Therefore, as first proposal, I suggest that we revert the job storage to almost the old code except for a fix for long job IDs (but without base64 encoding) and some special characters.

Where I’m not sure is if we should use this occasion to also adjust the encoding to deal with case-insensitive file systems by, e.g., encoding all uppercase characters (lowercase characters seem very common, so it would be bad to encode them).

Proposal 2: Move existing job logs on first start

As part of a pull request, I’ve already proposed the following fixes:

  1. Fix the repair to also move job logs.
  2. Implement a streaming parser of the job status that extracts just the ID at least for job statuses that use the abstract base classes that we provide. All other job statuses would still be fully loaded.
  3. Handle the case where the job log exists both in the old and the new location by keeping the more recently modified file.

The advantage of this repair is that all job logs will be in the correct location again, and we clean up all base64-encoded job status files. However, I’m not particularly fond of the huge parser just for extracting the ID. Further, the repair step where we might move many job logs and possibly delete some of them sounds dangerous.

Proposal 3:

Instead of doing the repair, I’m suggesting a second solution:

  1. Keep implementations for computing all possible locations for job statuses. We might not do this for the oldest one (for which we have a repair that apparently worked), but at least for the one before 16.10.0RC1, the one introduced in 16.10.0RC1, and the new one.
  2. When loading a job status, when the job status cannot be found, check iteratively if it can be found in one of the older locations. If it can be found, move it to the current location.

The advantage of this is that there is no need for an expensive repair check or a huge chunk of code for reading IDs as we only ever read the job status that we would read, anyway. Further, as we don’t delete anything (just move the job status) the risk for data loss seems much lower.

The big disadvantage of this solution is that as the new encoding is so similar to the one before 16.10.0RC1, it is likely that if a job with the same ID ran both before and after upgrading to 16.10.0RC1, we will return the one from before the upgrade. We could mitigate this by intentionally introducing a difference in the naming, but I’m not sure which one we should introduce.

Do you have any opinions which out of 2 or 3 we should choose? Do you agree with proposal 1?

I confirm it was a pain for me to locate the PDF export job logs (I had to rely on the last modification date). So +1 for this.

I prefer 3. It looks simpler and less dangerous.

Thanks,
Marius

Thanks for working on this!

+1

Both your proposals only makes sense for upgrades AFAIU and not for new instances. So it’s new code to deal with that, and it might only be interesting for the few people that wants to access old logs. I’m wondering if the simplest solution wouldn’t be to introduce your code from proposal 2 in some admin tooling such as the Admins Tools Application. Other option could be to provide proposal 3 in XS but with a configuration switch: by default we don’t enable it, but we document and allows admins to enable it if needed.

+1

I’m fine with both 2 and 3. 3 is better for performances, but 2 is easier to maintain on the long run (the same migration code works for any future format without changing anything, which is why I did it like this for the first migration). I guess I would be pushing more for 2 if it was not so painful to extract the id from the XML.

One possibility could be to start prefixing the path with the version of specification. So all the new jobs would end up in something like <data>/data/jobs/status/3. It’s slightly annoying to have one more level, but it should not make the total path much longer, and it makes it much more explicit how one should read the folders (I’m sure there will be other changes in the futures). It’s actually interresting for both proposals, as in the case of Proposal 2 it reduces the folders to look at (for example when XWiki stops before the migration is fully finished, and it has to run it again, I guess that also means we would not need store.properties anymore).

The same can be said of any migration, and I think we want to make upgrades as smooth as possible. None of the proposals are complex enough to justify this, IMO.

I’m not convinced by that, the problem being that there might be job statuses we cannot read because they contain classes that aren’t available in the (current) class loader and so it is hard to move them to the new structure, and we would read them again and again on every start.

I think this is actually another argument for proposal 3 as at the moment where some code actually requests a job status, it is much more likely that all required classes are available (because the request is e.g., from some code in the subwiki that has the required extension installed).

Based on the feedback, I think we should go with proposal 3. This leaves two option questions:

  1. Should we add a prefix to the path, like the 3 suggested by @tmortagne such that we have <data>/job/status/status/3/…?
  2. Should we encode uppercase characters to make job statuses case-sensitive on case-insensitive file systems? I doubt this is an issue with real jobs but in theory it could be an issue, in particular with user-controlled job IDs.

Thank you very much for your feedback!

+1

+1 for bulletproofing

It’s not clear to me why status is doubled here, just a mistake ?

Sorry, yes, it’s a mistake, I meant data/jobs/status/3 as you proposed.