Hi everyone,
in 16.10.0RC1, as part of XCOMMONS-2387, the storage path of job statuses has been changed to use a combination of base64 encoding and URL encoding, whereas before we only had URL encoding. No migration has been performed as part of this change, so if you upgraded from before 16.10.0RC1, you’ll have the following issues now:
- Job statuses created before 16.10.0RC1 cannot be read anymore (see XCOMMONS-3275).
- Some job statuses now exist both under the old and the new location if a job with the same ID has been executed both before and after upgrading (this is common for extension jobs).
In addition to that, the new implementation isn’t bug-free, see XCOMMONS-3276.
Further, I found that the existing repair feature, which is meant to handle changes in the job storage path structure, properly deals neither with jobs that exist at both the old and the new location nor with job logs that are stored outside the job status (so it’s kind of good we didn’t trigger it). @tmortagne further pointed out that loading all job statuses - as the repair feature does - might be slow.
Proposal 1: Reverting to the old naming
To me, the base64-based encoding provides no real advantages over the URL encoding but has the significant disadvantage that it makes it very hard for a human to locate a job log in the file system. Therefore, as a first proposal, I suggest that we revert the job storage to almost the old code, except for fixes for long job IDs (without base64 encoding) and for some special characters.
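To make the readability argument concrete, here is a small Java sketch that compares the two styles for a made-up job ID element. The class and method names are hypothetical, and it only uses the JDK's `URLEncoder` and `Base64`; it is not the actual XWiki storage code, whose exact scheme may differ.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Illustration only: NOT the actual XWiki job storage code.
class JobIdEncodingComparison
{
    // URL-style encoding: letters and digits stay readable in the file system.
    static String urlEncode(String idElement)
    {
        return URLEncoder.encode(idElement, StandardCharsets.UTF_8);
    }

    // URL-safe base64 of the UTF-8 bytes: cannot be read without decoding.
    static String base64Encode(String idElement)
    {
        return Base64.getUrlEncoder().withoutPadding()
            .encodeToString(idElement.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args)
    {
        System.out.println(urlEncode("extension"));    // extension
        System.out.println(base64Encode("extension")); // ZXh0ZW5zaW9u
    }
}
```

With the URL-style name, somebody browsing the job directory can still spot the `extension` folder directly, which is not the case for the base64 form.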
What I’m not sure about is whether we should use this occasion to also adjust the encoding to deal with case-insensitive file systems by, e.g., encoding all uppercase characters (lowercase characters seem much more common in job IDs, so it would be bad to encode them).
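As a rough idea of what encoding uppercase characters could look like, here is a minimal Java sketch. The class name and the choice to percent-encode the ASCII letters `A` to `Z` are my assumptions for illustration, not an existing XWiki API or a settled design.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical sketch: URL encoding that additionally escapes ASCII uppercase letters.
class CaseSafeJobIdEncoder
{
    static String encode(String idElement)
    {
        StringBuilder result = new StringBuilder();
        idElement.codePoints().forEach(cp -> {
            if (cp >= 'A' && cp <= 'Z') {
                // Escape uppercase ASCII letters so that "Job" and "job" get different names.
                result.append('%').append(Integer.toHexString(cp).toUpperCase(Locale.ROOT));
            } else {
                // Everything else goes through regular URL encoding (non-ASCII is escaped anyway).
                result.append(URLEncoder.encode(Character.toString(cp), StandardCharsets.UTF_8));
            }
        });
        return result.toString();
    }

    // The result is still valid URL encoding, so the standard decoder restores the ID.
    static String decode(String encoded)
    {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args)
    {
        System.out.println(encode("Job")); // %4Aob
        System.out.println(encode("job")); // job
    }
}
```

In this sketch, uppercase letters only ever appear as hex digits directly after a `%`, so two different IDs never produce names that differ only by case.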
Proposal 2: Move existing job logs on first start
As part of a pull request, I’ve already proposed the following fixes:
- Fix the repair to also move job logs.
- Implement a streaming parser of the job status that extracts just the ID at least for job statuses that use the abstract base classes that we provide. All other job statuses would still be fully loaded.
- Handle the case where the job log exists both in the old and the new location by keeping the more recently modified file.
The advantage of this repair is that all job logs will be in the correct location again, and we clean up all base64-encoded job status files. However, I’m not particularly fond of the huge parser just for extracting the ID. Further, the repair step where we might move many job logs and possibly delete some of them sounds dangerous.
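To show where the risk in the third fix comes from, here is a minimal Java sketch of the "keep the more recently modified file" rule. The class and method names are hypothetical and this is not the code from the pull request; it only uses `java.nio.file.Files`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.attribute.FileTime;

// Hypothetical sketch, not the code from the pull request.
class JobLogMerge
{
    // Ensures that only newFile remains, holding the more recently modified content.
    static void keepMostRecent(Path oldFile, Path newFile)
    {
        try {
            if (!Files.exists(oldFile)) {
                return; // Nothing to migrate.
            }
            if (!Files.exists(newFile)) {
                Path parent = newFile.getParent();
                if (parent != null) {
                    Files.createDirectories(parent);
                }
                Files.move(oldFile, newFile);
                return;
            }
            FileTime oldTime = Files.getLastModifiedTime(oldFile);
            FileTime newTime = Files.getLastModifiedTime(newFile);
            if (oldTime.compareTo(newTime) > 0) {
                // The old location is more recent: it overwrites the new one.
                Files.move(oldFile, newFile, StandardCopyOption.REPLACE_EXISTING);
            } else {
                // The new location wins: the old file is deleted (this is the risky part).
                Files.delete(oldFile);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Small self-contained demo using a temporary directory.
    static String demo()
    {
        try {
            Path dir = Files.createTempDirectory("jobs");
            Path oldFile = Files.writeString(dir.resolve("old.log"), "before upgrade");
            Path newFile = Files.writeString(dir.resolve("new.log"), "after upgrade");
            Files.setLastModifiedTime(oldFile, FileTime.fromMillis(1_000_000L));
            Files.setLastModifiedTime(newFile, FileTime.fromMillis(2_000_000L));
            keepMostRecent(oldFile, newFile);
            return Files.readString(newFile) + "|" + Files.exists(oldFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In both branches where the two files exist, one of them is lost for good, which is why a bug in the location computation or an unreliable modification date could lead to data loss.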
Proposal 3: Move job statuses from old locations on load
Instead of doing the repair, I’m suggesting a second solution:
- Keep implementations for computing all possible locations for job statuses. We might not do this for the oldest one (for which we have a repair that apparently worked), but at least for the one before 16.10.0RC1, the one introduced in 16.10.0RC1, and the new one.
- When loading a job status that cannot be found at the current location, check the older locations one by one. If it is found in one of them, move it to the current location.
The advantage of this is that there is no need for an expensive repair check or a huge chunk of code for reading IDs, as we only ever read the job status that we would read anyway. Further, as we don’t delete anything (just move the job status), the risk of data loss seems much lower.
The big disadvantage of this solution is that as the new encoding is so similar to the one before 16.10.0RC1, it is likely that if a job with the same ID ran both before and after upgrading to 16.10.0RC1, we will return the one from before the upgrade. We could mitigate this by intentionally introducing a difference in the naming, but I’m not sure which one we should introduce.
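The lookup from this proposal could be sketched roughly as follows in Java. All names here (`JobStatusLocator`, the `legacy` and `current` directories, `status.xml` as file name in the demo) are made up for illustration; the real implementation would plug in the actual path computations of the different versions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical sketch of the fallback lookup, not actual XWiki code.
class JobStatusLocator
{
    private final Function<String, Path> current;

    private final List<Function<String, Path>> legacy;

    JobStatusLocator(Function<String, Path> current, List<Function<String, Path>> legacy)
    {
        this.current = current;
        this.legacy = legacy;
    }

    // Returns the status file at the current location, moving it there from an older one if needed.
    Optional<Path> locate(String jobId)
    {
        Path target = this.current.apply(jobId);
        if (Files.exists(target)) {
            return Optional.of(target); // Older locations are never checked in this case.
        }
        for (Function<String, Path> scheme : this.legacy) {
            Path candidate = scheme.apply(jobId);
            if (Files.exists(candidate)) {
                try {
                    Path parent = target.getParent();
                    if (parent != null) {
                        Files.createDirectories(parent);
                    }
                    Files.move(candidate, target); // Move only, nothing is deleted.
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
                return Optional.of(target);
            }
        }
        return Optional.empty();
    }

    // Small self-contained demo using a temporary directory.
    static boolean demo()
    {
        try {
            Path root = Files.createTempDirectory("jobstore");
            Function<String, Path> legacyScheme = id -> root.resolve("legacy").resolve(id).resolve("status.xml");
            Function<String, Path> currentScheme = id -> root.resolve("current").resolve(id).resolve("status.xml");
            Path legacyFile = legacyScheme.apply("myjob");
            Files.createDirectories(legacyFile.getParent());
            Files.writeString(legacyFile, "<status/>");

            JobStatusLocator locator = new JobStatusLocator(currentScheme, List.of(legacyScheme));
            Optional<Path> found = locator.locate("myjob");
            return found.isPresent() && found.get().equals(currentScheme.apply("myjob"))
                && Files.exists(found.get()) && !Files.exists(legacyFile)
                && locator.locate("missing").isEmpty();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The early return for a file at the current location is exactly where the disadvantage shows up: if the current path already holds a status from before the upgrade, a more recent status stored under another scheme is never considered.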
Do you have an opinion on which of proposals 2 and 3 we should choose? Do you agree with proposal 1?