Hi everyone,
Currently, job statuses are stored in the permanent directory. This has several problems, in particular in clustered setups:
- Job status data and logs are only available on the node where a job was started, but for many jobs it makes sense to also have them available on other nodes - there is no reason the logs of a page move should only be available on one cluster node. In the future, a page move job could even be executed on a different cluster node than the node that displays the job progress to the user, e.g., with long-running nodes that execute background jobs and short-lived nodes that only handle servlet requests and can be added or removed to absorb surges in user activity.
- There is no easy or efficient way to build admin UIs for past jobs, as we can’t easily enumerate and filter jobs.
Additionally, we’ve been asked to work on removing permanent file storage to make deployments in Kubernetes easier, and job statuses and their logs are an important remaining part of the permanent file storage.
Summary
The short version of my proposal is to store key metadata and the full logs of every job in the database of the main wiki, and to store the full serialized job status in the configured blob store. For the implementation, I propose overriding the default JobStatusStore implementation in xwiki-commons with a new implementation in xwiki-platform, refactoring the code as necessary to reduce duplication. I also plan to introduce the concept of a job being either global to all cluster nodes or local to a single cluster node.
Storage Location
For the long version, let’s consider the different options for storing job statuses and, in particular, job logs:
- Blob store. For the job log, we need to append a new log entry every time the job logs a new message. The problem is that we can’t really append data in Amazon S3: every time we want to append a log entry, we would basically need to copy the whole log. We could simulate appending while the job runs by starting a multipart upload and keeping it open, incrementally uploading the log file in parts of 5MB and only committing the upload, e.g., every 10 minutes, thereby limiting this “copy the whole data” operation to once every 10 minutes (a rough sketch follows right after this list). This has several problems, though:
- Data is only saved every 10 minutes (or whatever interval we choose), leading to potential data loss when an instance crashes during that time.
- Data only becomes visible to other cluster members after it has been persisted, so also only very infrequently.
- Synchronization of parallel writes to the job log would be difficult/would need to be handled explicitly and outside the blob store.
- Database. Databases can efficiently handle frequent updates and queries. They also support auto-incrementing fields and transactions, so if a job is executed by several threads, all of them can add log entries in parallel with appropriate locking/serialization guarantees. Databases are less suitable for huge data or flexible fields, which is why I’m proposing to keep the main serialized job status, which could be big, outside the database. Columns would be limited to standard metadata of the job status and of the job log entries; anything beyond that would be XML-serialized, in the blob store for the job status and in a database column for the log entry (see the schema sketch below).
- Solr. Solr would be ideal for the searchability of logs. It would also be able to handle dynamic fields, possibly allowing us to store non-standard metadata for the jobs themselves. However, Solr doesn’t really provide relational operations, making it not obvious how to store both job logs and job statuses. Nested documents could be a way to nest logs inside the job status, but they would require re-indexing the whole job status, including all nested logs, for every added log entry. Further, frequent updates/commits can make Solr slow as it isn’t optimized for this use case, so we would need to balance performance against how quickly logs are stored and become available for reading. Support for locking and atomic updates is also very limited, making it challenging to properly synchronize writes in case a job is executed by several threads/cluster nodes.
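Coming back to the blob store option, here is the rough sketch announced above of how appending could be simulated with multipart uploads, assuming the AWS SDK for Java v2 (bucket, key and buffering are made up, and real code would go through whatever blob store abstraction we end up using rather than talking to S3 directly):

    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;

    import software.amazon.awssdk.core.sync.RequestBody;
    import software.amazon.awssdk.services.s3.S3Client;
    import software.amazon.awssdk.services.s3.model.CompleteMultipartUploadRequest;
    import software.amazon.awssdk.services.s3.model.CompletedMultipartUpload;
    import software.amazon.awssdk.services.s3.model.CompletedPart;
    import software.amazon.awssdk.services.s3.model.CreateMultipartUploadRequest;
    import software.amazon.awssdk.services.s3.model.UploadPartRequest;
    import software.amazon.awssdk.services.s3.model.UploadPartResponse;

    public class S3LogAppendSketch
    {
        public static void main(String[] args)
        {
            String bucket = "job-logs"; // made-up bucket
            String key = "refactoring/move/42.log"; // made-up key

            try (S3Client s3 = S3Client.create()) {
                // Start a multipart upload and keep it open while the job is running.
                String uploadId = s3.createMultipartUpload(
                    CreateMultipartUploadRequest.builder().bucket(bucket).key(key).build()).uploadId();

                List<CompletedPart> parts = new ArrayList<>();

                // Every part except the last one must be at least 5MB, so log entries need to be
                // buffered until 5MB have accumulated before a part can be uploaded.
                byte[] bufferedLogEntries = "a buffered batch of log entries\n".getBytes(StandardCharsets.UTF_8);
                UploadPartResponse part = s3.uploadPart(
                    UploadPartRequest.builder().bucket(bucket).key(key).uploadId(uploadId).partNumber(1).build(),
                    RequestBody.fromBytes(bufferedLogEntries));
                parts.add(CompletedPart.builder().partNumber(1).eTag(part.eTag()).build());

                // Only this call makes the log visible and durable in S3. To "append" more entries
                // afterwards, a new multipart upload has to be started that copies the existing
                // object again, which is the "copy the whole data" problem described above.
                s3.completeMultipartUpload(CompleteMultipartUploadRequest.builder()
                    .bucket(bucket).key(key).uploadId(uploadId)
                    .multipartUpload(CompletedMultipartUpload.builder().parts(parts).build())
                    .build());
            }
        }
    }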
Overall, I think the database is the best option we currently have. If we later work on an admin UI for searching, e.g., job logs, I think we could still add a Solr index in addition to the database for better search performance.
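To make the database option a bit more concrete, here is a rough sketch of the two tables I have in mind, written as JPA-style entities purely for illustration (all class, table and column names are made up, and whether we would map this via annotations or hbm.xml is an open implementation detail):

    // JobStatusRecord.java - standard metadata of a job status; the full XML-serialized
    // status lives in the configured blob store and is only referenced from here.
    import java.util.Date;

    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.Id;
    import javax.persistence.Table;

    @Entity
    @Table(name = "jobstatus")
    public class JobStatusRecord
    {
        @Id
        @Column(name = "JS_JOB_ID")
        private String jobId;

        @Column(name = "JS_JOB_TYPE")
        private String jobType;

        @Column(name = "JS_STATE")
        private String state;

        @Column(name = "JS_START_DATE")
        private Date startDate;

        @Column(name = "JS_END_DATE")
        private Date endDate;

        // Reference to the full XML-serialized job status in the blob store.
        @Column(name = "JS_BLOB_REFERENCE")
        private String blobReference;
    }

    // JobLogEntryRecord.java - one row per log entry; the full entry is XML-serialized
    // directly in the database.
    import javax.persistence.Column;
    import javax.persistence.Entity;
    import javax.persistence.GeneratedValue;
    import javax.persistence.GenerationType;
    import javax.persistence.Id;
    import javax.persistence.Lob;
    import javax.persistence.Table;

    @Entity
    @Table(name = "joblog")
    public class JobLogEntryRecord
    {
        // Auto-incrementing ID, which also preserves the order of the log entries.
        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY)
        @Column(name = "JL_ID")
        private long id;

        @Column(name = "JL_JOB_ID")
        private String jobId;

        @Column(name = "JL_LEVEL")
        private String level;

        @Column(name = "JL_MESSAGE")
        private String message;

        // Full XML-serialized log entry (arguments, marker, throwable, etc.).
        @Lob
        @Column(name = "JL_SERIALIZED")
        private String serialized;
    }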
Location of the Implementation
The implementation of the job status store is currently in xwiki-commons. Unfortunately, xwiki-commons doesn’t have any APIs for database or Solr support, and it feels wrong to me to remove the support for storing job logs from xwiki-commons. For this reason, I suggest that we re-implement JobStatusStore in xwiki-platform. From what I’ve seen so far, this API is the right level of abstraction for having two different implementations targeting the file system and the database, respectively. There might be some parts that could be shared, e.g., regarding serialization, but it should be easy to refactor those parts into components that can be shared between the implementations.
From my point of view, the plan should be to keep the current implementation indefinitely and to just add the new implementation in xwiki-platform. We would also use the current implementation for migration.
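Regarding the shared parts, what I have in mind is roughly something like the following (a hypothetical role, just to illustrate the kind of refactoring; the existing serialization code in xwiki-commons would be moved behind such a component so that both implementations can use it):

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.xwiki.component.annotation.Role;
    import org.xwiki.job.event.status.JobStatus;

    /**
     * Hypothetical shared role for the XML (de)serialization of job statuses, extracted from the
     * current file-based implementation so that both the file system store in xwiki-commons and
     * the new database/blob store in xwiki-platform can reuse it.
     */
    @Role
    public interface JobStatusSerializer
    {
        void serialize(JobStatus status, OutputStream output) throws IOException;

        JobStatus deserialize(InputStream input) throws IOException;
    }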
Migration
My current idea for the migration of existing data is the following:
- A migration would trigger a background job/task to migrate all job statuses and their logs. We would try deserializing every job status, with a size limit. For a job status that is too big or fails to deserialize, we would try parsing it with a streaming XML parser to extract the job ID and potentially other metadata. For job statuses that still fail to be parsed, I see two options:
- We could try extracting the job ID from the file name, and use this (potentially partial) job ID for the database entry. Such incomplete entries would be marked and could potentially be retried later.
- We don’t migrate job statuses where we can’t at least extract some basic metadata. The migration could be triggered again later, e.g., from some admin UI, after we’ve either improved the migration or the errors have been fixed (e.g., an extension that provides a serialized type has been installed again).
- When a job status cannot be found in the database, the new implementation falls back to the existing job status store. If the job status can be found there, it is migrated immediately before being returned. Locking is used to prevent the background migration job from migrating the same job status at the same time (a rough sketch of this read path follows below).
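To make the fallback behavior more concrete, here is a self-contained sketch of the read path (all class and method names are made up; the real code would of course implement the actual JobStatusStore API and share the lock with the background migration job):

    import java.util.List;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ConcurrentMap;
    import java.util.concurrent.locks.ReentrantLock;

    import org.xwiki.job.event.status.JobStatus;

    // Made-up sketch of the read-through migration, not actual code.
    public class ReadThroughMigrationSketch
    {
        // Hypothetical stand-ins for the new database/blob store backed storage and for the
        // existing file-based store that is used as fallback.
        public interface NewStorage
        {
            JobStatus load(List<String> id);

            void save(JobStatus status);
        }

        public interface OldStorage
        {
            JobStatus load(List<String> id);
        }

        private final NewStorage newStorage;

        private final OldStorage oldStorage;

        // One lock per job ID; the background migration job would acquire the same lock so that
        // the same job status is never migrated twice concurrently. A real implementation would
        // need to clean up these locks (or use striped locks) instead of letting the map grow.
        private final ConcurrentMap<List<String>, ReentrantLock> locks = new ConcurrentHashMap<>();

        public ReadThroughMigrationSketch(NewStorage newStorage, OldStorage oldStorage)
        {
            this.newStorage = newStorage;
            this.oldStorage = oldStorage;
        }

        public JobStatus getJobStatus(List<String> id)
        {
            JobStatus status = this.newStorage.load(id);
            if (status != null) {
                return status;
            }

            ReentrantLock lock = this.locks.computeIfAbsent(id, key -> new ReentrantLock());
            lock.lock();
            try {
                // Re-check: the background migration might have migrated it in the meantime.
                status = this.newStorage.load(id);
                if (status == null) {
                    // Fall back to the existing job status store and migrate on the fly.
                    status = this.oldStorage.load(id);
                    if (status != null) {
                        this.newStorage.save(status);
                    }
                }
                return status;
            } finally {
                lock.unlock();
            }
        }
    }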
Clustering and Job IDs
The current implementation is completely unaware of clustering: each job status and its logs are only available on the cluster node where the job was started. Many job IDs are supposed to be globally unique, but from what I understand there are also jobs, like extension installation jobs, that exist on every cluster node with the same ID. This caught me by surprise a few days ago, as so far I had somehow assumed that job IDs would be globally unique. Some ideas:
- For every job, store the cluster node on which it was created and add the cluster node as an attribute in the job status. When querying a job, first search for the job on the current cluster node and then for the job on any other cluster node. We would need to check that this doesn’t break any code that first checks for an existing job status before creating a new one when the job is supposed to be local to the cluster node. We should probably add a new API to explicitly query for a job of a certain cluster node or of the current cluster node.
- Explicitly differentiate between global and local jobs with a new attribute in the job status/request(?). For backwards compatibility, assume that jobs are local by default. Similar to the first idea, we would store the cluster node on which a job is created, but only for local jobs. The existing API would only query for jobs of the current cluster node or global jobs. We would add new APIs to explicitly query jobs of a different cluster node. A question I don’t know how to answer is how we should migrate existing job status data, in particular for jobs that aren’t bundled and thus can’t provide the information whether they are global at migration time. Maybe it is okay to just store them as local jobs, as they’re currently only available on one cluster node anyway?
I think I prefer option 2 but option 1 might be a bit more backwards compatible for extensions that provide jobs that should be global.
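To make option 2 a bit more concrete, this is roughly the kind of API addition I’m thinking of (all names are hypothetical and up for discussion):

    import java.util.List;

    import org.xwiki.job.event.status.JobStatus;

    // Hypothetical additions for option 2, nothing like this exists yet.
    public interface ClusterAwareJobStatusStore
    {
        /**
         * @return the status of the given job if it is either global or local to the current cluster
         *         node (this is what the existing API would keep returning)
         */
        JobStatus getJobStatus(List<String> id);

        /**
         * @return the status of the given local job as seen by the given cluster node
         */
        JobStatus getJobStatus(List<String> id, String clusterNodeId);

        /**
         * @return {@code true} if the given job is global to the whole cluster, {@code false} if it is
         *         local to a single node (the default for backwards compatibility)
         */
        boolean isGlobal(List<String> id);
    }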
Thank you for reading until here! I’m happy to hear any kind of feedback, but if you don’t have the time to comment on all these details, I would be most interested if you agree with the general direction of storing job status data and logs in the database.