Moving job statuses and logs out of the permanent file storage

Hi everyone,

Currently, job statuses are stored in the permanent directory. This has several problems, in particular in clustered setups:

  • Job status data and logs are only available on the node where a job was started, but for many jobs it makes sense to also have them available on other nodes: there is no reason why the logs of a page move should only be available on one cluster node. In the future, we could imagine, for example, that a page move job is executed on a different cluster node than the node that displays the job progress to the user, with long-running nodes that execute background jobs and short-lived nodes that only handle servlet requests (but no background jobs) and can be added or removed to handle surges in user activity.
  • There is no easy/efficient way to build admin UIs for past jobs, as we can’t easily enumerate and filter jobs.

Additionally, we’ve been asked to work on removing the permanent file storage to make deployments in Kubernetes easier, and job statuses and their logs are an important remaining part of the permanent file storage.

Summary

The short version of my proposal is to store key metadata and the full logs of every job in the database of the main wiki, and to store the full serialized job status in the configured blob store. For the implementation, I propose overriding the default JobStatusStore implementation in xwiki-commons with a new implementation in xwiki-platform, refactoring the code as necessary to reduce duplication. I also plan to introduce the concept of a job being either global to all cluster nodes or local to a cluster node.

Storage Location

For the long version, let’s consider the different options, in particular for storing job logs:

  1. Blob store. For the job log, we need to append a new log entry every time the job logs a new message. The problem is that we can’t really append data in Amazon S3: every time we want to append a log entry, we would basically need to copy the whole log. We could simulate appending while the job runs by starting an upload and keeping the upload buffer open: we could incrementally create the log file in 5 MB parts and only commit the upload, e.g., every 10 minutes, thereby limiting this “copy the whole data” to once every 10 minutes. This has several problems, though:
    1. Data is only saved every 10 minutes (or whatever interval we choose), leading to potential data loss when an instance crashes during that time.
    2. Data is only visible to other cluster members after it has been persisted, so also very infrequently.
    3. Synchronization of parallel writes to the job log would be difficult/would need to be handled explicitly and outside the blob store.
  2. Database. Databases can efficiently handle frequent updates and queries. They also support auto-incrementing fields and transactions, ensuring that if a job is executed by several threads, all of them can add log entries in parallel with appropriate locking/serialization guarantees. Databases are less suitable for huge data or flexible fields, which is why I’m proposing to keep the main serialized job status, which could be big, outside the database. Further, the columns would be limited to the standard metadata of the job status and the job log entries, with the full metadata being XML-serialized in the blob store for the job status and in the database for the log entry.
  3. Solr. Solr would be ideal for the searchability of logs. It would also be able to handle dynamic fields, possibly allowing us to store non-standard metadata for the jobs themselves. However, Solr doesn’t really provide relational operations, making it not so obvious how to store both job logs and job statuses. Nested documents could be a solution to nest logs inside the job status, but they would require re-indexing the whole job status, including all nested logs, for every added job log entry. Further, frequent updates/commits can make Solr slow as it isn’t optimized for this use case, so we would need to balance performance against how quickly logs are stored/available for reading. Support for locking and atomic updates is also very limited, making it challenging to properly synchronize writes in case a job is executed by several threads/cluster nodes.

Overall, I think the database is the best option we currently have. If we later work on an admin UI for searching, e.g., job logs, I think we could still add a Solr index for better search performance in addition to the database.
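
To make the database option (2.) a bit more concrete, here is a rough sketch of the kind of rows I have in mind, written as plain Java classes (all names are invented for illustration, and I’m ignoring how the mapping would actually be declared):

    // Sketch only: one record per job status, with the standard metadata as columns...
    class JobStatusRecord
    {
        String jobId;                     // serialized job ID, used for lookups and filtering
        String jobType;                   // job/request type, useful for an admin UI
        String state;                     // RUNNING, FINISHED, ...
        java.util.Date startDate;
        java.util.Date endDate;
        String serializedStatusReference; // points to the full XML-serialized status in the blob store
    }

    // ...and one record per log entry, with a per-job position to preserve ordering under parallel writes.
    class JobLogRecord
    {
        String jobId;                     // references the JobStatusRecord above
        long position;                    // auto-incremented position within the job log
        String level;                     // TRACE/DEBUG/INFO/WARN/ERROR
        java.util.Date date;
        String message;
        String serializedParameters;      // XML-serialized parameters, so rich types like document references survive
    }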

Location of the Implementation

The implementation of the job status store is currently in xwiki-commons. Unfortunately, in xwiki-commons, we don’t have any APIs for database or Solr support. It feels wrong to me to remove the support for storing job logs from xwiki-commons. For this reason, I suggest that we re-implement JobStatusStore in xwiki-platform. From what I’ve seen so far, this API is the right level of abstraction for having two different implementations targeting the file system and the database, respectively. There might be some parts that could be shared, e.g., regarding the serialization, but it should be easy to refactor those parts into components that can be shared between the implementations.

From my point of view, the plan should be to keep the current implementation indefinitely and to just add the new implementation in xwiki-platform. We would also use the current implementation for migration.
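
To give an idea of the refactoring I mean, the XML (un)serialization that both implementations need could be moved behind a small shared component role, roughly like this (just a sketch; the exact shape depends on what the existing serialization code in xwiki-commons already offers):

    import java.io.InputStream;
    import java.io.OutputStream;

    import org.xwiki.component.annotation.Role;
    import org.xwiki.job.event.status.JobStatus;

    // Hypothetical shared role: the file-based store in xwiki-commons and the new database/blob store
    // in xwiki-platform would both inject this instead of duplicating the serialization code.
    @Role
    public interface JobStatusXMLSerializer
    {
        void write(JobStatus status, OutputStream output) throws Exception;

        JobStatus read(InputStream input) throws Exception;
    }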

Migration

My current idea for the migration of existing data is the following:

  1. A migration would trigger a background job/task to migrate all job logs. We would try unserializing every job status, with a size limit. For a job status that is too big or fails to unserialize, we would try parsing it with a streaming XML parser to extract the job ID and potentially other metadata. I then see two options for treating job statuses that still fail to be parsed:
    1. We could try extracting the job ID from the file name, and use this (potentially partial) job ID for the database entry. Such incomplete entries would be marked and could potentially be retried later.
    2. We don’t migrate job statuses where we can’t at least extract some basic metadata. The migration could be triggered again, e.g., via some admin UI, after we’ve either improved the migration or the errors have been fixed (e.g., an extension that contains a serialized type has been installed again).
  2. When a job status cannot be found in the database, the new implementation falls back to the existing job status store. If the job status can be found there, it is migrated immediately before being returned. Locking is used to prevent the background job from migrating the same job status at the same time (see the sketch below).
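
Roughly, the read path of point 2 could look like this (sketch only; I’m assuming the legacy file-based store stays injectable under a separate, made-up hint, and I’m leaving out the actual database/blob store code and the locking):

    import java.util.List;

    import javax.inject.Inject;
    import javax.inject.Named;

    import org.xwiki.job.JobStatusStore;
    import org.xwiki.job.event.status.JobStatus;

    // Sketch of the new store's read path with migration-on-read (hints and names are hypothetical).
    public class DatabaseJobStatusStore
    {
        @Inject
        @Named("filesystem")
        private JobStatusStore legacyStore;

        public JobStatus getJobStatus(List<String> id)
        {
            JobStatus status = loadFromDatabase(id);
            if (status == null) {
                // Fall back to the existing file-based store.
                status = this.legacyStore.getJobStatus(id);
                if (status != null) {
                    // Migrate immediately before returning; locking (not shown) prevents the background
                    // migration job from migrating the same status at the same time.
                    saveToDatabaseAndBlobStore(status);
                }
            }
            return status;
        }

        private JobStatus loadFromDatabase(List<String> id)
        {
            // Database lookup omitted in this sketch.
            return null;
        }

        private void saveToDatabaseAndBlobStore(JobStatus status)
        {
            // Store metadata + log entries in the database and the full serialized status in the blob store.
        }
    }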

Clustering and Job IDs

The current implementation is completely unaware of clustering: each job status and its logs are only available on the cluster node where the job was started. Many job IDs are supposed to be globally unique, but from what I understand there are also jobs, like extension installation jobs, that exist on every cluster node with the same ID. This caught me by surprise a few days ago, as I had so far somehow assumed that job IDs would be globally unique. Some ideas:

  1. For every job, store the cluster node on which it is created. When querying a job, first search for the job on the current cluster node and then for the job on another cluster node. Add the cluster node as an attribute in the job status. We would need to check that this doesn’t break any code that first checks for an existing job status before creating a new one when the job is supposed to be local to the cluster node. We should probably add a new API to query explicitly for a job of a certain cluster node/of the current cluster node.
  2. Explicitly differentiate between global and local jobs with a new attribute in the job status/request(?). For backwards compatibility, assume that jobs are local by default. Similar to the first idea, we would store the cluster node on which a job is created, but only for local jobs. The existing API would only query for jobs of the current cluster node or global jobs. We would add new APIs to explicitly query jobs of a different cluster node. A question I don’t know how to answer is how we should migrate existing job status data, in particular for jobs that aren’t bundled and thus can’t tell us at migration time whether they are global. Maybe it is okay to just store them as local jobs, as they’re currently only available on one cluster node anyway?

I think I prefer option 2, but option 1 might be a bit more backwards compatible for extensions that provide jobs that should be global.
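
Just to illustrate what the new query API of option 2 could look like (method names invented, nothing final):

    import java.util.List;

    import org.xwiki.job.event.status.JobStatus;

    // Hypothetical additions for option 2. "clusterNodeId" would identify the node that created a local job.
    public interface ClusterAwareJobStatusQuery
    {
        // Existing behaviour: return a global job status or a local one of the current cluster node.
        JobStatus getJobStatus(List<String> id);

        // New: explicitly query the local job status stored by a specific cluster node.
        JobStatus getJobStatus(List<String> id, String clusterNodeId);
    }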

Thank you for reading this far! I’m happy to hear any kind of feedback, but if you don’t have the time to comment on all these details, I would be most interested in whether you agree with the general direction of storing job status data and logs in the database.

I don’t know the domain well, but one option I don’t see mentioned is to use suitable third-party software for this specific task.
It would make the deployment of XWiki slightly more difficult (but setting up an XWiki cluster is probably already not trivial). But it would also allow using a robust technology built for the task instead of relying on what’s available to us.

Otherwise, the db for efficient writing + Solr for the indexation sounds like a good combination.

I guess it’s fine since we were able to store the job logs in the file system so far. But do you have an idea of the impact of your change on disk use?

Sounds good to me.

If you put it on the request it means the job supports both cases and the code triggering the job can decide whether the job is local or global. If you put it on the status, then the job supports only one mode. It’s not clear to me if being local or global is something that should be decided by the job itself or the code triggering the job.

Thanks,
Marius

+1

+1

Sounds good in general.

Note that one problem with the global migration is that a job status (or a log) can very easily be impossible to unserialize because it refers to classes from extensions, which might not be visible from the root classloader (which is most probably what would be available to a global migration). I’m not sure we should try too hard to migrate something we cannot unserialize; we can just hope that 2. will be enough for those.

Same here. But before thinking about allowing the request to indicate that, the job needs a way to indicate what it supports (always local, always global, or both).

Others

Moving the job status/log out of the (easily accessible) filesystem might make the need for a generic UI to navigate them even more important, as looking directly at them (usually for debugging purposes) is quite common.

Definitely.


I’ve searched a bit what the main options are here. From what I found, there seem to be roughly the following options that are maintained and open source:

  • Logstash in combination with ElasticSearch for indexing
    • AGPL software, wasn’t open source for some time
    • This seems to be the main option for log collection, with highly versatile query support
  • Grafana Loki
    • AGPL license
    • Alloy for log collection, Grafana for querying and displaying
    • Metadata index only, no full text search
  • fluentbit
    • Generic log collection and distribution framework
    • Apache 2.0 license, CNCF project
    • Could be used with OpenSearch (Apache 2.0 licensed fork of ElasticSearch) for indexing, it also seems to support PostgreSQL.
    • Can also act as an OpenTelemetry collector; can be used with both ElasticSearch and Grafana Loki

Overall, these appear to be quite complex systems that I wouldn’t want to require for running XWiki. The decision for one of those systems should, in my opinion, be made on the admin side rather than on the developer side. I don’t think we should use or force the use of a particular log collection system.

Regarding simpler options, there is database support for logback in logback-db, but it doesn’t look that well maintained (last real change in 2022) - or is it just “done”?

A further problem with all those solutions is that they won’t preserve parameter types which we currently use in the log displayer for jobs. For example, we support storing document references as log parameters that are then transformed into clickable links in the log displayer.

For all these reasons, I still think that a custom database storage would be the best option for now.

What seems like an interesting idea, though, would be to optionally support such a backend for storing and querying logs.

Regarding storage, I think we should:

  • Add appropriate metadata to job logs to make it easy to filter them at the level of SLF4J/Logback (see the sketch after this list)
  • Optionally, or even by default, disable log isolation for jobs such that centrally configured log collection can capture logs
  • Gather experience on our infrastructure and document how to collect logs, maybe with Grafana Loki
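
For the first point, I’m thinking of something as simple as putting the (serialized) job ID into the SLF4J MDC while the job runs, so that any externally configured appender or log collector can pick it up (minimal sketch; the key name is made up):

    import org.slf4j.MDC;

    // Minimal sketch: tag everything a job logs with its job ID via the SLF4J MDC.
    public final class JobLogMetadata
    {
        private static final String KEY = "xwiki.job.id";

        private JobLogMetadata()
        {
        }

        public static void runWithJobMetadata(String serializedJobId, Runnable jobExecution)
        {
            MDC.put(KEY, serializedJobId);
            try {
                jobExecution.run();
            } finally {
                MDC.remove(KEY);
            }
        }
    }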

Regarding retrieval, I would see if we could make it possible to replace the job log storage independently of the job status storage (see the sketch after this list). The idea would be that you could:

  • Replace or disable job log storage in the database
  • Replace the job log retrieval implementation by one that uses an external API like the log query API of Grafana Loki.
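
Concretely, I imagine splitting the log side into two small roles so that each can be replaced on its own, along these lines (just a sketch, names hypothetical):

    import java.util.List;

    import org.xwiki.component.annotation.Role;
    import org.xwiki.logging.event.LogEvent;

    // Hypothetical role for writing job log entries (database by default; could be disabled or replaced).
    @Role
    public interface JobLogStore
    {
        void append(List<String> jobId, LogEvent logEvent);
    }

    // Hypothetical role for reading them back, which could instead be backed by an external API
    // such as the log query API of Grafana Loki.
    @Role
    interface JobLogRetriever
    {
        List<LogEvent> getLog(List<String> jobId, int offset, int limit);
    }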

If there is interest and sponsoring, we could develop and maintain such an implementation in XS. Having it as an extension sounds interesting but difficult to achieve, as extension operations need logging, too. An open question would still be how to support rich parameter types like document references.

Coming back to this, I think this should be something that is decided by the job itself. I can’t think of any example where both would make sense. Any job that affects the wiki data should probably be global, while jobs that perform some operation on data that is local to the cluster node, or that should be executed on each cluster node, need to be local. If a job really supports both, it could still add an attribute to its request. I think having such an attribute on every job request would just cause confusion.
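
In other words, I would let the job declare its scope, e.g. with something as simple as this (names hypothetical), and only jobs that really support both would look at a request attribute:

    // Hypothetical scope declared by the job implementation itself rather than by the code triggering it.
    public enum JobClusterScope
    {
        /** The job and its status/logs are shared by all cluster nodes (e.g. a page move). */
        GLOBAL,

        /** The job exists independently on each cluster node (e.g. an extension installation). */
        LOCAL
    }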


Thanks for checking.

+1

I’ve just noticed that the whole concept of clustering and remote observation IDs only exists in xwiki-platform while jobs are a concept of xwiki-commons… While I think this doesn’t prevent us from introducing an isClusterNodeLocal flag, it makes it quite weird to introduce the concept of a cluster ID in job status-related APIs. A possibility could be to just not expose the cluster ID and only use it internally to differentiate the job statuses that are stored by different cluster nodes.

We could introduce a new API in xwiki-platform to query job statuses that are stored in the database, and there we could also expose the remote observation ID.

We could also introduce the concept of a ClusterAwareJobStatus in xwiki-platform and consider all other jobs to be node-local.
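
Something like this minimal marker interface in xwiki-platform (sketch, name taken from above):

    import org.xwiki.job.event.status.JobStatus;

    // Hypothetical marker interface: statuses implementing it would be stored and queried globally,
    // while every other job status would be treated as local to the cluster node that created it.
    public interface ClusterAwareJobStatus extends JobStatus
    {
        // Optionally, this could also expose the ID of the creating cluster node,
        // if we decide to expose it at all.
    }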

Any thoughts on this?