Hi everyone,
to better support deployments, in particular clustered deployments, where shared or permanent file storage isn’t always available, I would like to propose introducing a file storage API with backends for the local file system and for S3-compatible object storage. The goal of this API is to make it easy for both XS and extensions to support both backends, so that admins can choose based on their needs. The object storage backend should support cluster setups. For the file system backend this wouldn’t be a goal, but we should of course not make the situation worse than it is today. The default would be the local file system, and file paths should remain the same as they are now as much as possible.
In the end, the goal of this work would be to support a setup with remote Solr and object storage, in which the local file system is only used for temporary files. Not everything that is currently on the file system should be moved to object storage; some parts should instead be moved to the database or Solr.
This storage API should be used for:
- The file system attachment store.
- Deleted attachments and documents.
- Maybe job statuses, but not job logs: object storage doesn’t efficiently support appending to an existing file, so the idea for job logs would be to store them in the database and/or Solr (that’s a separate proposal).
- Possibly the extension repository and extension history, but this requires more thought and a redesign of how extensions are installed in a clustered environment. We also need to consider different scenarios, like extensions that are part of the XWiki deployment itself, which shouldn’t be moved to S3.
The idea is to expose a basic API that allows callers to:
- Store an `InputStream` to a path.
- Get an `InputStream` from a path.
- Copy the contents of a path to another path.
- Delete a path.
- Query basic metadata, such as whether a path exists and its size.
- List all files below a path with an iterator (meant for migrations).
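To make the list above more concrete, here is a minimal sketch of what such an interface could look like. All names (`BlobStore`, `InMemoryBlobStore`, the method names) are hypothetical and only for illustration, not the proposed XWiki API. The in-memory implementation exists only so that the example runs without a file system or an S3 endpoint.

```java
import java.io.ByteArrayInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical interface mirroring the operations listed above.
interface BlobStore {
    void write(String path, InputStream content) throws IOException;
    InputStream read(String path) throws IOException;
    void copy(String source, String target) throws IOException;
    void delete(String path) throws IOException;
    boolean exists(String path) throws IOException;
    long size(String path) throws IOException;
    // Iterates over all paths below the given prefix (meant for migrations).
    Iterator<String> list(String prefix) throws IOException;
}

// Toy backend for illustration only; real backends would wrap the
// local file system or an S3-compatible client.
final class InMemoryBlobStore implements BlobStore {
    // Sorted map so that all keys sharing a prefix are contiguous.
    private final ConcurrentSkipListMap<String, byte[]> blobs = new ConcurrentSkipListMap<>();

    @Override
    public void write(String path, InputStream content) throws IOException {
        blobs.put(path, content.readAllBytes());
    }

    @Override
    public InputStream read(String path) throws IOException {
        return new ByteArrayInputStream(require(path));
    }

    @Override
    public void copy(String source, String target) throws IOException {
        blobs.put(target, require(source));
    }

    @Override
    public void delete(String path) {
        blobs.remove(path);
    }

    @Override
    public boolean exists(String path) {
        return blobs.containsKey(path);
    }

    @Override
    public long size(String path) throws IOException {
        return require(path).length;
    }

    @Override
    public Iterator<String> list(String prefix) {
        String dir = prefix.endsWith("/") ? prefix : prefix + "/";
        // Start at the first key >= dir and stop at the first key outside the prefix.
        return blobs.tailMap(dir).keySet().stream()
            .takeWhile(key -> key.startsWith(dir))
            .iterator();
    }

    private byte[] require(String path) throws FileNotFoundException {
        byte[] data = blobs.get(path);
        if (data == null) {
            throw new FileNotFoundException(path);
        }
        return data;
    }
}

final class BlobStoreDemo {
    public static void main(String[] args) throws IOException {
        BlobStore store = new InMemoryBlobStore();
        byte[] hello = "hello".getBytes(StandardCharsets.UTF_8);
        store.write("store/file/a.txt", new ByteArrayInputStream(hello));
        store.copy("store/file/a.txt", "store/file/b.txt");
        store.delete("store/file/a.txt");
        store.list("store/file").forEachRemaining(System.out::println);
    }
}
```

Returning an `Iterator` from `list` rather than a collection is one way to keep migrations of large stores from loading every path into memory at once; an object storage backend could fetch listing pages lazily behind it.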
At the moment, the idea would be to have a storage “manager” that allows getting a storage with a certain prefix like `store/file`, `jobs/status`, or `extension`. The idea behind this is to allow separate configuration and behavior for each storage.
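As a rough illustration of the prefix idea, the sketch below shows a manager that hands out a storage scoped to a prefix and picks a backend per prefix, falling back to the local file system by default. The names `StorageManager`, `ScopedStorage`, and `StorageBackend` are hypothetical, and the configuration is reduced to a plain map purely for the example.

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical backend choice; the proposal names these two backends.
enum StorageBackend { FILESYSTEM, S3 }

// Hypothetical handle for one prefix, carrying its own configuration.
final class ScopedStorage {
    private final String prefix;
    private final StorageBackend backend;

    ScopedStorage(String prefix, StorageBackend backend) {
        this.prefix = prefix;
        this.backend = backend;
    }

    StorageBackend backend() {
        return backend;
    }

    // Maps a path relative to this storage onto the full key or file path.
    String resolve(String relativePath) {
        boolean escapes = Arrays.asList(relativePath.split("/")).contains("..");
        if (relativePath.startsWith("/") || escapes) {
            throw new IllegalArgumentException("Path must stay below the prefix: " + relativePath);
        }
        return prefix + "/" + relativePath;
    }
}

// Hypothetical manager: one entry point, separate configuration per prefix.
final class StorageManager {
    private final Map<String, StorageBackend> backendByPrefix;

    StorageManager(Map<String, StorageBackend> backendByPrefix) {
        this.backendByPrefix = Map.copyOf(backendByPrefix);
    }

    ScopedStorage getStorage(String prefix) {
        // Default to the local file system when nothing is configured.
        StorageBackend backend = backendByPrefix.getOrDefault(prefix, StorageBackend.FILESYSTEM);
        return new ScopedStorage(prefix, backend);
    }
}

final class StorageManagerDemo {
    public static void main(String[] args) {
        StorageManager manager = new StorageManager(Map.of("store/file", StorageBackend.S3));
        ScopedStorage attachments = manager.getStorage("store/file");
        ScopedStorage jobs = manager.getStorage("jobs/status");
        System.out.println(attachments.backend() + " " + attachments.resolve("xwiki/Main/a.png"));
        System.out.println(jobs.backend() + " " + jobs.resolve("job1/status.xml"));
    }
}
```

With this shape, an admin could move only attachments to object storage while leaving other prefixes on the local file system.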
I found two existing options for common APIs for file and object storage access (Apache Commons VFS and Java NIO), but I believe that our own implementation will impose fewer limitations, and those that remain will be under our control.
I haven’t covered caching here, as the plan wouldn’t be to implement it right away, but it could be added later. For caching in particular, we might need more direct APIs to, e.g., find out whether a file has changed in the object storage. Another area where existing abstractions might be lacking is support for conditional writes, which we could use to give some guarantees about concurrent updates.
You can find some more details in the “remove permanent file storage” proposal.
Before continuing with the implementation, I would like to ask if you have any feedback regarding the direction in general, and the introduction of such a storage API in particular. Thank you very much for your feedback!