Introducing a file storage API

Hi everyone,

To better support deployments, in particular clustered deployments where shared/permanent file storage isn’t always available, I would like to propose introducing a file storage API with backends for the local file system and for S3-compatible object storage. The goal of this API is to make it easy to support both in XS and in extensions, such that admins can choose based on their needs. The object storage backend should support cluster setups; for the file system backend this wouldn’t be a goal (but of course the goal would still be not to make the situation worse). The default would be the local file system, and file paths should remain the same as they are now as much as possible.

In the end, the goal of this work would be to support a setup with remote Solr and object storage, in which the local file system is only used for temporary files. Not everything that is currently on the file system should be moved to object storage, some parts should also be moved to the database or Solr.

This storage API should be used for:

  • The file system attachment store.
  • Deleted attachments and documents.
  • Maybe job statuses (but not job logs, as object storage doesn’t efficiently support appending to an existing file; the idea for job logs would be to store them in the database and/or Solr, but that’s a separate proposal).
  • Possibly the extension repository and extension history, but this requires more thought and a redesign of how extensions are installed in a clustered environment, and we need to consider different scenarios, like extensions that are part of the XWiki deployment itself, which shouldn’t be moved to S3.

The idea is to expose a basic API that allows you to:

  1. Store an InputStream to a path.
  2. Get an InputStream from a path.
  3. Copy the contents of a path to another path.
  4. Delete a path.
  5. Query basic metadata, like whether a path exists and its size.
  6. List all files below a path with an iterator (meant for migrations).
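As a sketch, the API surface described above could look roughly like the following. All names and signatures are hypothetical, and the in-memory backend only exists to make the sketch runnable; the real backends would target the local file system and S3-compatible object storage:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of the proposed storage API; all names and
// signatures are made up for illustration, nothing here is decided.
public class FileStoreDemo
{
    interface FileStore
    {
        void write(String path, InputStream content) throws IOException;  // 1.
        InputStream read(String path) throws IOException;                 // 2.
        void copy(String source, String target) throws IOException;       // 3.
        void delete(String path) throws IOException;                      // 4.
        boolean exists(String path);                                      // 5.
        long size(String path);                                           // 5.
        Iterator<String> list(String prefix);                             // 6.
    }

    // Trivial in-memory backend, just to make the sketch executable.
    static class InMemoryFileStore implements FileStore
    {
        private final Map<String, byte[]> files = new TreeMap<>();

        public void write(String path, InputStream content) throws IOException
        {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            content.transferTo(buffer);
            this.files.put(path, buffer.toByteArray());
        }

        public InputStream read(String path) throws IOException
        {
            byte[] data = this.files.get(path);
            if (data == null) {
                throw new IOException("No such path: " + path);
            }
            return new ByteArrayInputStream(data);
        }

        public void copy(String source, String target) throws IOException
        {
            write(target, read(source));
        }

        public void delete(String path)
        {
            this.files.remove(path);
        }

        public boolean exists(String path)
        {
            return this.files.containsKey(path);
        }

        public long size(String path)
        {
            return this.files.get(path).length;
        }

        public Iterator<String> list(String prefix)
        {
            return this.files.keySet().stream().filter(p -> p.startsWith(prefix)).iterator();
        }
    }

    public static void main(String[] args) throws IOException
    {
        FileStore store = new InMemoryFileStore();
        store.write("store/file/attachment.png",
            new ByteArrayInputStream("data".getBytes(StandardCharsets.UTF_8)));
        store.copy("store/file/attachment.png", "store/file/copy.png");
        System.out.println(store.exists("store/file/copy.png"));
        System.out.println(store.size("store/file/copy.png"));
        store.delete("store/file/attachment.png");
        System.out.println(store.exists("store/file/attachment.png"));
    }
}
```

Note that the API deliberately stays stream-based and avoids anything (random access, appending) that object storage cannot support efficiently.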

At the moment, the idea would be to have a storage “manager” that allows getting a storage with a certain prefix like store/file, jobs/status, or extension. The idea behind this is to allow separate configuration and behavior for each storage.
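To illustrate the manager idea, here is a minimal sketch of how prefixes could be mapped to differently configured backends. Everything here (class names, the enum, the hard-coded mapping) is made up; in a real implementation the mapping would come from the admin's configuration:

```java
import java.util.Map;

// Hypothetical sketch of the storage "manager" idea: each logical prefix
// (store/file, jobs/status, extension, ...) can be mapped to its own
// backend and configuration. All names are made up for illustration.
public class StorageManagerSketch
{
    enum Backend { FILESYSTEM, S3 }

    // In a real implementation this mapping would come from configuration;
    // here it is hard-coded. Prefixes without an explicit entry fall back
    // to the local file system, matching the proposed default.
    private static final Map<String, Backend> CONFIGURED_BACKENDS = Map.of(
        "store/file", Backend.S3,
        "jobs/status", Backend.S3
    );

    static Backend backendFor(String prefix)
    {
        return CONFIGURED_BACKENDS.getOrDefault(prefix, Backend.FILESYSTEM);
    }

    public static void main(String[] args)
    {
        System.out.println(backendFor("store/file"));
        System.out.println(backendFor("extension"));
    }
}
```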

I found two existing options for common APIs for file and object storage access (Apache Commons VFS, Java NIO), but I believe that our own implementation will give us fewer limitations, or at least limitations that we can control better.

I haven’t discussed this here, as the plan wouldn’t be to implement it right away, but an idea could also be to implement caching; in particular for that we might need more direct APIs to, e.g., find out if a file has changed in the object storage. Another area where abstractions might be lacking is support for conditional writes, which we could use to give some kind of guarantees about concurrent updates.
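To make the conditional-write idea more concrete, here is an in-memory sketch of what such an extension could look like, using a version tag comparable to an object storage ETag to detect concurrent updates. The names are hypothetical and this is only one possible shape for the API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Optional;

// Hypothetical sketch of a conditional-write extension to the storage API.
// A version tag (comparable to an ETag in object storage) lets a caller
// detect that someone else wrote the path in the meantime. Illustrative only.
public class ConditionalWriteSketch
{
    private final Map<String, String> contents = new HashMap<>();
    private final Map<String, Long> versions = new HashMap<>();
    private long nextVersion = 1;

    // Returns the current version tag of a path, if it exists.
    synchronized Optional<String> version(String path)
    {
        return Optional.ofNullable(this.versions.get(path)).map(String::valueOf);
    }

    // Writes only if the caller still holds the latest version tag
    // (null meaning "only create, never overwrite"). Returns true on success.
    synchronized boolean writeIfMatch(String path, String expectedVersion, String content)
    {
        String current = version(path).orElse(null);
        if (!Objects.equals(expectedVersion, current)) {
            return false;
        }
        this.contents.put(path, content);
        this.versions.put(path, this.nextVersion++);
        return true;
    }

    public static void main(String[] args)
    {
        ConditionalWriteSketch store = new ConditionalWriteSketch();
        System.out.println(store.writeIfMatch("a.txt", null, "v1")); // created
        String tag = store.version("a.txt").orElseThrow();
        System.out.println(store.writeIfMatch("a.txt", null, "v2")); // already exists
        System.out.println(store.writeIfMatch("a.txt", tag, "v2"));  // tag still current
        System.out.println(store.writeIfMatch("a.txt", tag, "v3"));  // stale tag
    }
}
```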

You can find some more details at the remove permanent file storage proposal.

Before continuing with the implementation, I would like to ask if you have any feedback regarding the direction in general, and the introduction of such a storage API in particular. Thank you very much for your feedback!

I second @MichaelHamann’s proposal.

(not much to add since we had a meeting with @MichaelHamann about that before that proposal)

Sounds good.

It’s a pity not to reuse an existing lib (like Apache VFS, which we’re using already; see https://extensions.xwiki.org/xwiki/bin/view/Extension/# in case you didn’t know).

I see there’s an S3 implementation for VFS too (last release is from 2024): abashev/vfs-s3 on GitHub, an Amazon S3 driver for the Apache Commons VFS project (basically a one-man effort, judging by the contributors page).

It would give us several backends for free (that our users may or may not need)…

Related and for fun (old stuff): [xwiki-devs] [Proposal] Virtual file system

As we found out in the chat, this is using TrueVFS and not Apache VFS. And TrueVFS appears unmaintained: the last change was in 2021, and there is an open PR from 2021. I also couldn’t find any indication that TrueVFS might support S3. We do not seem to be using Apache VFS in any other place.

From the documentation of Apache VFS:

“There may only be a single input or output stream open for the file at any time.”

As file objects are cached, we also cannot work around this by obtaining several file objects for the same resource. Therefore, this sounds like a very important limitation for us.

Also, TrueVFS is for archives, and our need is for individual files.

To add more context to this, I found the issue VFS-684, which documents that the thread-safety of Apache Commons VFS is basically undocumented and that there are real issues, maybe requiring different file system instances per thread. This would make it super difficult to use this API in XWiki; we would at least need a wrapper around it to hide these thread-safety issues. The issue is old (from 2018), but I’m not convinced that the situation has changed in the meantime.