Clustering in Kubernetes not working (second instance failing to start properly)

Hi,

Before starting work on an XWiki cluster, I had a single-instance XWiki 12.10.9 Docker image (with some minor tweaks) running in Kubernetes as a Deployment. I mount the xwiki.properties and xwiki.cfg files from a ConfigMap into the /etc/xwiki/ directory of the pod/container. This is in AWS, using a GP2 persistent volume for the perm data. That is all working fine.

I’m now switching to running XWiki in a cluster but the second instance in the cluster fails to start properly. I’ll include details of what I’ve done and the errors/exceptions I’m seeing. Any help would be appreciated, and thanks in advance.

So far, I’ve done the following:

  • Switch to a StatefulSet.
  • Set up and use AWS EFS for a shared perm data dir.
  • I’ve created a “tcp-k8s.xml” file that I’ve put into WEB-INF/observation/remote/jgroups/
  • Added the Jar file for org.jgroups.kubernetes:jgroups-kubernetes:1.0.16.Final to the WEB-INF/lib/ to enable support for KUBE_PING
    • I chose 1.0.16 as that appeared to match the JGroups version in XWiki 12.10.9.
  • Updated xwiki.properties to add the following:
    • observation.remote.enabled=true
    • observation.remote.channels=tcp-k8s
  • Currently sticking with integrated Solr until I have the rest of the cluster running properly (…which, according to the Performance section of https://www.xwiki.org/xwiki/bin/view/Documentation/AdminGuide/Clustering/, should be ok). I’ll then move to an external Solr.
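
As a quick sanity check of the two xwiki.properties settings listed above, here is a minimal sketch in Python. The helper names `parse_properties` and `check_remote_observation` are made up for illustration and are not part of XWiki; the parser only handles simple `key=value` lines, not the full Java properties syntax.

```python
# Hypothetical helpers (not part of XWiki): verify that the remote
# observation settings mentioned in the post are present in xwiki.properties.

def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_remote_observation(props, expected_channel):
    """Return a list of problems; an empty list means both settings look right."""
    problems = []
    if props.get("observation.remote.enabled", "").lower() != "true":
        problems.append("observation.remote.enabled is not true")
    channels = [
        name.strip()
        for name in props.get("observation.remote.channels", "").split(",")
        if name.strip()
    ]
    if expected_channel not in channels:
        problems.append(f"channel {expected_channel!r} is not listed")
    return problems


# The two lines from the post, used as sample input.
SAMPLE = """\
observation.remote.enabled=true
observation.remote.channels=tcp-k8s
"""

print(check_remote_observation(parse_properties(SAMPLE), "tcp-k8s"))  # prints []
```

The channel name has to match the JGroups file name without its extension, which is why `tcp-k8s` is passed here for the `tcp-k8s.xml` file.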

When the first instance comes up, it appears to be ok judging by:

...other logs...
-------------------------------------------------------------------
GMS: address=xwiki-0-32014, cluster=event, physical address=127.0.0.1:7800
-------------------------------------------------------------------
2021-09-27 09:05:04,141 [OfficeProcessThread-0] INFO  .o.r.i.j.JGroupsNetworkAdapter - Channel [tcp-k8s] started
...other logs...

When the second instance comes up, I get the exceptions below.

2021-09-27 11:18:56,914 [localhost-startStop-1] ERROR .o.i.DefaultObservationManager - Failed to lookup listeners 
org.xwiki.component.manager.ComponentLookupException: Failed to lookup component with type [interface org.xwiki.observation.EventListener] and hint [MentionsCreatedEventListener]
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstanceMap(EmbeddableComponentManager.java:245)
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstanceList(EmbeddableComponentManager.java:225)
	at org.xwiki.observation.internal.DefaultObservationManager.initializeListeners(DefaultObservationManager.java:166)
	at org.xwiki.observation.internal.DefaultObservationManager.getListenersByEvent(DefaultObservationManager.java:132)
	at org.xwiki.observation.internal.DefaultObservationManager.notify(DefaultObservationManager.java:283)
	at org.xwiki.component.internal.StackingComponentEventManager.sendEvent(StackingComponentEventManager.java:151)
	at org.xwiki.component.internal.StackingComponentEventManager.flushEvents(StackingComponentEventManager.java:92)
	at org.xwiki.container.servlet.XWikiServletContextListener.contextInitialized(XWikiServletContextListener.java:124)
	at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4763)
	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5232)
	at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
	at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:753)
	at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:727)
	at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:695)
	at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1177)
	at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1925)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.xwiki.component.manager.ComponentLookupException: Failed to lookup component [org.xwiki.mentions.internal.DefaultMentionsEventExecutor] identified by type [interface org.xwiki.mentions.internal.MentionsEventExecutor] and hint [default]
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:204)
	at org.xwiki.component.embed.EmbeddableComponentManager.getDependencyInstance(EmbeddableComponentManager.java:406)
	at org.xwiki.component.embed.EmbeddableComponentManager.createInstance(EmbeddableComponentManager.java:355)
	at org.xwiki.component.embed.EmbeddableComponentManager.getComponentInstance(EmbeddableComponentManager.java:451)
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstanceMap(EmbeddableComponentManager.java:242)
	... 20 common frames omitted
Caused by: org.xwiki.component.phase.InitializationException: Failed to initialize the queue
	at org.xwiki.mentions.internal.DefaultMentionsEventExecutor.initialize(DefaultMentionsEventExecutor.java:96)
	at org.xwiki.component.embed.InitializableLifecycleHandler.handle(InitializableLifecycleHandler.java:39)
	at org.xwiki.component.embed.EmbeddableComponentManager.createInstance(EmbeddableComponentManager.java:365)
	at org.xwiki.component.embed.EmbeddableComponentManager.getComponentInstance(EmbeddableComponentManager.java:451)
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:201)
	... 24 common frames omitted
Caused by: java.lang.IllegalStateException: The file is locked: nio:/usr/local/xwiki/data/mentions/mvqueue [1.4.200/7]
	at org.h2.mvstore.DataUtils.newIllegalStateException(DataUtils.java:950)
	at org.h2.mvstore.FileStore.open(FileStore.java:172)
	at org.h2.mvstore.MVStore.<init>(MVStore.java:381)
	at org.h2.mvstore.MVStore.open(MVStore.java:502)
	at org.xwiki.mentions.internal.async.DefaultMentionsBlockingQueueProvider.initBlockingQueue(DefaultMentionsBlockingQueueProvider.java:69)
	at org.xwiki.mentions.internal.DefaultMentionsEventExecutor.initialize(DefaultMentionsEventExecutor.java:94)
	... 28 common frames omitted
2021-09-27 11:18:56,924 [localhost-startStop-1] ERROR .o.i.DefaultObservationManager - Failed to lookup the Event Listener [MentionsUpdatedEventListener] corresponding to the Component registration event for [org.xwiki.mentions.internal.listeners.MentionsUpdatedEventListener]. Ignoring the event 
org.xwiki.component.manager.ComponentLookupException: Failed to lookup component [org.xwiki.mentions.internal.listeners.MentionsUpdatedEventListener] identified by type [interface org.xwiki.observation.EventListener] and hint [MentionsUpdatedEventListener]
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:204)
	at org.xwiki.observation.internal.DefaultObservationManager.onEventListenerComponentAdded(DefaultObservationManager.java:383)
	at org.xwiki.observation.internal.DefaultObservationManager.onComponentEvent(DefaultObservationManager.java:354)
	at org.xwiki.observation.internal.DefaultObservationManager.notify(DefaultObservationManager.java:299)
	at org.xwiki.component.internal.StackingComponentEventManager.sendEvent(StackingComponentEventManager.java:151)
	at org.xwiki.component.internal.StackingComponentEventManager.flushEvents(StackingComponentEventManager.java:92)
	at org.xwiki.container.servlet.XWikiServletContextListener.contextInitialized(XWikiServletContextListener.java:124)
	at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4763)
	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5232)
	at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
	at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:753)
	at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:727)
	at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:695)
	at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1177)
	at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1925)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.xwiki.component.manager.ComponentLookupException: Failed to lookup component [org.xwiki.mentions.internal.DefaultMentionsEventExecutor] identified by type [interface org.xwiki.mentions.internal.MentionsEventExecutor] and hint [default]
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:204)
	at org.xwiki.component.embed.EmbeddableComponentManager.getDependencyInstance(EmbeddableComponentManager.java:406)
	at org.xwiki.component.embed.EmbeddableComponentManager.createInstance(EmbeddableComponentManager.java:355)
	at org.xwiki.component.embed.EmbeddableComponentManager.getComponentInstance(EmbeddableComponentManager.java:451)
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:201)
	... 19 common frames omitted
Caused by: org.xwiki.component.phase.InitializationException: Failed to initialize the queue
	at org.xwiki.mentions.internal.DefaultMentionsEventExecutor.initialize(DefaultMentionsEventExecutor.java:96)
	at org.xwiki.component.embed.InitializableLifecycleHandler.handle(InitializableLifecycleHandler.java:39)
	at org.xwiki.component.embed.EmbeddableComponentManager.createInstance(EmbeddableComponentManager.java:365)
	at org.xwiki.component.embed.EmbeddableComponentManager.getComponentInstance(EmbeddableComponentManager.java:451)
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:201)
	... 23 common frames omitted
Caused by: java.lang.IllegalStateException: The file is locked: nio:/usr/local/xwiki/data/mentions/mvqueue [1.4.200/7]
	at org.h2.mvstore.DataUtils.newIllegalStateException(DataUtils.java:950)
	at org.h2.mvstore.FileStore.open(FileStore.java:172)
	at org.h2.mvstore.MVStore.<init>(MVStore.java:381)
	at org.h2.mvstore.MVStore.open(MVStore.java:502)
	at org.xwiki.mentions.internal.async.DefaultMentionsBlockingQueueProvider.initBlockingQueue(DefaultMentionsBlockingQueueProvider.java:69)
	at org.xwiki.mentions.internal.DefaultMentionsEventExecutor.initialize(DefaultMentionsEventExecutor.java:94)
	... 27 common frames omitted
2021-09-27 11:18:56,934 [localhost-startStop-1] ERROR .o.i.DefaultObservationManager - Failed to lookup the Event Listener [MentionsCreatedEventListener] corresponding to the Component registration event for [org.xwiki.mentions.internal.listeners.MentionsCreatedEventListener]. Ignoring the event 
org.xwiki.component.manager.ComponentLookupException: Failed to lookup component [org.xwiki.mentions.internal.listeners.MentionsCreatedEventListener] identified by type [interface org.xwiki.observation.EventListener] and hint [MentionsCreatedEventListener]
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:204)
	at org.xwiki.observation.internal.DefaultObservationManager.onEventListenerComponentAdded(DefaultObservationManager.java:383)
	at org.xwiki.observation.internal.DefaultObservationManager.onComponentEvent(DefaultObservationManager.java:354)
	at org.xwiki.observation.internal.DefaultObservationManager.notify(DefaultObservationManager.java:299)
	at org.xwiki.component.internal.StackingComponentEventManager.sendEvent(StackingComponentEventManager.java:151)
	at org.xwiki.component.internal.StackingComponentEventManager.flushEvents(StackingComponentEventManager.java:92)
	at org.xwiki.container.servlet.XWikiServletContextListener.contextInitialized(XWikiServletContextListener.java:124)
	at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4763)
	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5232)
	at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
	at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:753)
	at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:727)
	at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:695)
	at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1177)
	at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1925)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.xwiki.component.manager.ComponentLookupException: Failed to lookup component [org.xwiki.mentions.internal.DefaultMentionsEventExecutor] identified by type [interface org.xwiki.mentions.internal.MentionsEventExecutor] and hint [default]
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:204)
	at org.xwiki.component.embed.EmbeddableComponentManager.getDependencyInstance(EmbeddableComponentManager.java:406)
	at org.xwiki.component.embed.EmbeddableComponentManager.createInstance(EmbeddableComponentManager.java:355)
	at org.xwiki.component.embed.EmbeddableComponentManager.getComponentInstance(EmbeddableComponentManager.java:451)
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:201)
	... 19 common frames omitted
Caused by: org.xwiki.component.phase.InitializationException: Failed to initialize the queue
	at org.xwiki.mentions.internal.DefaultMentionsEventExecutor.initialize(DefaultMentionsEventExecutor.java:96)
	at org.xwiki.component.embed.InitializableLifecycleHandler.handle(InitializableLifecycleHandler.java:39)
	at org.xwiki.component.embed.EmbeddableComponentManager.createInstance(EmbeddableComponentManager.java:365)
	at org.xwiki.component.embed.EmbeddableComponentManager.getComponentInstance(EmbeddableComponentManager.java:451)
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:201)
	... 23 common frames omitted
Caused by: java.lang.IllegalStateException: The file is locked: nio:/usr/local/xwiki/data/mentions/mvqueue [1.4.200/7]
	at org.h2.mvstore.DataUtils.newIllegalStateException(DataUtils.java:950)
	at org.h2.mvstore.FileStore.open(FileStore.java:172)
	at org.h2.mvstore.MVStore.<init>(MVStore.java:381)
	at org.h2.mvstore.MVStore.open(MVStore.java:502)
	at org.xwiki.mentions.internal.async.DefaultMentionsBlockingQueueProvider.initBlockingQueue(DefaultMentionsBlockingQueueProvider.java:69)
	at org.xwiki.mentions.internal.DefaultMentionsEventExecutor.initialize(DefaultMentionsEventExecutor.java:94)
	... 27 common frames omitted
2021-09-27 11:18:56,945 [localhost-startStop-1] ERROR .o.i.DefaultObservationManager - Failed to lookup the Event Listener [MentionsApplicationReadyEventListener] corresponding to the Component registration event for [org.xwiki.mentions.internal.listeners.MentionsApplicationReadyEventListener]. Ignoring the event 
org.xwiki.component.manager.ComponentLookupException: Failed to lookup component [org.xwiki.mentions.internal.listeners.MentionsApplicationReadyEventListener] identified by type [interface org.xwiki.observation.EventListener] and hint [MentionsApplicationReadyEventListener]
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:204)
	at org.xwiki.observation.internal.DefaultObservationManager.onEventListenerComponentAdded(DefaultObservationManager.java:383)
	at org.xwiki.observation.internal.DefaultObservationManager.onComponentEvent(DefaultObservationManager.java:354)
	at org.xwiki.observation.internal.DefaultObservationManager.notify(DefaultObservationManager.java:299)
	at org.xwiki.component.internal.StackingComponentEventManager.sendEvent(StackingComponentEventManager.java:151)
	at org.xwiki.component.internal.StackingComponentEventManager.flushEvents(StackingComponentEventManager.java:92)
	at org.xwiki.container.servlet.XWikiServletContextListener.contextInitialized(XWikiServletContextListener.java:124)
	at org.apache.catalina.core.StandardContext.listenerStart(StandardContext.java:4763)
	at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5232)
	at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:183)
	at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:753)
	at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:727)
	at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:695)
	at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1177)
	at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1925)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.xwiki.component.manager.ComponentLookupException: Failed to lookup component [org.xwiki.mentions.internal.DefaultMentionsEventExecutor] identified by type [interface org.xwiki.mentions.internal.MentionsEventExecutor] and hint [default]
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:204)
	at org.xwiki.component.embed.EmbeddableComponentManager.getDependencyInstance(EmbeddableComponentManager.java:406)
	at org.xwiki.component.embed.EmbeddableComponentManager.createInstance(EmbeddableComponentManager.java:355)
	at org.xwiki.component.embed.EmbeddableComponentManager.getComponentInstance(EmbeddableComponentManager.java:451)
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:201)
	... 19 common frames omitted
Caused by: org.xwiki.component.phase.InitializationException: Failed to initialize the queue
	at org.xwiki.mentions.internal.DefaultMentionsEventExecutor.initialize(DefaultMentionsEventExecutor.java:96)
	at org.xwiki.component.embed.InitializableLifecycleHandler.handle(InitializableLifecycleHandler.java:39)
	at org.xwiki.component.embed.EmbeddableComponentManager.createInstance(EmbeddableComponentManager.java:365)
	at org.xwiki.component.embed.EmbeddableComponentManager.getComponentInstance(EmbeddableComponentManager.java:451)
	at org.xwiki.component.embed.EmbeddableComponentManager.getInstance(EmbeddableComponentManager.java:201)
	... 23 common frames omitted
Caused by: java.lang.IllegalStateException: The file is locked: nio:/usr/local/xwiki/data/mentions/mvqueue [1.4.200/7]
	at org.h2.mvstore.DataUtils.newIllegalStateException(DataUtils.java:950)
	at org.h2.mvstore.FileStore.open(FileStore.java:172)
	at org.h2.mvstore.MVStore.<init>(MVStore.java:381)
	at org.h2.mvstore.MVStore.open(MVStore.java:502)
	at org.xwiki.mentions.internal.async.DefaultMentionsBlockingQueueProvider.initBlockingQueue(DefaultMentionsBlockingQueueProvider.java:69)
	at org.xwiki.mentions.internal.DefaultMentionsEventExecutor.initialize(DefaultMentionsEventExecutor.java:94)
	... 27 common frames omitted

This suggests that several things are trying to manipulate this database at the same time. Is it possible that it is shared between several instances of XWiki? Only one is supposed to use it.
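
To illustrate the kind of conflict behind "The file is locked", here is a small analogy in Python. It is not H2's code: the MVStore takes its lock from Java, whereas this sketch uses `flock` on a Unix-like system, and the `mvqueue` file lives in a temporary directory rather than in a real permanent directory. The point is only that a second opener of the same file cannot get an exclusive lock while the first still holds it.

```python
import fcntl
import os
import tempfile


def try_exclusive_lock(path):
    """Open path and try a non-blocking exclusive lock.

    Returns the open file descriptor on success, or None if another
    open file description already holds the lock.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def demo_lock_conflict():
    """Simulate two instances opening the same queue file in a shared directory."""
    with tempfile.TemporaryDirectory() as shared_dir:
        queue_file = os.path.join(shared_dir, "mvqueue")
        first = try_exclusive_lock(queue_file)   # stands in for the first instance
        second = try_exclusive_lock(queue_file)  # stands in for the second instance
        result = (first is not None, second is not None)
        for fd in (first, second):
            if fd is not None:
                os.close(fd)
        return result


first_ok, second_ok = demo_lock_conflict()
print("first opener got the lock:", first_ok)
print("second opener got the lock:", second_ok)
```

Giving each instance its own copy of the directory removes the contention, because each process then locks a different file.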

Hi @tmortagne, the instances are using a shared perm data directory as suggested by https://www.xwiki.org/xwiki/bin/view/Documentation/AdminGuide/Clustering/. The install is using the Standard Flavor.

Judging by the comments in the XWiki Matrix channel, it sounds like @vmassol is updating the Clustering Guide so maybe that will clarify what should be in a shared perm dir and what shouldn’t?

Sorry, I should be clear that the use of the shared perm dir is suggested by Loading..., one of the issues/limitations noted on that page. We also had a discussion on Loading... where using a shared perm dir was also recommended.

Maybe I’ve read the “shared perm dir” as a blanket statement where in fact it should be specific directories which are shared whilst others remain node specific?

Yes, the permdir contains much more than “the attachments storage” which is only the store/file subfolder. I updated the issue description.

Ok, thanks @tmortagne.

Does this table look ok for what could be mapped as a shared directory vs instance specific? Any improvements/corrections?

Dir Shared?
cache no
cache/extension no
cache/solr no
extension no
extension/history no
extension/repository yes
jobs no
mentions no
store no
store/file yes
store/solr no

I’ve documented it at https://www.xwiki.org/xwiki/bin/view/Documentation/AdminGuide/Clustering/?viewer=changes&rev1=14.1&rev2=14.2&

I don’t think you should share extension/repository since extensions are cluster-aware already (see Loading...).

Thank you @vmassol.

The doc update also looks good.

And store/solr is not recommended to be shared (best is to use a remote SOLR setup) but it should work and be marked optional in your table I think. See https://www.xwiki.org/xwiki/bin/view/Documentation/AdminGuide/Clustering/#HPerformances

Ah ok, the reason I marked extension/repository as shared was because of a discussion I had with @tmortagne on Loading....

No, it won’t work (lock conflict) and embedded solr is already cluster aware.

Yes as we discussed it’s not recommended but it should work.

I’ll be switching to a remote Solr once I have clustering working as I was trying to avoid tackling too many tasks/changes at the same time :slightly_smiling_face:

Ok, I didn’t quite get “not recommended” from your comment on XWIKI-1441. You did mention…

There is a small risk of conflict in theory (like always with a shared drive), but it’s probably very unlikely in the context of extensions.

…but as we discussed it further, the risk and likelihood appeared minimal.

The ability to scale an XWiki cluster up, or even down, means that there isn’t really going to be a “the cluster is X instances” from an operations point-of-view. Having to prep a perm data dir before a new instance can be added to the cluster is a complexity I’d like to avoid if I can, if using a shared dir is viable. As it sounds like the advice is still “it should work” whether or not it’s recommended, that’s what I’ll start with.

Thanks again @tmortagne and @vmassol

For reference in case anyone else looks at this post. The updated shared data table looks like:

Dir Can be shared?
cache no
cache/extension no
cache/solr no
extension no
extension/history no
extension/repository yes (not recommended but, if not shared is then subject to XWIKI-11441)
jobs no
mentions no
store no
store/file yes
store/solr no
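
The table above can also be expressed as a lookup, which may help when deciding how to mount each sub-directory. This is a minimal sketch in Python: the `SHAREABLE` mapping copies the table, while the function name `can_be_shared` and the example paths are made up for illustration. A path takes the answer of the deepest listed directory that contains it.

```python
from pathlib import PurePosixPath

# "Can be shared?" per directory, relative to the permanent directory,
# copied from the table in this thread.
SHAREABLE = {
    "cache": False,
    "cache/extension": False,
    "cache/solr": False,
    "extension": False,
    "extension/history": False,
    "extension/repository": True,  # "yes", but not recommended per the thread
    "jobs": False,
    "mentions": False,
    "store": False,
    "store/file": True,
    "store/solr": False,
}


def can_be_shared(relative_path):
    """Return the table entry for the deepest listed directory containing
    relative_path, or None if no listed directory contains it."""
    path = PurePosixPath(relative_path)
    for candidate in [path, *path.parents]:
        key = candidate.as_posix()
        if key in SHAREABLE:
            return SHAREABLE[key]
    return None


print(can_be_shared("mentions/mvqueue"))       # False
print(can_be_shared("store/file/some/file"))   # True (hypothetical path)
print(can_be_shared("store/solr"))             # False
```

Under this reading, the `mentions/mvqueue` file from the stack traces belongs on instance-specific storage.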

@tmortagne then we need to update the doc at https://www.xwiki.org/xwiki/bin/view/Documentation/AdminGuide/Clustering/#HPerformances

Why? This documentation suggests using a standalone Solr instance, not sharing its storage between several instances (it even explicitly indicates that keeping the Solr indexes local is a supported use case).

Sure, but there is still a risk (and an untested one) and only a small gain, since installing extensions is something you can plan. That’s why it’s not recommended in my mind.

Yes, in the case of an unstable list of nodes, you probably don’t have much choice.

Ok, I mixed stuff indeed. All is fine.

IMHO, in an enterprise environment where you can probably over-provision the cluster initially and the number of users isn’t going to change drastically, the cluster getting smaller is unlikely and its growth is an event you can certainly plan for. E.g. a company acquisition may cause employee numbers to jump and may prompt the need for one or more additional XWiki instances; however, that is known ahead of time and can be planned for a fixed date.

In my case, how long the number of nodes will remain constant is not something I know right now. As such, scaling up the cluster is something that will be done on an as-needed basis, based on feedback from monitoring. It won’t be done automatically (yet), but I’d like the complexity of that operation to be as low as possible. I also tend to think of things in terms of “what if this has to be done at 2am because of a service issue” :slightly_smiling_face: As such, I like things to be as simple as possible. As I’m deploying in Kubernetes, the difference would be:

Using Shared dir:

  1. kubectl -n <namespace> scale statefulset xwiki --replicas=X

Not using shared dir:

  1. kubectl -n <namespace> scale statefulset xwiki --replicas=X
    • At this point, you either have to affect traffic routing to make sure the new instances do not receive traffic or you run the risk of users hitting an instance that doesn’t yet have all the extensions.
  2. Copy an existing PersistentVolume's contents to another machine.
  3. Upload the contents from that machine to the new PersistentVolume(s) that were created by scaling the StatefulSet
  4. Restart each new XWiki instance.
  5. Enable traffic.

That’s of course all in theory, I’ve not tried it :slightly_smiling_face:
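
For scripting the single scale step, here is a minimal sketch in Python that only builds the command line and does not run it. The function name `scale_command` and the namespace `wiki` are made up for illustration; the StatefulSet name `xwiki` is inferred from the pod name `xwiki-0` in the logs earlier in the thread.

```python
import shlex


def scale_command(namespace, statefulset, replicas):
    """Build the kubectl argument list that scales a StatefulSet."""
    if replicas < 0:
        raise ValueError("replicas must not be negative")
    return [
        "kubectl", "-n", namespace,
        "scale", "statefulset", statefulset,
        f"--replicas={replicas}",
    ]


# Namespace "wiki" is a placeholder for illustration.
print(shlex.join(scale_command("wiki", "xwiki", 3)))
# prints: kubectl -n wiki scale statefulset xwiki --replicas=3
```

Building the argument list separately makes it easy to review or log the exact command before handing it to something like `subprocess.run`.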