Clustering in Kubernetes not working (second instance failing to start properly)

And store/solr is not recommended to be shared (best is to use a remote Solr setup), but it should work and be marked optional in your table, I think. See https://www.xwiki.org/xwiki/bin/view/Documentation/AdminGuide/Clustering/#HPerformances

Ah ok, the reason I marked extension/repository as shared was because of a discussion I had with @tmortagne on https://jira.xwiki.org/browse/XWIKI-11441.

No, it won’t work (lock conflict), and the embedded Solr is already cluster-aware.

Yes as we discussed it’s not recommended but it should work.

I’ll be switching to a remote Solr once I have clustering working as I was trying to avoid tackling too many tasks/changes at the same time :slightly_smiling_face:
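When I do make that switch, my current understanding is that it boils down to a couple of lines in xwiki.properties; something like the sketch below (untested, the property keys are from memory, the file path assumes the official Tomcat-based image, and the Solr host name is just a placeholder):

  # Sketch only: point XWiki at a standalone Solr instead of the embedded one.
  # The xwiki.properties path assumes the official Tomcat-based image.
  CFG=/usr/local/tomcat/webapps/ROOT/WEB-INF/xwiki.properties
  echo 'solr.type=remote' >> "$CFG"
  echo 'solr.remote.url=http://solr:8983/solr/xwiki' >> "$CFG"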

Ok, I didn’t quite get “not recommended” from your comment on XWIKI-11441. You did mention…

There is a small risk of conflict in theory (like always with a shared drive), but it’s probably very unlikely in the context of extensions.

…but as we discussed it further, the risk and likelihood appeared minimal.

The ability to scale an XWiki cluster up, or even down, means that from an operations point of view there isn’t really a fixed “the cluster is X instances”. Having to prep a permanent data dir before a new instance can be added to the cluster is complexity I’d like to avoid if I can, provided using a shared dir is viable. As it sounds like the advice is still “it should work”, whether or not it’s recommended, that’s what I’ll start with.

Thanks again @tmortagne and @vmassol

For reference, in case anyone else looks at this post, the updated shared data table looks like:

Dir                     Can be shared?
cache                   no
cache/extension         no
cache/solr              no
extension               no
extension/history       no
extension/repository    yes (not recommended, but if not shared you are then subject to XWIKI-11441)
jobs                    no
mentions                no
store                   no
store/file              yes
store/solr              no
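To make that split concrete, here’s roughly how I plan to lay it out on each node. This is a sketch only: the /usr/local/xwiki/data permanent directory is what I expect from the official image, and the /mnt/xwiki-shared mount point is just my own naming for a shared ReadWriteMany volume.

  # Only extension/repository and store/file point at the shared volume;
  # everything else stays on node-local storage.
  SHARED=/mnt/xwiki-shared        # shared RWX volume (assumed mount point)
  DATA=/usr/local/xwiki/data      # XWiki permanent directory (assumed path)
  mkdir -p "$SHARED/extension/repository" "$SHARED/store/file"
  mkdir -p "$DATA/extension" "$DATA/store"
  ln -sfn "$SHARED/extension/repository" "$DATA/extension/repository"
  ln -sfn "$SHARED/store/file" "$DATA/store/file"

In practice I’d probably achieve the same thing with subPath volume mounts on the StatefulSet rather than symlinks, but the idea is the same.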

@tmortagne then we need to update the doc at https://www.xwiki.org/xwiki/bin/view/Documentation/AdminGuide/Clustering/#HPerformances

Why? This documentation suggests using a standalone Solr instance, not sharing its storage between several instances (it even explicitly indicates that keeping Solr indexes local is a supported use case).

Sure, but there is still a risk (and an untested one), and only a small gain, since installing extensions is something you can plan. That’s why it’s not recommended in my mind.

Yes, in the case of an unstable list of nodes, you probably don’t have much choice.

Ok, I mixed stuff indeed. All is fine.

IMHO, in an enterprise environment where you can probably over-provision the cluster initially and the number of users isn’t going to change drastically, the cluster getting smaller is unlikely, and it growing is an event you can certainly plan for. E.g. a company acquisition may cause employee numbers to jump and prompt the need for one or more additional XWiki instances; however, that is known ahead of time and can be planned for a fixed date.

In my case, I don’t currently know how long the number of nodes will remain constant. As such, scaling up the cluster is something that will be done on an as-needed basis, based on feedback from monitoring. It won’t be done automatically (yet), but I’d like the complexity of that operation to be as low as possible. I also tend to think of things in terms of “what if this has to be done at 2am because of a service issue” :slightly_smiling_face: so I like things to be as simple as possible. As I’m deploying in Kubernetes, the difference would be:

Using Shared dir:

  1. kubectl -n <namespace> scale statefulset xwiki --replicas=X

Not using shared dir:

  1. kubectl -n <namespace> scale statefulset xwiki --replicas=X
    • At this point, you either have to adjust traffic routing so the new instances do not receive traffic, or you run the risk of users hitting an instance that doesn’t yet have all the extensions.
  2. Copy an existing PersistentVolume's contents to another machine.
  3. Upload the contents from that machine to the new PersistentVolume(s) that were created by scaling the StatefulSet (a rough sketch of steps 2–3 follows below).
  4. Restart each new XWiki instance.
  5. Enable traffic.

That’s of course all in theory, I’ve not tried it :slightly_smiling_face:
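For what it’s worth, I’d expect steps 2–3 to look something like the following (equally untested; the pod names and the /usr/local/xwiki/data path are assumptions on my part):

  # Stream the permanent directory out of an existing pod, then into the new one.
  kubectl -n <namespace> exec xwiki-0 -- tar -C /usr/local/xwiki/data -cf - . > xwiki-data.tar
  kubectl -n <namespace> exec -i xwiki-1 -- tar -C /usr/local/xwiki/data -xf - < xwiki-data.tar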

As an aside, I must say that I’m very impressed with XWiki and what it is capable of. It seems like whatever you want to do can be done; there’s usually a setting for it somewhere, and really the only challenge is sometimes finding out how.

You’ve created an extremely capable platform, and doing it as an open source project is also hugely commendable. This could very easily have been a closed source product. You’re engaged on the Forum and on the Matrix channel. I have to say that you’re a wonderful example of how a great open source project can be run.

Just wanted to say Thank You :slightly_smiling_face:


Thanks a lot Alex, that’s heart-warming and we often get more complaints than praises so that feels good :slight_smile:

I’ve taken the liberty of putting your testimonial at Testimonials (XWiki.org) (I hope it’s fine with you!).

Thanks


Thanks a lot Alex, that’s heart-warming and we often get more complaints than praises so that feels good :slight_smile:

You’re welcome and hopefully you can pass on my thanks to the rest of the team. You’re doing a great job, and I have no reason to doubt you’ll all keep doing so :slight_smile:

I’ve taken the liberty of putting your testimonial at Testimonials (XWiki.org) (I hope it’s fine with you!).

Of course, no problem.

Hi. I’ll tack this question onto this thread as I’ve run into another issue which is cluster-related and might be related to storage.

I’ve now got multiple instances of XWiki running with just extension/repository and store/file as shared directories/storage.

The XWiki install was created when there was one instance running. I then scaled to 2 instances. When traffic hits the second instance, it tries to redirect to the distribution wizard, but then the first node receives that traffic and redirects back to /main/WebHome; that hits the second node, which again tries to redirect, etc.

From a quick look at the distribution wizard code, I think it uses the job status to determine whether it needs to run or not. As such, I copied the job status from instance 1’s storage into instance 2’s storage. That didn’t appear to do much, but once I restarted instance 2 the issue was resolved.
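In case it helps anyone, the copy was along these lines (a rough sketch rather than my exact commands; the pod names and the permanent-directory path are placeholders):

  # Pull the distribution job status off instance 1, then push it onto instance 2.
  kubectl cp <namespace>/xwiki-0:/usr/local/xwiki/data/jobs/status/distribution/status.xml status.xml
  kubectl cp status.xml <namespace>/xwiki-1:/usr/local/xwiki/data/jobs/status/distribution/status.xml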

My question is: does this suggest that the jobs dir needs to be shared storage, or (and I suspect this is more likely) does it indicate that my cluster nodes are not communicating properly? Or something else entirely?

Thanks in advance.

Yes and no. If there is nothing to do, you won’t get the DW, but if you explicitly cancel steps (for example, you have invalid extensions installed which trigger the “Extensions” step), then this cancellation is stored in the job status.

This feels like a bad load balancer setup, which should ideally be a bit more sticky (keep the same session on the same node). Otherwise, you are going to have a lot of problems with everything job-based if the load balancer randomly asks for the job status on the wrong node.

I would not recommend it, as jobs are by nature associated with a specific node. Sharing it would allow a node to access the logs of finished jobs which ran on another node (which can probably be interesting for some kinds of job-based features without clustering support), but it would lead to a lot of conflicts (notably with job-based features that do have clustering support).

Thanks @tmortagne. I’ll keep the job storage instance-specific, and I’ll make the load balancer more session-sticky (I thought I’d done that, but seeing the loop shows it wasn’t working as I’d thought).
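(For anyone else hitting this: assuming the ingress-nginx controller, cookie-based affinity is roughly the sketch below; the ingress name and cookie name are placeholders, so adjust for your own load balancer.)

  # Sketch: enable cookie-based session affinity on an ingress named "xwiki".
  # Assumes the ingress-nginx controller; annotation names per its docs.
  kubectl -n <namespace> annotate ingress xwiki \
    nginx.ingress.kubernetes.io/affinity=cookie \
    nginx.ingress.kubernetes.io/session-cookie-name=XWIKIROUTE \
    --overwrite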

Sorry, but I don’t understand how that explains why the second instance thinks the Distribution Wizard is necessary. Or are you saying that if the jobs dir is kept instance-specific but the DW is still triggering, then it implies the cluster comms aren’t working properly?

When a new node joins the cluster, and given the jobs dir is instance-specific, how does the instance determine whether to run the DW or not? Is there something specific I can look for that would indicate that the new instance has caught up with the cluster properly? Obviously, whether or not I see the DW is an indication of things working, but ideally there’d be something specific I can look for.

I’ll keep plugging at it in the meantime as I’m sure you’re very busy.

Cheers

If you see the DW, it means you cancelled a step (the most likely cause being unresolved invalid extensions from some upgrade which are still installed); if you fix those, you won’t have the DW anymore, status file or not. The only other alternatives are:

  • actually execute the DW and cancel the steps again when you create a new instance
  • copy the status when you create a new node

Each step has triggers. When you get the DW, click Next and you will see which step is enabled, which will tell you why you have the DW.

I didn’t cancel any steps when I was going through the UI, and looking at the jobs/status/distribution/status.xml output, it appears that all of the steps are marked as completed.

I sorted out the session stickiness in the LB and this time ran through the DW. It did in fact skip most steps; I just needed to click Continue on Events Migration and Pages. This was just a manual step I was not expecting, and because the initial page of the DW appeared to be the same as when doing a fresh XWiki install, I thought something was wrong with my setup/cluster.

I’ll experiment with distribution.job.interactive to see if that can get me what I’d ideally want.
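If it does what I hope, I’d expect it to boil down to a one-liner on the extra nodes, something like this sketch (the xwiki.properties path is an assumption for the official image, and I haven’t verified the behaviour yet):

  # Sketch (untested): make the distribution job non-interactive so new nodes
  # don't present the Distribution Wizard.
  echo 'distribution.job.interactive=false' >> /usr/local/tomcat/webapps/ROOT/WEB-INF/xwiki.properties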

Thanks again for the help.

<?xml version="1.0" encoding="UTF-8"?>
<org.xwiki.extension.distribution.internal.job.DistributionJobStatus>
  <progress>
    <listenerName>org.xwiki.job.internal.DefaultJobProgress_1240794181</listenerName>
    <rootStep>
      <message>
        <message>Progress with name [{}]</message>
        <marker class="org.xwiki.logging.marker.TranslationMarker">
          <name>xwiki.translation</name>
          <translationKey>job.progress</translationKey>
        </marker>
        <argumentArray>
          <null/>
        </argumentArray>
      </message>
      <index>0</index>
      <offset>1.0</offset>
      <maximumChildren>7</maximumChildren>
      <childSize>0.14285714285714285</childSize>
      <children>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>0</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>958426906551450</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>15114032127</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>1</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>958442020610523</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>16593841728</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>2</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>958458614477678</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>1351122066606</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>3</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>959809736573188</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>9575</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>4</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>959809736586269</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>5211</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>5</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>959809736594263</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>5210</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>6</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>959809736602152</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>30277372517</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
      </children>
      <startTime>958426906551450</startTime>
      <finished>true</finished>
      <levelFinished>true</levelFinished>
      <elapsedTime>1413108019114</elapsedTime>
    </rootStep>
  </progress>
  <jobType>distribution</jobType>
  <state>FINISHED</state>
  <request class="org.xwiki.extension.distribution.internal.job.DistributionRequest">
    <id>
      <string>distribution</string>
    </id>
    <properties>
      <entry>
        <string>wiki</string>
        <string>xwiki</string>
      </entry>
      <entry>
        <string>interactive</string>
        <boolean>true</boolean>
      </entry>
      <entry>
        <string>user.reference</string>
        <null/>
      </entry>
    </properties>
    <verbose>true</verbose>
  </request>
  <startDate>2021-09-27 16:12:28.75 UTC</startDate>
  <endDate>2021-09-27 16:36:01.182 UTC</endDate>
  <isolated>true</isolated>
  <canceled>false</canceled>
  <cancelable>false</cancelable>
  <serialized>true</serialized>
  <quesionEnd>-1</quesionEnd>
  <distributionExtension>
    <id>org.xwiki.platform:xwiki-platform-distribution-docker</id>
    <version class="org.xwiki.extension.version.internal.DefaultVersion" serialization="custom">
      <org.xwiki.extension.version.internal.DefaultVersion>
        <string>12.10.9</string>
      </org.xwiki.extension.version.internal.DefaultVersion>
    </version>
  </distributionExtension>
  <stepList>
    <org.xwiki.extension.distribution.internal.job.step.WelcomeDistributionStep>
      <stepId>welcome</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.WelcomeDistributionStep>
    <org.xwiki.extension.distribution.internal.job.step.FirstAdminUserStep>
      <stepId>firstadminuser</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.FirstAdminUserStep>
    <org.xwiki.extension.distribution.internal.job.step.FlavorDistributionStep>
      <stepId>extension.flavor</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.FlavorDistributionStep>
    <org.xwiki.extension.distribution.internal.job.step.CleanExtensionsDistributionStep>
      <stepId>extension.clean</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.CleanExtensionsDistributionStep>
    <org.xwiki.extension.distribution.internal.job.step.OutdatedExtensionsDistributionStep>
      <stepId>extension.outdatedextensions</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.OutdatedExtensionsDistributionStep>
    <org.xwiki.extension.distribution.internal.job.step.EventMigrationStep>
      <stepId>eventmigration</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.EventMigrationStep>
    <org.xwiki.extension.distribution.internal.job.step.ReportDistributionStep>
      <stepId>report</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.ReportDistributionStep>
  </stepList>
  <currentStateIndex>6</currentStateIndex>

This one is triggered when you have more events (+10 to be safe) in the legacy database than in the new Solr core. And indeed this one can be skipped without clicking CANCEL.