Clustering in Kubernetes not working (second instance failing to start properly)

As an aside, I must say that I’m very impressed with XWiki and what it is capable of. It seems like whatever you want to do can be done; there’s usually a setting for it somewhere, and really the only challenge is sometimes finding out how.

You’ve created an extremely capable platform, and doing so as an open source project is also hugely commendable. This could have very easily been a closed source product. You’re engaged with the Forum and on the Matrix channel. I have to say that you’re a wonderful example of how a great open source project can be run.

Just wanted to say Thank You :slightly_smiling_face:

Thanks a lot Alex, that’s heart-warming and we often get more complaints than praises so that feels good :slight_smile:

I’ve taken the liberty of putting your testimonial at Testimonials (XWiki.org) (I hope it’s fine with you!).

Thanks

Thanks a lot Alex, that’s heart-warming and we often get more complaints than praises so that feels good :slight_smile:

You’re welcome and hopefully you can pass on my thanks to the rest of the team. You’re doing a great job, and I have no reason to doubt you’ll all keep doing so :slight_smile:

I’ve taken the liberty of putting your testimonial at Testimonials (XWiki.org) (I hope it’s fine with you!).

Of course, no problem.

Hi. I’ll tack this question on to this thread as I’ve run into another issue which is cluster related and might be related to storage.

I’ve now got multiple instances of xwiki running with just the extensions/repository and store/files as shared directories/storage.

The XWiki install was created when there was one instance running. I then scaled to 2 instances. When traffic hits the second instance, it tries to redirect to the distribution wizard, but then the first node receives that traffic which redirects back to /main/WebHome, which hits the second node and that tries to redirect back to WebHome, etc.

From a quick look at the distribution wizard code, I think that it uses the job status to determine whether it needs to run or not. As such, I copied the job status from the storage in instance 1 into the storage for instance 2. That didn’t appear to do much but once I restarted instance 2 the issue was resolved.

My question is, does this suggest that the jobs dir needs to be a shared storage or (and I suspect this is more likely) does it indicate that my cluster nodes are not communicating properly? Or something else entirely?

Thanks, in advance.

Does the DW use the job status to decide whether it needs to run? Yes and no. If there is nothing to do, you won’t get the DW, but if you explicitly cancel steps (for example, you have invalid extensions installed which trigger the “Extensions” step), then this cancellation is stored in the job status.

The redirect loop feels like a bad load balancer setup, which should ideally be a bit more sticky (keep the same session on the same node). Otherwise, you are going to have a lot of problems with everything job-based if the load balancer randomly asks for the job status on the wrong node.
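
To make the loop concrete, here is a minimal sketch in Python. It is only a toy model of the behaviour described in this thread, not XWiki or load balancer code: the node names `a` and `b`, the `/distribution` path, and the `handle`/`follow` helpers are all made up for illustration.

```python
from itertools import cycle

# Hypothetical model: node "a" has nothing left to do in the Distribution
# Wizard (DW), node "b" still thinks it must run it.
NODE_WANTS_DW = {"a": False, "b": True}


def handle(node, path):
    """Return the redirect target a node answers with, or None if it serves the page."""
    if NODE_WANTS_DW[node] and path != "/distribution":
        return "/distribution"   # this node wants to show the wizard
    if not NODE_WANTS_DW[node] and path == "/distribution":
        return "/main/WebHome"   # this node sees nothing to do
    return None


def follow(pick_node, path="/main/WebHome", max_hops=10):
    """Follow redirects, asking pick_node() which node receives each request."""
    for hop in range(max_hops):
        target = handle(pick_node(), path)
        if target is None:
            return hop           # page served after `hop` redirects
        path = target
    return None                  # gave up: redirect loop


round_robin = cycle(["b", "a"])
loop_result = follow(lambda: next(round_robin))  # requests alternate between nodes
sticky_a = follow(lambda: "a")                   # session pinned to node a
sticky_b = follow(lambda: "b")                   # session pinned to node b
```

With alternating nodes the request never settles, while a pinned session either gets the page directly (node `a`) or lands on the wizard after one redirect (node `b`).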

I would not recommend sharing the jobs directory, as jobs are by nature associated with a specific node. Sharing it would allow a node to access the logs of finished jobs which have run on another node (which can probably be interesting for some kinds of job-based features without clustering support), but it would lead to a lot of conflicts (notably with job-based features that have clustering support).

Thanks @tmortagne. I’ll keep the Job storage instance specific, and I’ll make the Load Balancer more session sticky (I thought I’d done that, but on seeing the loop I could see it wasn’t working as I’d thought).

Sorry, but I don’t understand how that explains why the second instance thinks the Distribution Wizard is necessary. Or are you saying that if the jobs dir is kept instance specific but the DW is still triggering, then it implies the cluster comms aren’t working properly?

When a new node joins the cluster, and as the Jobs dir is instance specific, how does the instance determine whether to run the DW or not? Is there something specific I can look for that would indicate that the new instance has caught up with the cluster properly? Obviously, whether or not I see the DW is an indication of things working but, ideally there’d be something specific I can look for.

I’ll keep plugging at it in the meantime as I’m sure you’re very busy.

Cheers

If you see the DW, then it means you cancelled a step (most likely you have unresolved invalid extensions, caused by some upgrade, which are still installed). If you fix those, you won’t have the DW anymore, status file or not. The only other alternatives are:

  • actually execute the DW and cancel the steps again when you create a new instance
  • copy the status when you create a new node
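
The second option (copying the status) could be scripted along these lines. This is a sketch under assumptions: the relative path comes from the `jobs/status/distribution/status.xml` location mentioned in this thread, treating it as relative to each node's permanent directory is my assumption, and `copy_distribution_status` is a hypothetical helper, not an XWiki tool.

```python
import shutil
from pathlib import Path

# Relative location of the DW status file as mentioned in the thread;
# the base directory it is resolved against is an assumption.
STATUS_RELATIVE = Path("jobs/status/distribution/status.xml")


def copy_distribution_status(source_permdir, target_permdir):
    """Copy the DW status file from one node's directory to another node's."""
    source = Path(source_permdir) / STATUS_RELATIVE
    target = Path(target_permdir) / STATUS_RELATIVE
    target.parent.mkdir(parents=True, exist_ok=True)  # create jobs/status/distribution
    shutil.copy2(source, target)                      # keeps timestamps too
    return target
```

As reported earlier in the thread, the copied status only seemed to take effect once the second instance was restarted.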

Each step has triggers. When you get the DW, click Next and you will see which step is enabled, which will tell you why you have the DW.

I didn’t cancel any steps when I was going through the UI, and looking at the jobs/status/distribution/status.xml output, it appears that all of the steps are marked as completed.
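
For anyone wanting to check the same thing without reading the XML by eye, a small Python sketch that lists each step and its state is shown here. The embedded sample is a trimmed copy of the shape of the status.xml posted in this thread; `step_states` and `unfinished_steps` are hypothetical helper names, and treating any state other than COMPLETED as "unfinished" is an assumption, since COMPLETED is the only step state that appears in the file.

```python
import xml.etree.ElementTree as ET

# Trimmed sample with the same element names as the status.xml in this thread.
SAMPLE = """
<org.xwiki.extension.distribution.internal.job.DistributionJobStatus>
  <state>FINISHED</state>
  <stepList>
    <org.xwiki.extension.distribution.internal.job.step.WelcomeDistributionStep>
      <stepId>welcome</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.WelcomeDistributionStep>
    <org.xwiki.extension.distribution.internal.job.step.EventMigrationStep>
      <stepId>eventmigration</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.EventMigrationStep>
  </stepList>
</org.xwiki.extension.distribution.internal.job.DistributionJobStatus>
"""


def step_states(xml_text):
    """Return (stepId, state) pairs for every child of <stepList>."""
    root = ET.fromstring(xml_text.strip())
    return [
        (step.findtext("stepId"), step.findtext("state"))
        for step in root.findall("stepList/*")
    ]


def unfinished_steps(xml_text):
    """Return the ids of steps whose state is anything other than COMPLETED."""
    return [step_id for step_id, state in step_states(xml_text) if state != "COMPLETED"]


states = step_states(SAMPLE)
pending = unfinished_steps(SAMPLE)
```

To run it against a real file, read the file's text and pass it to `step_states` instead of `SAMPLE`.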

I sorted the session stickiness in the LB and this time ran through the DW. It did in fact skip most steps, and I just needed to click continue on Events Migration and Pages. This was just a manual step I was not expecting, and because the initial page of the DW appeared to be the same as when doing a fresh XWiki install, I thought something was wrong with my setup/cluster.

I’ll experiment with distribution.job.interactive to see if that can get me what I’d ideally want.

Thanks again for the help.

<?xml version="1.0" encoding="UTF-8"?>
<org.xwiki.extension.distribution.internal.job.DistributionJobStatus>
  <progress>
    <listenerName>org.xwiki.job.internal.DefaultJobProgress_1240794181</listenerName>
    <rootStep>
      <message>
        <message>Progress with name [{}]</message>
        <marker class="org.xwiki.logging.marker.TranslationMarker">
          <name>xwiki.translation</name>
          <translationKey>job.progress</translationKey>
        </marker>
        <argumentArray>
          <null/>
        </argumentArray>
      </message>
      <index>0</index>
      <offset>1.0</offset>
      <maximumChildren>7</maximumChildren>
      <childSize>0.14285714285714285</childSize>
      <children>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>0</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>958426906551450</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>15114032127</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>1</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>958442020610523</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>16593841728</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>2</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>958458614477678</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>1351122066606</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>3</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>959809736573188</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>9575</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>4</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>959809736586269</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>5211</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>5</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>959809736594263</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>5210</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
        <org.xwiki.job.internal.DefaultJobProgressStep>
          <index>6</index>
          <offset>1.0</offset>
          <maximumChildren>-1</maximumChildren>
          <childSize>0.0</childSize>
          <startTime>959809736602152</startTime>
          <finished>true</finished>
          <levelFinished>true</levelFinished>
          <elapsedTime>30277372517</elapsedTime>
        </org.xwiki.job.internal.DefaultJobProgressStep>
      </children>
      <startTime>958426906551450</startTime>
      <finished>true</finished>
      <levelFinished>true</levelFinished>
      <elapsedTime>1413108019114</elapsedTime>
    </rootStep>
  </progress>
  <jobType>distribution</jobType>
  <state>FINISHED</state>
  <request class="org.xwiki.extension.distribution.internal.job.DistributionRequest">
    <id>
      <string>distribution</string>
    </id>
    <properties>
      <entry>
        <string>wiki</string>
        <string>xwiki</string>
      </entry>
      <entry>
        <string>interactive</string>
        <boolean>true</boolean>
      </entry>
      <entry>
        <string>user.reference</string>
        <null/>
      </entry>
    </properties>
    <verbose>true</verbose>
  </request>
  <startDate>2021-09-27 16:12:28.75 UTC</startDate>
  <endDate>2021-09-27 16:36:01.182 UTC</endDate>
  <isolated>true</isolated>
  <canceled>false</canceled>
  <cancelable>false</cancelable>
  <serialized>true</serialized>
  <quesionEnd>-1</quesionEnd>
  <distributionExtension>
    <id>org.xwiki.platform:xwiki-platform-distribution-docker</id>
    <version class="org.xwiki.extension.version.internal.DefaultVersion" serialization="custom">
      <org.xwiki.extension.version.internal.DefaultVersion>
        <string>12.10.9</string>
      </org.xwiki.extension.version.internal.DefaultVersion>
    </version>
  </distributionExtension>
  <stepList>
    <org.xwiki.extension.distribution.internal.job.step.WelcomeDistributionStep>
      <stepId>welcome</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.WelcomeDistributionStep>
    <org.xwiki.extension.distribution.internal.job.step.FirstAdminUserStep>
      <stepId>firstadminuser</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.FirstAdminUserStep>
    <org.xwiki.extension.distribution.internal.job.step.FlavorDistributionStep>
      <stepId>extension.flavor</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.FlavorDistributionStep>
    <org.xwiki.extension.distribution.internal.job.step.CleanExtensionsDistributionStep>
      <stepId>extension.clean</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.CleanExtensionsDistributionStep>
    <org.xwiki.extension.distribution.internal.job.step.OutdatedExtensionsDistributionStep>
      <stepId>extension.outdatedextensions</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.OutdatedExtensionsDistributionStep>
    <org.xwiki.extension.distribution.internal.job.step.EventMigrationStep>
      <stepId>eventmigration</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.EventMigrationStep>
    <org.xwiki.extension.distribution.internal.job.step.ReportDistributionStep>
      <stepId>report</stepId>
      <state>COMPLETED</state>
    </org.xwiki.extension.distribution.internal.job.step.ReportDistributionStep>
  </stepList>
  <currentStateIndex>6</currentStateIndex>

The Events Migration step is triggered when you have more events (+10 to be safe) in the legacy database than in the new Solr core. And indeed, this one can be skipped without clicking CANCEL.
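
Read literally, that trigger is a simple comparison. The Python sketch below is one possible reading, not the actual XWiki code: it assumes "+10 to be safe" means the legacy count must exceed the Solr count by more than 10, and `event_migration_triggered` and its parameters are hypothetical names.

```python
def event_migration_triggered(legacy_events, solr_events, margin=10):
    """One reading of the trigger: legacy store is ahead of Solr by more than `margin`."""
    # `margin` models the "+10 to be safe" from the post; the exact
    # comparison used by XWiki is an assumption here.
    return legacy_events > solr_events + margin


same_counts = event_migration_triggered(500, 500)
slightly_ahead = event_migration_triggered(505, 500)
far_ahead = event_migration_triggered(5000, 0)
```

Under this reading, a node whose Solr core is empty while the legacy store has events would show the step until the events are migrated.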