Solr - Relevancy order incorrect when using multiple words in a query

Hello All,

Continuing my XWiki experimentation after being away for a while. Also finally got around to upgrading from 9.5.1 to 9.9.

I first noticed this in 9.5.1, but still see the behaviour in 9.9. When I enter a search query with multiple terms, I would expect the pages where these terms appear closest together to be at the top of the results (this is one of Solr’s selling points). Instead, the highest results are the pages with the highest occurrence of the individual terms. Sometimes this is because the terms appear in the title, but in the example below I would expect the bottom result to appear at the top.

Searched for: variations environment operate
First 3 results (excluding where one or more of the terms appears in the title):

Is this by design? Can I do anything to make the “pf” parameter more sensitive (see The DisMax Query Parser | Apache Solr Reference Guide 6.6)?

Ideally I would like a document where these terms appear in close proximity to have a higher score than a document with a single one of the terms in the title.

Thanks,
Ben

pf does not seem to be set at all in the code, actually, and I agree it sounds like something that should be.

You should create an issue about that on http://jira.xwiki.org.

If you want to work on that, I would say it probably requires the following:

You should check the search debug mode where the search results scores are explained in detail. See http://extensions.xwiki.org/xwiki/bin/view/Extension/Solr+Search+Application#HSearchDebugMode .

Hope this helps,
Marius

Results of debug mode:

Afraid I don’t really understand what I’m looking at.

MBSE:

10.624639 = sum of:
  10.624639 = max of:
    10.624639 = max of:
      10.624639 = weight(doccontent__:environment in 5015) [SchemaSimilarity], result of:
        10.624639 = score(doc=5015,freq=1.0 = termFreq=1.0
), product of:
          2.0 = boost
          3.4364655 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            6.0 = docFreq
            201.0 = docCount
          1.5458672 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 = termFreq=1.0
            1.2 = parameter k1
            0.75 = parameter b
            610.90546 = avgFieldLength
            83.591835 = fieldLength
    2.5365489 = max of:
      2.5365489 = weight(doccontentraw__:environment in 5015) [SchemaSimilarity], result of:
        2.5365489 = score(doc=5015,freq=1.0 = termFreq=1.0
), product of:
          0.4 = boost
          4.213608 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            7.0 = docFreq
            506.0 = docCount
          1.5049744 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 = termFreq=1.0
            1.2 = parameter k1
            0.75 = parameter b
            464.91898 = avgFieldLength
            83.591835 = fieldLength

Parallel Axis Gears:

9.015588 = sum of:
  9.015588 = max of:
    9.015588 = max of:
      9.015588 = weight(doccontent__:environment in 2314) [SchemaSimilarity], result of:
        9.015588 = score(doc=2314,freq=1.0 = termFreq=1.0
), product of:
          2.0 = boost
          3.4364655 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            6.0 = docFreq
            201.0 = docCount
          1.311753 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 = termFreq=1.0
            1.2 = parameter k1
            0.75 = parameter b
            610.90546 = avgFieldLength
            256.0 = fieldLength
    1.9041862 = max of:
      1.9041862 = weight(doccontentraw__:environment in 2314) [SchemaSimilarity], result of:
        1.9041862 = score(doc=2314,freq=1.0 = termFreq=1.0
), product of:
          0.4 = boost
          4.213608 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            7.0 = docFreq
            506.0 = docCount
          1.1297837 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 = termFreq=1.0
            1.2 = parameter k1
            0.75 = parameter b
            464.91898 = avgFieldLength
            334.36734 = fieldLength

Development Engineering:

8.808027 = sum of:
  2.807226 = max of:
    2.807226 = max of:
      2.807226 = weight(doccontent__:variations in 2720) [SchemaSimilarity], result of:
        2.807226 = score(doc=2720,freq=1.0 = termFreq=1.0
), product of:
          2.0 = boost
          3.2933648 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            7.0 = docFreq
            201.0 = docCount
          0.42619422 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 = termFreq=1.0
            1.2 = parameter k1
            0.75 = parameter b
            610.90546 = avgFieldLength
            2621.44 = fieldLength
    0.38983455 = max of:
      0.38983455 = weight(doccontentraw__:variations in 2720) [SchemaSimilarity], result of:
        0.38983455 = score(doc=2720,freq=1.0 = termFreq=1.0
), product of:
          0.4 = boost
          4.0884447 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            8.0 = docFreq
            506.0 = docCount
          0.23837583 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 = termFreq=1.0
            1.2 = parameter k1
            0.75 = parameter b
            464.91898 = avgFieldLength
            4096.0 = fieldLength
  2.9292035 = max of:
    2.9292035 = max of:
      2.9292035 = weight(doccontent__:environment in 2720) [SchemaSimilarity], result of:
        2.9292035 = score(doc=2720,freq=1.0 = termFreq=1.0
), product of:
          2.0 = boost
          3.4364655 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            6.0 = docFreq
            201.0 = docCount
          0.42619422 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 = termFreq=1.0
            1.2 = parameter k1
            0.75 = parameter b
            610.90546 = avgFieldLength
            2621.44 = fieldLength
    0.40176892 = max of:
      0.40176892 = weight(doccontentraw__:environment in 2720) [SchemaSimilarity], result of:
        0.40176892 = score(doc=2720,freq=1.0 = termFreq=1.0
), product of:
          0.4 = boost
          4.213608 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            7.0 = docFreq
            506.0 = docCount
          0.23837583 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 = termFreq=1.0
            1.2 = parameter k1
            0.75 = parameter b
            464.91898 = avgFieldLength
            4096.0 = fieldLength
    1.1475651 = max of:
      1.1475651 = weight(attcontent__:environment in 2720) [SchemaSimilarity], result of:
        1.1475651 = score(doc=2720,freq=3.0 = termFreq=3.0
), product of:
          0.4 = boost
          3.4904284 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            2.0 = docFreq
            81.0 = docCount
          0.8219371 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            3.0 = termFreq=3.0
            1.2 = parameter k1
            0.75 = parameter b
            4071.9507 = avgFieldLength
            21399.51 = fieldLength
  3.0715985 = max of:
    3.0715985 = max of:
      3.0715985 = weight(doccontent__:operate in 2720) [SchemaSimilarity], result of:
        3.0715985 = score(doc=2720,freq=1.0 = termFreq=1.0
), product of:
          2.0 = boost
          3.6035197 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            5.0 = docFreq
            201.0 = docCount
          0.42619422 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 = termFreq=1.0
            1.2 = parameter k1
            0.75 = parameter b
            610.90546 = avgFieldLength
            2621.44 = fieldLength
    0.41541365 = max of:
      0.41541365 = weight(doccontentraw__:operate in 2720) [SchemaSimilarity], result of:
        0.41541365 = score(doc=2720,freq=1.0 = termFreq=1.0
), product of:
          0.4 = boost
          4.356709 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:
            6.0 = docFreq
            506.0 = docCount
          0.23837583 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:
            1.0 = termFreq=1.0
            1.2 = parameter k1
            0.75 = parameter b
            464.91898 = avgFieldLength
            4096.0 = fieldLength

@tmortagne, thanks for the input.

I’ll have a look to see if I can figure out how to implement what you describe in my instance of XWiki.

Depending on what is discussed here I’ll then report an issue on Jira.

The difference in score comes from the document content fieldLength: 83.591835 vs. 256.0 vs. 2621.44. The importance of a token decreases with the length of the field that contains it. So the longer the document content is, the less important its tokens are.
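To make the length effect concrete, here is a small Python sketch that re-computes the doccontent__:environment score for the three pages using the formulas printed in the debug output (score = boost * idf * tfNorm). The numbers are copied from the explain trees above; the function and variable names are only illustrative, and Solr works in single precision, so the last digits may differ slightly.

```python
import math

K1 = 1.2   # "parameter k1" from the debug output
B = 0.75   # "parameter b" from the debug output


def idf(doc_freq, doc_count):
    # log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def tf_norm(freq, field_length, avg_field_length, k1=K1, b=B):
    # (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    return (freq * (k1 + 1)) / (
        freq + k1 * (1 - b + b * field_length / avg_field_length)
    )


def term_score(boost, doc_freq, doc_count, freq, field_length, avg_field_length):
    return boost * idf(doc_freq, doc_count) * tf_norm(freq, field_length, avg_field_length)


# fieldLength of doccontent__ for each page, as reported in the explain trees
field_lengths = {
    "MBSE": 83.591835,
    "Parallel Axis Gears": 256.0,
    "Development Engineering": 2621.44,
}

# Same term, same boost, same idf inputs; only fieldLength changes.
scores = {
    page: term_score(2.0, 6, 201, 1.0, length, 610.90546)
    for page, length in field_lengths.items()
}

for page, score in scores.items():
    print(f"{page}: {score:.4f}")
```

The only input that differs between the three pages is fieldLength, which is why the same single occurrence of "environment" is worth roughly 10.62 on the short page and roughly 2.93 on the long one.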

Indeed, the distance between the search terms doesn’t seem to influence the score.
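Reading the Development Engineering tree this way, the total looks like a plain "best field per term, then add the terms up", with no clause that rewards the terms being near each other. A rough Python sketch of that reading, using the per-field scores copied from the debug output (the dictionary layout is just for illustration):

```python
# Per-term, per-field scores for "Development Engineering" (doc 2720),
# copied from the explain output.
field_scores = {
    "variations": {
        "doccontent__": 2.807226,
        "doccontentraw__": 0.38983455,
    },
    "environment": {
        "doccontent__": 2.9292035,
        "doccontentraw__": 0.40176892,
        "attcontent__": 1.1475651,
    },
    "operate": {
        "doccontent__": 3.0715985,
        "doccontentraw__": 0.41541365,
    },
}

# "max of": keep only the best-scoring field for each term.
best_per_term = {term: max(fields.values()) for term, fields in field_scores.items()}

# "sum of": add the per-term maxima to get the document score.
total = sum(best_per_term.values())

print(best_per_term)
print(round(total, 4))
```

The result agrees with the 8.808027 reported at the top of that tree, so nothing in the score depends on where in the page the three words sit relative to each other.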

OK, I’ve implemented @tmortagne’s solution in my instance of XWiki and it works brilliantly!!! Much more Google-ish!

Still playing around with the weighting, will leave another comment when I get something that is robust.

The only curiosity I noticed was that the “doccontent” field wouldn’t appear in the full query string (as returned in debug mode) and therefore did nothing. So I had to specify “doccontent_en”, which I don’t really like as I think I’m hardcoding the language.
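For anyone wondering what this looks like at the Solr level, below is a rough Python sketch of the kind of request parameters involved. It is not the XWiki configuration itself: qf, pf and ps are the standard DisMax parameters from the Solr reference guide, the doccontent_en field is the one mentioned above, and the boost and slop values are made-up placeholders rather than the weighting actually used.

```python
from urllib.parse import urlencode

# Hypothetical values for illustration only.
params = {
    "defType": "edismax",
    "q": "variations environment operate",
    # qf: fields searched term by term (boost 2.0 as seen in the debug output)
    "qf": "doccontent_en^2.0",
    # pf: fields where the whole query is also tried as a phrase, for an extra boost
    "pf": "doccontent_en^4.0",
    # ps: phrase slop, i.e. how far apart the terms may be and still count for pf
    "ps": "10",
    "debugQuery": "true",
}

query_string = urlencode(params)
print(query_string)
```

With debugQuery enabled, the parsed query in the debug output is where you can check whether the pf field actually made it into the final query.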

I would definitely recommend this be implemented as standard. I will add it as an issue in Jira.

Ben

Pull requests are welcome :wink:

Pull request submitted: Added Solr sloppy phrase matching capability and config by benmegson · Pull Request #617 · xwiki/xwiki-platform · GitHub

Would be nice to also create a Jira issue about this improvement and reference it in the commits (I guess we can do that as part of the squash merge).