Changing (default) Solr tokenisation (EdgeNGramFilterFactory)

Wardenburg · March 7, 2023, 11:36am

tl;dr: I’m looking to add the Solr Edge N-Gram filter to the Solr indexing (preferably for page titles only).

Example of the issue: a page titled ‘ATS8550’ is not found when a user queries ‘8550’ (a false negative).

Users can be taught additional query syntax (e.g. the wildcard *, or fuzzy/proximity searches).
- This fails in this example, however: 8550* finds nothing because 8550 sits at the end of the title, and a trailing wildcard only matches terms that start with 8550.
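To make the wildcard point concrete, here is a minimal Python sketch. It assumes Solr's * behaves like a glob over a single lowercased token and that the title is indexed as the one token `ats8550`; the helper `wildcard_hits` is hypothetical and only mimics the matching, it does not call Solr.

```python
from fnmatch import fnmatchcase


def wildcard_hits(pattern, tokens):
    """Return the tokens that a glob-style wildcard pattern matches in full."""
    return [t for t in tokens if fnmatchcase(t, pattern.lower())]


# Assumption: the title 'ATS8550' is indexed as one lowercased token.
title_tokens = ["ats8550"]

print(wildcard_hits("8550*", title_tokens))  # trailing wildcard: prefix match only
print(wildcard_hits("*8550", title_tokens))  # leading wildcard: suffix match
```

The first call prints `[]` and the second prints `['ats8550']`. So a leading wildcard would match in principle, but I have not verified whether it is allowed or efficient in this setup, and it is not something I want every user to have to type.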

A well-functioning search engine is critical to adoption and to the project's success, so I’m hoping to add n-gram tokens so that 8550 also returns ATS8550 without additional parameters. (This works in database query mode, but Solr seems superior in many other areas.)

I’ve been diving into the Solr configuration options by reading the following:

So far I’ve added property.XWiki.TagClass.tags^20.0 to Main/SolrSearchConfig:

'queryFields': {
    'DOCUMENT': 'title^10.0 name^10.0
                 doccontent^2.0
                 objcontent^0.4 filename^0.4 attcontent^0.4 doccontentraw^0.4
                 author_display^0.08 creator_display^0.08
                 comment^0.016 attauthor_display^0.016 spaces^0.016
                 property.XWiki.TagClass.tags^20.0',
    'ATTACHMENT': 'filename^5.0 attcontent attauthor_display^0.2',
    'OBJECT': 'objcontent',
    'OBJECT_PROPERTY': 'propertyvalue'
  }

I’d also like to add the Solr Edge N-Gram filter to the Solr indexing, as shown in Analyzers | Apache Solr Reference Guide 6.6.

I’ve found the schemas in /var/lib/xwiki/data/store/solr/extension_index/conf/managed-schema.xml, but the header rather clearly says not to edit it, so I’m wondering where and how to add the EdgeNGramFilterFactory. It would probably be something along the lines of the commented-out filter here:

    <analyzer type="query">
      <tokenizer class="org.apache.lucene.analysis.standard.StandardTokenizerFactory"/>
      <filter class="org.apache.lucene.analysis.core.LowerCaseFilterFactory"/>
<!--
      <filter class="org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="6"/>
-->
    </analyzer>
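To check my understanding of what such a filter would emit, here is a rough Python sketch of edge n-grams versus plain n-grams. It is not Solr code: the functions `edge_ngrams` and `ngrams` are hypothetical stand-ins, the sizes 2 and 6 are taken from the snippet above, and I am assuming the tokenizer and lowercase filter turn the title into the single token `ats8550`.

```python
def edge_ngrams(token, min_gram=2, max_gram=6):
    """Prefixes of the token with lengths min_gram..max_gram (front edge only)."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def ngrams(token, min_gram=2, max_gram=6):
    """All substrings of the token with lengths min_gram..max_gram."""
    return [
        token[i:i + n]
        for i in range(len(token))
        for n in range(min_gram, max_gram + 1)
        if i + n <= len(token)
    ]


# Assumption: 'ATS8550' becomes one lowercased token before the n-gram filter runs.
token = "ats8550"

print(edge_ngrams(token))            # prefixes only: at, ats, ats8, ats85, ats855
print("8550" in edge_ngrams(token))  # is the suffix among the edge n-grams?
print("8550" in ngrams(token))       # is it among the plain n-grams?
```

If this sketch is right, the edge n-grams never contain `8550` because they only grow from the start of the token, whereas plain n-grams do. That suggests NGramFilterFactory might be closer to what I need than EdgeNGramFilterFactory, at the cost of a larger index, but I would welcome correction on that.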

Any ideas or suggestions on how to add n-grams to the (page title) index?

Note: in general I’d say the filtering options in XWiki are very extensive. Given these filtering tools, I’d say search results should aim for high recall.