Configure search exclusions

Hi everyone,

Some users have asked me if there is an easy way to configure Solr search exclusions. At the moment, a standard Solr search query performs the following “exclusions”:

  • access rights (search results don’t include pages that you can’t view)
  • hidden pages (based on your preference)
  • locale (only pages that match the current locale, or the default translation if there’s no translation for the current locale)
  • wiki (search is limited to the current subwiki, and in the case of the main wiki there is a wikisSearchableFromMainWiki configuration)

But it seems this is not enough. Some users would like to be able to exclude specific “spaces” (subtrees of the page hierarchy) from the search. So my first question is: do you think it’s useful to add a configuration for this?

If you find this useful then the next question is: should this be an index-time filter or a query-time filter?

  • Index-time:
    • Pro: Don’t index pages that are not meant to be searched.

    • Con: But, are you sure you’ll never need to perform a custom search query (outside the standard search page / quick search) on those pages? Plus there may be other features of XWiki that expect those pages to be indexed (because we’re starting to rely on Solr more and more).

    • Con: We’ll have to define a syntax for specifying exclusions. We can use regular expressions, but it’s not trivial to express and implement filters like “pages with an object of this type” or “pages created more than 10 years ago”.

    • Con: when modifying the configuration we’ll have to update the index:

      • remove entries that match the new exclusion filter
      • index entries that were matching the old filter but are not matching the new filter anymore (which doesn’t sound easy to do)

      And if you make a mistake when expressing the exclusion filter, you can end up deleting the entire search index.

  • Query-time:
    • Pro: we can use the Solr query syntax to express complex / advanced filters, e.g. -space_prefix:A.B
    • Pro: changing the exclusion filter doesn’t require updating the search index. If you make a mistake, the search results are back as soon as you fix the filter.
    • Con: time and space allocated on indexed pages that may never be queried
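
To make the index update cost of the index-time option concrete, here is a minimal sketch of what a configuration change would have to compute, assuming exclusions are regular expressions matched against the whole page reference. The class name and the plain-string page references are hypothetical and only serve the illustration; nothing here is existing XWiki code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of the index update that an index-time exclusion change would require. */
final class IndexTimeExclusionChange {
    /** Pages that are indexed today but match the new exclusion filter. */
    final List<String> toRemove = new ArrayList<>();

    /** Pages that were excluded so far but no longer match the exclusion filter. */
    final List<String> toIndex = new ArrayList<>();

    /**
     * @param allPages every page reference known to the wiki, not only the indexed ones
     * @param oldFilter the previous exclusion, matched against the whole page reference
     * @param newFilter the new exclusion, matched against the whole page reference
     */
    IndexTimeExclusionChange(List<String> allPages, Pattern oldFilter, Pattern newFilter) {
        for (String page : allPages) {
            boolean wasExcluded = oldFilter.matcher(page).matches();
            boolean isExcluded = newFilter.matcher(page).matches();
            if (isExcluded && !wasExcluded) {
                // Currently in the index: must be deleted.
                toRemove.add(page);
            } else if (wasExcluded && !isExcluded) {
                // Currently missing from the index: must be indexed again.
                toIndex.add(page);
            }
        }
    }
}
```

Note that the second list can only be built by walking all pages of the wiki, since the excluded ones are by definition not in the index. A filter that is too broad by mistake (for example `.*`) puts every indexed page in `toRemove`. With a query-time filter neither list is needed.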

I find the query-time approach better. On the UI side, I would add a new sub-section, named “Searching” before the existing “Indexing” one, in the Solr search administration section. It would have a single field “Exclusions”, a text area, where administrators can write a filter query per line, like:

-space_prefix:A.B

(we could add the minus ourselves, and let the user simply specify a positive filter).
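
A minimal sketch of how the content of such a text area could be turned into filter queries, one per non-blank line. The class and method names are hypothetical. Wrapping a positive filter in parentheses before adding the minus is my assumption, meant to keep multi-term filters grouped; lines that already start with a minus are passed through unchanged.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: turns the "Exclusions" text area into Solr filter queries. */
final class SearchExclusions {
    private SearchExclusions() {
    }

    /**
     * Reads one filter per non-blank line. A line that already starts with '-' is kept as-is;
     * any other line is treated as a positive filter and negated.
     */
    static List<String> toFilterQueries(String textArea) {
        List<String> filterQueries = new ArrayList<>();
        if (textArea == null) {
            return filterQueries;
        }
        // \R matches any line break sequence (\n, \r\n, ...).
        for (String line : textArea.split("\\R")) {
            String filter = line.trim();
            if (filter.isEmpty()) {
                continue;
            }
            filterQueries.add(filter.startsWith("-") ? filter : "-(" + filter + ")");
        }
        return filterQueries;
    }
}
```

Each returned string would then be added to the search query as a separate filter query, so fixing a bad line in the configuration takes effect on the next search.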

Modifying the main search page to obey this configuration should be easy. For the quick search / search suggest (from the top bar), it’s a bit more complex because each search suggest source has its own configuration. We could decide to:

  • apply the search excludes all the time (i.e. modify the SuggestSolrMacros that are used by all sources)
  • allow each search suggest source to indicate whether it follows the standard search exclusions or not

I’d keep it simple and apply the exclusion filters all the time.

WDYT?

Thanks,
Marius

-1 for index-time: it’s error-prone, and it means that backlinks in those pages won’t be found and refactored anymore.

Query-time sounds good. What would be good, though, is to have a real query filter (I mean an implementation of org.xwiki.query.QueryFilter) for Solr queries that can easily be added to any Solr query. That way, it would be easy to apply it in all places that use Solr queries. What I’m wondering is whether we should also support it for database queries, as users might quickly notice that they also want it in certain pickers that (still) rely on the database. If we want database query support too, we should probably have a configuration that is a list of spaces instead of a Solr query.
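
As a rough illustration of the "list of spaces" idea, here is a sketch that renders one list of excluded spaces both as Solr filter queries and as a database where-clause fragment. Everything in it is an assumption made for the example: the class name, the parameter names, the use of `doc.space` on the database side, and the `space_prefix` field taken from the example earlier in this thread. It deliberately does not implement org.xwiki.query.QueryFilter, so that it stays self-contained, and it does not escape LIKE wildcards (`%`, `_`) that might appear in space names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: one list of excluded spaces, rendered for Solr and for the database. */
final class ExcludedSpaces {
    private final List<String> spaces;

    ExcludedSpaces(List<String> spaces) {
        this.spaces = new ArrayList<>(spaces);
    }

    /** One negative filter query per excluded space, e.g. -space_prefix:"A.B". */
    List<String> toSolrFilterQueries() {
        List<String> filterQueries = new ArrayList<>();
        for (String space : spaces) {
            // Inside a quoted value only the backslash and the double quote need escaping.
            String escaped = space.replace("\\", "\\\\").replace("\"", "\\\"");
            filterQueries.add("-space_prefix:\"" + escaped + "\"");
        }
        return filterQueries;
    }

    /**
     * Builds a where-clause fragment that excludes each space and its sub-spaces.
     * Values are bound as named parameters, which are added to the given map.
     */
    String toHqlConstraint(Map<String, Object> parameters) {
        StringBuilder hql = new StringBuilder();
        for (int i = 0; i < spaces.size(); i++) {
            String space = spaces.get(i);
            if (i > 0) {
                hql.append(" and ");
            }
            hql.append("(doc.space <> :excludedSpace").append(i)
                .append(" and doc.space not like :excludedSubtree").append(i).append(")");
            parameters.put("excludedSpace" + i, space);
            // Assumption: nested spaces are stored as "Parent.Child", so "A.B.%" covers the subtree.
            parameters.put("excludedSubtree" + i, space + ".%");
        }
        return hql.toString();
    }
}
```

The trade-off is the one discussed here: a list of spaces can be applied to both query languages, but it cannot express arbitrary Solr filters.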

Yes, it would be more generic (and in that case I would move the configuration outside the Solr configuration section, just below the “Default search engine” configuration) but you lose the flexibility of the Solr query. Now, the users I’ve mentioned only asked (so far) to exclude subtrees from the page hierarchy, so a multi-select “space” picker should be enough (BTW, that picker would have to bypass the exclusion filter, which won’t work with an index-time filter :) ). Let’s see what others think.

Thanks,
Marius

+1 for query time with a database agnostic configuration format.

-1 for index-time, for the same reason. The search core is now more of a page core, and it has been used for much more than the standard search UI for a while now.
