I have a question about a change made over 10 years ago in the Solr search schema. :-/
The analyzer for the field type of “filename” has been changed because of XWIKI-10225 “Can not find attachments with numbers as name”. This includes some tokenization, but also the line:
This removes all tokens that are shorter than 2 chars or longer than 13 chars. I guess I understand why the 2, but where does the 13 come from?
Some users complained about this, as their attachment names contain words longer than that, like `Formularvorlage.pdf` (15 chars without the suffix). You guessed it, German: it is good at stringing words together. Something like `Supercalifragilisticexpialidocious.txt` would not be found by filename either, however.
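To make the effect concrete, here is a rough sketch in plain Python of what a length filter with these bounds does. It is not the Solr code itself: the function name `length_filter` is made up for illustration, and I am assuming both bounds are inclusive.

```python
def length_filter(tokens, min_len=2, max_len=13):
    """Keep only tokens whose length lies in [min_len, max_len] (assumed inclusive)."""
    return [t for t in tokens if min_len <= len(t) <= max_len]


# "a" (1 char) is too short, "internationalization" (20 chars) is too long.
print(length_filter(["a", "report", "2013", "internationalization"]))
# → ['report', '2013']
```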
Is there a reason why the maximal length of a token is 13, or is it just there because a maximal length has to be given?
I don’t remember where the 13 limit comes from. It can definitely be increased. My reasoning might have been:
- users generally don’t type tokens that long when performing a free text search query (the type text_filename we’re discussing is used by the filename field which is designed to be matched by free text searches)
- there is another field, filename_exact, that stores a single token, consisting of the entire file name, and which is designed to be used for exact matches (or prefix / suffix)
- if a file name contains long tokens, then maybe exact / prefix matching, using the filename_exact field, is more suited
For instance, I wouldn’t type the entire “Supercalifragilisticexpialidocious” to find the file, I would type maybe “Super” and wait for the live search to give me some suggestions (expecting the live search to match the prefix against the filename_exact field), and continue typing until I get the file I’m looking for or an empty set of results. But it’s true that the problem with the filename_exact field is that it is case-sensitive, and you might not know the case when searching for your file.
In the end, I’m not against increasing the limit, I don’t see anything wrong with that, but I’m curious how your users are searching for files with such long tokens in their names.
Well, the thing is that these long tokens are also removed from the index.
I.e. if the filename is formularvorlage.docx and users search for formular, they get no search results. (It is a bit different e.g. for formularvorlage_1999.docx, where they do get results for 1999, as that part ends up in the index, but only that part.)
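A small sketch of why the prefix search comes back empty. The tokenizer here is a stand-in (lowercase, then split on anything that is not a letter or digit); the real analyzer chain in the schema may split differently, and the helper names `index_tokens` and `prefix_match` are hypothetical.

```python
import re

MIN_LEN, MAX_LEN = 2, 13  # the bounds discussed in this thread


def index_tokens(filename):
    """Return the tokens that would survive indexing under the assumptions above."""
    # Stand-in tokenizer: lowercase, split on non-word characters and underscores.
    tokens = [t for t in re.split(r"[\W_]+", filename.lower()) if t]
    # Length filter: tokens outside [MIN_LEN, MAX_LEN] never reach the index.
    return [t for t in tokens if MIN_LEN <= len(t) <= MAX_LEN]


def prefix_match(index, query):
    """True if any indexed token starts with the (lowercased) query."""
    q = query.lower()
    return any(t.startswith(q) for t in index)


index = index_tokens("formularvorlage_1999.docx")
print(index)                            # → ['1999', 'docx']
print(prefix_match(index, "formular"))  # → False
print(prefix_match(index, "1999"))      # → True
```

The 15-character token is dropped before it is stored, so there is nothing left for `formular*` to match against.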
Oh, and indeed filename_exact is case sensitive - I did not notice but the first test user instantly complained about that when I suggested it as an alternative.
I can see a scenario where a user notices a long and complex word (or identifier) somewhere, copies it, and then pastes it in the search bar to find related resources in the wiki.
Yes, that’s why I was suggesting to use the filename_exact field instead.
So we have a few options:
- increase the limit, but how much?
- remove the limit (it would be limited by the database column used to store the attachment metadata)
- use the filename_sort field if your search targets only attachments
- add a field similar to filename_sort (lowercase string token) to document entries in the Solr index, multivalued, so that you can target documents with your search as well
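To illustrate the difference between the last two options and filename_exact, here is a minimal Python sketch. It only models the matching behaviour described in this thread (one token per file name, case-sensitive versus lowercased); the function names are hypothetical and nothing here is actual Solr or XWiki code.

```python
def exact_prefix(filenames, query):
    """Case-sensitive prefix match on the whole file name, as filename_exact is described."""
    return [f for f in filenames if f.startswith(query)]


def lowercase_prefix(filenames, query):
    """Prefix match on a lowercased single token, as filename_sort is described."""
    return [f for f in filenames if f.lower().startswith(query.lower())]


files = ["Formularvorlage.pdf", "Supercalifragilisticexpialidocious.txt"]
print(exact_prefix(files, "formular"))      # → []
print(lowercase_prefix(files, "formular"))  # → ['Formularvorlage.pdf']
```

Since the whole file name is kept as one token, no length limit on individual words gets in the way here.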
Yes, but in that case, I would expect the exact (string) match to kick in.
Actually, users are mostly looking at the autocompletion / search suggestions, not the actual search page. That is why they are not seeing results from filename_exact.
However when I add ^debug=true to see the query used in the main search page, I only see (filename:formular*)^5.0 with the default configuration, and no filename_exact anywhere.
Then I’d prefer to keep using the filename index anyway, if only it is not case sensitive.
Removing the limit is not possible. (I tried that; Solr fails to start.) Maybe it is OK to set a maximum value equal to the length of the database column? (I do not mean as a local customization, but as a change to the default config delivered with the platform.)