while investigating XWIKI-21081 and the related issue XWIKI-21082, I found that Tomcat by default parses the request body as ISO-8859-1 or even ascii unless the browser sent an encoding (see its wiki for details). For this reason, we have a “Set Character Encoding” filter configured in web.xml in the WAR. Unfortunately, it turns out that this filter is currently not applied when the routing filter decides that the request should be handled by the ResourceReferenceHandlerServlet. This is because the routing filter is first in the filter chain and when it forwards a request, the request now targets resourceReferenceHandler and the encoding filter no longer matches.
There are several ways to fix the encoding problem:
Move the “Set Character Encoding” filter before the “RoutingFilter”.
Configure the “Set Character Encoding” filter to also match resourceReferenceHandler.
Set the character encoding using <request-character-encoding>UTF-8</request-character-encoding> in web.xml, this requires moving to Servlet 4.0.
Change RoutingFilter to set the character encoding in the case it forwards the request.
I’m +1 for option 1 as it’s the simplest, cleanest option that won’t cause problems with any servlet engines. Option 4 would be nice as it doesn’t require admins modifying web.xml but I’m not sure it is worth the effort/code duplication.
A similar problem concerns the RequestRestorer filter. My understanding is that this would mean that when the user is redirected to login from a resource reference handler due to expired login information, the original request won’t be restored when the user is redirected back to the resource reference handler. Again, we have some options:
Move the “RequestRestorer” filter before the “RoutingFilter” (after the “Set Character Encoding” filter).
Configure the “RequestRestorer” filter to also match resourceReferenceHandler.
I don’t think we also have an option that comes without modifying web.xml and again I’m +1 for the first option as it seems the simplest option.
At least for the character encoding, I’m very sure that we always want a default of UTF-8 as there should really be no difference in this regard between the legacy action servlet and the resource reference handler. Regarding the request restorer filter, I have no idea, I didn’t even know it existed until I discovered it during this search. Still, I would suggest to always apply it as it seems sensible and I don’t see why we should differentiate between the servlets.
I would choose option 3 for the following reasons:
It allows removing code that becomes unnecessary
We need to move to Servlet 4 anyway and all the servlet engine versions we support already support servlet 4. Actually Servlet 4 is pretty old now (the spec is at 6).
It doesn’t impact the URL architecture we have (good to keep the Routing Filter first and not redo our design)
For RequestRestorer, I don’t know enough what it does and the technical limitations as to why it would need to be before the routing filter. I’d be surprised that there aren’t RRHs (Resource Reference Handlers) that don’t need it. Without knowing more, my preference would be to execute it after the RoutingFilter, i.e. option 2.
I have no idea what moving to Servlet 4.0 means, but to me it seems quite risky for a bug fix that we would backport to the LTS branch (it is for fixing a serious regression, after all). From my understanding, it would also mean moving to Java EE 8 (from 7).
The point to be before the routing filter is to not skip it by default. We cannot expect that every developer who implements a RRH explicitly executes these filters that even we as core developers don’t know where not executing them can cause bugs in rare case (a login triggered during a POST request to the RRH for example). The filter does nothing if there is no parameter with a request id in the URL so there should be no harm to always execute it. Note that even option 2 would include always executing it, it would just always be executed after the routing filter (which has the consequence that routing decisions cannot be based on the restored request, but this seems an unlikely consequence).
Note that both options 1 are easy to implement and should be quite safe as a) the encoding filter is really simple and b) the request restorer does nothing unless a request parameter is present. On the other hand, I need help or significantly more time to implement options 2 or option 3 for the encoding problem and there is a significant risk of causing regressions. I most likely won’t be able to implement them for the next releases unless I drop other work from the roadmap.
Note: We need to move to Servlet 4+ anyway (it’s starting to be oldish, 4.0 was introduced in 2017); but we should do this only on master (possibly even wait for 16.x), and not on the LTS branch for sure.
I’m fine with 1/1 as a temporary solution (specially for the LTS), but I don’t think it’s the best design and we should consider it a technical debt that we’ll need to fix in the future.
Thanks for working on this (which is some bug that I probably introduced several years ago when introducing the routing filter…).