XWiki has a security cache that speeds up access control checks. For this, it caches the security rules of documents, the hierarchy of users and groups and the actual access entries of a specific user and reference in an Infinispan cache. The actual access entries are what should be used most frequently. The whole cache is organized as a tree-like parents/children structure (a directed acyclic graph to be precise). Users have groups as parent, documents are nested according to their hierarchy in the wiki and the actual access entries have the document and the user as parent. The primary goal of this is to be able to be able to invalidate everything that depends on, e.g., a group membership of a user when that group membership is changed. For this purpose, when an item is removed from the cache, everything that depends on it is removed, too. This hierarchical information is also used to be able to quickly compute access entries when they are missing.
Now, unfortunately, there seem to be problems when this security cache becomes full. Basically, what seems to happen is that entries that are on the higher levels of the hierarchy are removed while we try inserting items that depend on them. Then, the insertion fails. At the moment, this can lead to errors when this happens repeatedly, as documented in XWIKI-18508. We can work around this by not relying on the cache insertion to succeed. However, this might lead to bad performance.
Infinispan doesn’t evict entries randomly but uses an algorithm named TinyLFU. For this, the cache maintains an approximation of the access frequency of recently accessed objects. Based on this approximation, it can also decide not to accept entries if the candidate that would need to be evicted has been more frequently accessed. This algorithm doesn’t take our hierarchy into account and I think this can lead to really bad decisions. When everything is working fine and basically the “first caching level” which consists of the actual access entries always gives a cache hit, we never access the items that are higher in the hierarchy. This can mean that the cache will decide to evict these items that are higher in the hierarchy and not allow their re-insertion as long as there are still other entries in the cache that have been more frequently accessed. It could therefore happen that, for example, a whole sub-wiki cannot be cached anymore for some time until we tried to insert these higher-level items so frequently that they are considered important again.
I’m therefore proposing to change how we use the (internal) Infinispan cache. My proposal is to use it only to store security access entries but not the document rules/user groups entries. Security access entries are never directly externally invalidated (even though we have an API for that there are no uses of that API in xwiki-platform) and also should not invalidate any other entries when they are disposed. Thus the evication strategy of the cache should work as expected, giving us a good cache hit rate. For the remaining entries I propose the following:
- Store them in a concurrent hash map for easy access similar to the Infinispan cache.
- Build a hierarchy as before. Explicitly removing an entry due to invalidation triggers the removal of all children from the hierarchy and all derived security access entries from the cache (as it is the case right now).
- Entries from the hash map are additionally removed when they are no longer reachable through parent-relationships from a security access entry in the cache. This ensures that the hash map won’t grow infinitely.
For the actual implementation of this eviction strategy based on reachability I’m open to suggestions. One possibility could be to use weak references both in the hash map and the list of children and let the garbage collector do the work. While building the hierarchy, we would ensure that we always hold a reference to all “lowest” items in the hierarchy to prevent their garbage collection while the security cache entry from which they are reachable hasn’t been inserted yet. Alternatively, we could also explicitly track which items are reachable through reference counts.
I’m proposing this here as while I believe this should have a very positive effect on the security cache in particular when it is full, I think there is one important downside: It will no longer be possible to explicitly control the total size of the cache as the hierarchical part would not be limited and not visible from the monitoring. In wikis with a large number of groups, the number of entries in this hierarchy might be significant. However, it is also currently not fully controllable. For example, a document that has no access rules is counting to the security cache size in the same way as a potential child of it that has 500 rules. Further, as the number of potential security access entries in the cache is basically the number of pages times the number of active logged in users, I believe that the number of security access entries that will still be in the Infinispan cache should vastly outnumber the size of the hierarchy that we do not keep in the cache, which is limited by the number of documents plus the number of active users and groups. Still, there might be corner-cases where this is not true.
Are there any further concerns with this proposal, alternative ideas or ideas how to improve it?