Do not store the full hierarchy of the security cache in Infinispan

XWiki has a security cache that speeds up access control checks. For this, it caches the security rules of documents, the hierarchy of users and groups, and the actual access entries for a specific user and reference in an Infinispan cache. The actual access entries are what should be used most frequently. The whole cache is organized as a tree-like parent/child structure (a directed acyclic graph, to be precise): users have their groups as parents, documents are nested according to their hierarchy in the wiki, and an access entry has the document and the user as parents. The primary goal of this is to be able to invalidate everything that depends on, e.g., a group membership of a user when that membership changes. For this purpose, when an item is removed from the cache, everything that depends on it is removed, too. The hierarchical information is also used to quickly compute access entries when they are missing.
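
To make this a bit more concrete, here is a minimal, purely illustrative sketch of the parent/child structure and the cascading invalidation; the names are made up and this is not the actual XWiki implementation:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: hypothetical names, not the real XWiki security cache classes.
class CacheNode {
    final String key;                               // e.g. a document, user, group or access entry
    final Set<CacheNode> parents = new HashSet<>();
    final Set<CacheNode> children = new HashSet<>();

    CacheNode(String key, CacheNode... parents) {
        this.key = key;
        for (CacheNode parent : parents) {
            this.parents.add(parent);
            parent.children.add(this);
        }
    }

    // Invalidation cascades downwards: dropping a group node also drops every
    // access entry that was computed for one of its members.
    void invalidate(Map<String, CacheNode> cache) {
        cache.remove(key);
        for (CacheNode child : children) {
            child.invalidate(cache);
        }
    }
}
```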

Now, unfortunately, there seem to be problems when this security cache becomes full. Basically, entries at higher levels of the hierarchy are evicted while we are trying to insert items that depend on them, and the insertion then fails. At the moment, this can lead to errors when it happens repeatedly, as documented in XWIKI-18508. We can work around this by not relying on the cache insertion to succeed, but this might lead to bad performance.

Infinispan doesn’t evict entries randomly but uses an algorithm named TinyLFU. For this, the cache maintains an approximation of the access frequency of recently accessed objects. Based on this approximation, it can also refuse to accept a new entry if the candidate that would need to be evicted has been accessed more frequently. This algorithm doesn’t take our hierarchy into account, and I think this can lead to really bad decisions. When everything is working fine and the “first caching level”, which consists of the actual access entries, always gives a cache hit, we never access the items that are higher in the hierarchy. This can mean that the cache decides to evict these higher-level items and refuses their re-insertion as long as there are still other entries in the cache that have been accessed more frequently. It could therefore happen that, for example, a whole sub-wiki cannot be cached anymore for some time, until we have tried to re-insert these higher-level items so often that they are considered important again.

I’m therefore proposing to change how we use the (internal) Infinispan cache. My proposal is to use it only to store security access entries, but not the document rules/user groups entries. Security access entries are never directly externally invalidated (even though we have an API for that, there are no uses of that API in xwiki-platform) and also should not invalidate any other entries when they are disposed. Thus, the eviction strategy of the cache should work as expected, giving us a good cache hit rate. For the remaining entries I propose the following (a rough sketch follows the list):

  • Store them in a concurrent hash map for easy access, similar to the Infinispan cache.
  • Build a hierarchy as before. Explicitly removing an entry due to invalidation triggers the removal of all children from the hierarchy and of all derived security access entries from the cache (as is the case right now).
  • Additionally remove entries from the hash map when they are no longer reachable through parent relationships from a security access entry in the cache. This ensures that the hash map won’t grow without bound.
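
A rough sketch of what this split could look like, with made-up names; the plain map below only stands in for the Infinispan cache that holds the access entries:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the proposed split; all names are made up.
class SplitSecurityCache {
    static class HierarchyNode {
        final String key;
        final List<HierarchyNode> children = new ArrayList<>();
        final List<String> derivedAccessEntryKeys = new ArrayList<>();

        HierarchyNode(String key) { this.key = key; }
    }

    private final Map<String, Object> boundedCache = new ConcurrentHashMap<>(); // placeholder for Infinispan
    private final Map<String, HierarchyNode> hierarchy = new ConcurrentHashMap<>();

    // Explicit invalidation (document/group change): walk the hierarchy and drop
    // both the hierarchy nodes and the access entries derived from them.
    void invalidate(String key) {
        HierarchyNode node = hierarchy.remove(key);
        if (node == null) {
            return;
        }
        node.derivedAccessEntryKeys.forEach(boundedCache::remove);
        node.children.forEach(child -> invalidate(child.key));
    }
}
```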

For the actual implementation of this reachability-based eviction strategy I’m open to suggestions. One possibility could be to use weak references both in the hash map and in the lists of children and let the garbage collector do the work. While building the hierarchy, we would ensure that we always hold a strong reference to all “lowest” items in the hierarchy to prevent their garbage collection while the security cache entry from which they are reachable hasn’t been inserted yet. Alternatively, we could explicitly track which items are reachable through reference counting.
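
A minimal sketch of how the references could be oriented in the weak-reference variant (the names are again illustrative, not the real classes):

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the weak-reference variant. The only strong references
// to a hierarchy node are the parent links held by cached access entries (and
// temporary references held while a branch is being loaded), so a node becomes
// collectible exactly when no cached access entry can reach it anymore.
class WeakHierarchy {
    static class Node {
        final List<Node> parents;                                     // strong: keep ancestors alive
        final List<WeakReference<Node>> children = new ArrayList<>(); // weak: don't keep descendants alive

        Node(List<Node> parents) {
            this.parents = parents;
        }
    }

    // Weak values: the map itself does not keep nodes alive.
    private final Map<String, WeakReference<Node>> nodes = new ConcurrentHashMap<>();
}
```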

I’m proposing this here because, while I believe this should have a very positive effect on the security cache, in particular when it is full, I think there is one important downside: it will no longer be possible to explicitly control the total size of the cache, as the hierarchical part would be neither limited nor visible in the monitoring. In wikis with a large number of groups, the number of entries in this hierarchy might be significant. However, the size is also not fully controllable right now: for example, a document that has no access rules counts towards the security cache size in the same way as a potential child of it that has 500 rules. Further, as the number of potential security access entries in the cache is basically the number of pages times the number of active logged-in users, I believe that the security access entries that remain in the Infinispan cache should vastly outnumber the hierarchy that we no longer keep in the cache, which is limited by the number of documents plus the number of active users and groups. Still, there might be corner cases where this is not true.

Are there any further concerns with this proposal, alternative ideas or ideas how to improve it?

I’ve just noticed that instead of a concurrent hash map, we could also use a cache without a size limit to store the hierarchy. The question of how to evict entries remains, but Caffeine, the library underlying Infinispan, supports weak references to values, which might enable exactly what I have in mind, just with a proper cache of unlimited size instead of a concurrent hash map, and would probably also give us JMX monitoring. With a quick search, I couldn’t find whether Infinispan exposes this feature of Caffeine, but I could imagine that it does not, as it seems quite incompatible with a distributed cache. This would be an argument in favor of implementing XCOMMONS-1596.
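
In Caffeine terms, what I have in mind would be roughly the following, assuming we can reach the Caffeine builder at all; the value type is just a placeholder:

```java
import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;

// Sketch only: assumes direct access to Caffeine (e.g. via XCOMMONS-1596).
class WeakValueHierarchyCache {
    private final Cache<String, Object> hierarchy = Caffeine.newBuilder()
        .weakValues() // no maximumSize(): entries only disappear when the GC collects the value
        .build();
}
```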

While thinking some more about this, I noticed that it might still be a good idea to additionally store the entries that are part of the hierarchy in the cache, too. This ensures that users or documents that are frequently accessed, but for which the resulting access entries are not stored in the cache, are still cached and not discarded just because no access entry references them. I’ll briefly reiterate the revised version of my proposal:

  • Make children in security cache entries weak references. We might need to check if we can automatically remove disposed entries as discussed below (this seems possible in general, we just need to check if there is some data structure that is not too complex and provides this for us).
  • Store all entries of the security cache that are not access entries also in a (concurrent) hash map as weak references. This should be done in a way similar to https://github.com/hzulla/WeakValueHashMap such that values are automatically removed when they are garbage collected. We might also be able to use a Caffeine cache here that supports this behavior.
  • When a value is requested that is not an access entry, first check the cache, then this hash map/second cache (see the sketch after this list). If the value is found in the map, insert it into the cache again (this might fail, though) and return it.
  • Make it possible for callers of the security cache to get a strong reference to a security cache entry, so that callers can prevent entries from being evicted from this map/second cache while they are loading a hierarchy of entries. I haven’t looked at how to implement this. I think we should not hand out actual references to the security cache entries, but we might hand out a class that holds a private reference (without any way to access it). Alternatively, there could also be a way to create some kind of “session” into which entries can be collected; those entries won’t be discarded as long as the session is valid.
  • Do not remove children from the cache when an entry is evicted from the cache. Instead, the remove operation that is used for cache invalidation due to document/group changes needs to explicitly traverse the tree to remove all entries from the cache, explicitly including the map/second cache. Note that we already explicitly pause cache invalidations while loading values for a security access entry into the cache, so this does not conflict with the guarantees given above.
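
The lookup for non-access entries could then look roughly like this (made-up names, not the real fields):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the lookup order for non-access entries. "boundedCache"
// stands for the Infinispan cache and "weakHierarchy" for the weak-value
// map/second cache; both are placeholders.
class TwoTierLookup {
    private final Map<String, Object> boundedCache = new ConcurrentHashMap<>();
    private final Map<String, Object> weakHierarchy = new ConcurrentHashMap<>();

    Object getNonAccessEntry(String key) {
        Object entry = boundedCache.get(key);
        if (entry == null) {
            entry = weakHierarchy.get(key);
            if (entry != null) {
                // Best effort: the real, size-bounded cache may refuse the insertion,
                // which is acceptable because the weak map still holds the value.
                boundedCache.putIfAbsent(key, entry);
            }
        }
        return entry;
    }
}
```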

As we are already using read and write locks on the security cache, there is no real need at the moment to make any of these write operations thread-safe.

Just for documentation as I’ve found this again:

We can use ReferenceMap from Apache Commons Collections as a weak-value map for storing the hierarchy of entries; it should support everything we need. Iirc we don’t really need a thread-safe map as we already have read/write locks for cache accesses, and other operations aren’t thread-safe, either.
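
For reference, the construction would be along these lines (the value type is again just a placeholder):

```java
import java.util.Map;

import org.apache.commons.collections4.map.AbstractReferenceMap.ReferenceStrength;
import org.apache.commons.collections4.map.ReferenceMap;

// Hard keys, weak values: an entry disappears once its value is no longer
// strongly reachable from anywhere else. ReferenceMap is not thread-safe,
// which matches the assumption that all access goes through the existing locks.
class HierarchyStore {
    private final Map<String, Object> hierarchy =
        new ReferenceMap<>(ReferenceStrength.HARD, ReferenceStrength.WEAK);
}
```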