Squash auto-saved document revisions?

mflorea · March 17, 2025, 11:34am

Hi everyone,

When editing in realtime, auto-save can create a lot of document revisions (with partial changes). This makes it harder to work with the document history. There is pagination, but (until we move to Live Data) you can’t filter / search for a specific revision, so you have to go through all the auto-saved revisions. To overcome this we could implement a way to group (collapse) revisions coming from realtime editing, BUT:

- it doesn’t seem easy to do
- and I’m not convinced that all those auto-saved revisions have long-term value (especially when the auto-save interval is very short)

With this in mind, I’m wondering if it’s not better to simply squash the auto-saved revisions, when possible. The realtime editing session could keep track of (and synchronize)

- the initial revision (when the editing session starts)
- and all the revisions authored from that editing session

With this we can determine if a squash is possible or not. As long as there is no revision created outside the editing session we should be able to squash. If the document is saved outside the editing session we could still do a partial squash, by squashing only the auto-saved revisions since that “outside” revision.
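The check described above can be sketched as follows. This is only an illustration under assumptions: `Revision`, `from_session`, `squashable_revisions` and `squash_kind` are hypothetical names, not XWiki APIs, and `history` is assumed to hold the revisions created after the session's initial revision, oldest first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Revision:
    version: str
    from_session: bool  # True if saved from the realtime editing session


def squashable_revisions(history):
    """Return the trailing run of session revisions, oldest first."""
    run = []
    for revision in reversed(history):
        if not revision.from_session:
            break  # an "outside" save ends the squashable range
        run.append(revision)
    run.reverse()
    return run


def squash_kind(history):
    """Classify the possible squash as "full", "partial" or "none"."""
    run = squashable_revisions(history)
    if not run:
        return "none"
    return "full" if len(run) == len(history) else "partial"
```

With a history of session saves 2.1, an outside save 2.2, then session saves 2.3 and 2.4, only 2.3 and 2.4 are returned and the squash is classified as partial.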

But what do I mean by squash? I’m thinking we could:

- collect the contributors from all the revisions to be squashed
- determine the contributor with the least script access right among those collected; if there are several, prefer the one from the most recent revision; if there is still a tie, choose at random?
- collect the change summaries from all the revisions to be squashed (ignoring duplicates)
- delete the revisions to be squashed
- create a new revision with:
  - original author: the user that triggered the squash
  - effective author: the contributor with the least script access right
  - collected contributors (see “Change attribution when editing in realtime”)
  - content submitted by the save request
  - aggregated change summary
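To make the steps above concrete, here is a minimal sketch that only computes the metadata of the squashed revision (it does not delete or store anything). Everything in it is an assumption for illustration: `Revision`, `plan_squash` and the `script_level` mapping (user name to a number, lower meaning less script access) are hypothetical, the summaries are joined with "; ", and the final tie is broken by user name rather than at random so that the result is reproducible.

```python
from dataclasses import dataclass


@dataclass
class Revision:
    contributors: list  # user names (hypothetical representation)
    summary: str


def plan_squash(revisions, script_level, triggered_by, content):
    """Compute the squashed revision's metadata; `revisions` is oldest first."""
    contributors = []  # unique contributors, in order of first appearance
    last_seen = {}     # user -> index of the most recent revision they appear in
    summaries = []     # unique, non-empty change summaries, in order
    for index, revision in enumerate(revisions):
        for user in revision.contributors:
            if user not in last_seen:
                contributors.append(user)
            last_seen[user] = index
        if revision.summary and revision.summary not in summaries:
            summaries.append(revision.summary)
    # Least script access first, then most recent revision, then name order.
    effective = min(
        contributors,
        key=lambda user: (script_level[user], -last_seen[user], user),
    )
    return {
        "original_author": triggered_by,
        "effective_author": effective,
        "contributors": contributors,
        "content": content,
        "summary": "; ".join(summaries),
    }
```

For example, if bob and carol both have the lowest level but carol appears in a more recent revision, carol becomes the effective author.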

On the UI side, I imagine there could be 3 ways a user could leave the editing session:

- without triggering a squash, simply returning to view mode; this would be the default behavior when there are other users remaining in the editing session
- with a silent / automatic squash (no question asked), when you are the last one leaving the editing session, if a squash is possible, and it makes sense (i.e. there are at least 2 auto-saved revisions to squash); this behavior could be turned off with a configuration
- with an explicit squash (again when possible), where the user can modify the aggregated change summary and choose between a minor / major version
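The three ways of leaving could be decided roughly like this. It is a sketch of one possible reading, not a specification: `LeaveMode`, `leave_mode` and its parameters are hypothetical names, and it assumes that an explicit squash is allowed as soon as at least one revision can be squashed, even if other users remain in the session.

```python
from enum import Enum


class LeaveMode(Enum):
    NO_SQUASH = "no squash"
    SILENT_SQUASH = "silent squash"
    EXPLICIT_SQUASH = "explicit squash"


def leave_mode(others_remaining, squashable_count,
               wants_explicit=False, auto_squash_enabled=True):
    """Pick how the user leaves the realtime editing session."""
    # Assumption: an explicit squash only needs something to squash.
    if wants_explicit and squashable_count >= 1:
        return LeaveMode.EXPLICIT_SQUASH
    # Silent squash: last user leaving, feature enabled, at least 2 revisions.
    if not others_remaining and auto_squash_enabled and squashable_count >= 2:
        return LeaveMode.SILENT_SQUASH
    return LeaveMode.NO_SQUASH
```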

WDYT? Do you see any problem with deleting document revisions (from the top of the document history) and replacing them with an “aggregated” (squashed) revision? Are there any security implications I’ve missed?

Thanks,
Marius

tmortagne · March 17, 2025, 2:17pm

How long after those versions were created are we talking about?

Deleting a version just seconds (or more depending on the load/speed) after it’s been created might create quite a mess (or at the very least big stack traces in the log) for asynchronous tasks for which the version is important (indexing tasks like the mentions, cluster event dispatching, replication, etc.).

Not to mention reusing a version number that was just created and deleted.

Note that deleting the old versions before storing the new one probably carries a significant risk of data loss.

mleduc · March 17, 2025, 2:58pm

Why not just store “a lot” of minor revisions and allow annotating them with metadata? Would it really create storage issues?

We could instead improve the UI to be able to:

- display only relevant revisions
- take it as an opportunity to paginate the history

Migrating to Live Data seems much cheaper than addressing the technical complexity that comes with squashing.

mflorea · March 17, 2025, 5:28pm

The user could decide to do this at any moment during the realtime editing session, even seconds after an auto-save is triggered. But the same can be done currently from the document history UI. You’re basically saying that deleting (recent) document revisions from the document history UI can break some XWiki modules. In this case we should:

- either bulletproof those modules
- or prevent / disable the delete action for a document revision until it has been analyzed asynchronously

But I understand that deleting revisions is not as simple as I thought.

Indeed. I wanted to avoid having holes in the history.

More revisions certainly require more disk space, but I’m not sure if it’s significant when compared with attachments for instance. My concern is not the disk space but the usability of the history UI. With hundreds or thousands of revisions it will be harder for the user to find relevant revisions. The auto-saved revisions create a lot of noise.

Migrating to Live Data would certainly improve the History UI, but I’m not sure if it’s enough. We need a way to remove the noise produced by the auto-saved revisions, ideally by grouping them under a “meta” realtime editing revision. This means:

- adding support for annotating / tagging the revisions as you said, e.g. to indicate which revisions were auto-saved from a realtime editing session
- designing a UI that groups revisions based on the metadata (allowing you to expand / collapse the group)

For the first item, the complex part is to define precisely what kind of metadata can be added to a revision (so far we have only the realtime use case, but we need to make it generic).

For the second item, I’m not sure what kind of UI would work, and I’m afraid the UI might become too complex.
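As a rough idea of what the grouping could look like on the data side (the UI itself is the open question), here is a sketch that assumes each revision carries an optional session identifier as metadata. The tuple layout, `group_history` and the session ids are all hypothetical.

```python
from itertools import groupby


def group_history(revisions):
    """Turn (version, session_id) pairs, in history order, into display rows.

    Consecutive revisions with the same session id collapse into one
    "group" row; revisions without a session id stay as "single" rows.
    """
    rows = []
    for session_id, run in groupby(revisions, key=lambda item: item[1]):
        versions = [version for version, _ in run]
        if session_id is None:
            rows.extend(("single", None, [version]) for version in versions)
        else:
            rows.append(("group", session_id, versions))
    return rows
```

A "group" row would be rendered collapsed by default, with its versions shown on expand.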

Thanks,
Marius

Simpel · March 17, 2025, 6:42pm

I could imagine doing the squash by first saving the squashed content as a new revision and then deleting all the minor versions in between. I wouldn’t mind some missing minor version numbers, as long as I can see in the history that it is a squash save.
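A toy model of that ordering, assuming an in-memory history where version numbers are never reused. `InMemoryHistory` and `squash` are hypothetical names for illustration only, not XWiki APIs.

```python
class InMemoryHistory:
    """Toy revision store: version number -> change summary."""

    def __init__(self):
        self.revisions = {}
        self.next_version = 1  # only ever increases, so numbers are not reused

    def save(self, summary):
        version = self.next_version
        self.next_version += 1
        self.revisions[version] = summary
        return version

    def delete(self, version):
        del self.revisions[version]


def squash(history, versions, summary):
    """Save the squashed revision first, then delete the squashed ones."""
    squashed = history.save(f"Squash of {len(versions)} revisions: {summary}")
    # If the save above raises, nothing has been deleted yet.
    for version in versions:
        history.delete(version)
    return squashed
```

After one manual save (1) and three auto-saves (2, 3, 4), squashing the auto-saves leaves versions 1 and 5 in the history, with a visible gap where the auto-saves were.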