Switching to HTML5

MichaelHamann · January 4, 2022, 10:47am

As already hinted in the roadmap, I’m currently working on switching the WYSIWYG editing process, which still uses XHTML 1.0, to HTML5. The main motivation is to support the <figure>/<figcaption> tags so that CKEditor can handle figure captions natively, but there are also other reasons:

The Flamingo Skin already uses HTML5, so page content is rendered as HTML5 by default, which means we currently have a discrepancy between view and edit mode.
In lots of places, we already use HTML5 data-* attributes that are not valid in XHTML 1.0.

There are a couple of places where we still use XHTML 1.0. In particular, UI extensions are frequently still rendered as XHTML 1.0, as shown in the tutorial, and the HTML macro currently cleans the HTML to be valid XHTML 1.0, see XRENDERING-509.

My proposal is to basically deprecate XHTML 1.0 and completely switch to HTML5. This will be a gradual switch and, at least for now, there will be only a few differences between XHTML 1.0 and HTML5; in particular, we will still try to produce HTML5 that is valid XML. The main difference for XWiki rendering is that there is no <tt> tag anymore for monospace and verbatim content, and figures are rendered using the <figure> and <figcaption> tags.

In particular, this involves the following:

Add an option to HTMLCleaner to clean using HTML5. My proposal is to leave the default at HTML 4 in order not to break existing usages.
Introduce an HTML5 parser. For now, this parser will mainly handle the parts of HTML5 that are compatible with XWiki syntax (e.g., no links wrapping flow content).
Change the WYSIWYG script service and the HTML converter it uses to use HTML5. This will be a breaking change in the sense that all existing methods will expect/produce HTML5, and proper CKEditor support will require a new version of CKEditor. As we pass the HTML through HTMLCleaner, the script service will be very forgiving, though, if, e.g., the input still contains the <tt> tag. I propose to add a property to the script service that reports the HTML version so that, for example, CKEditor can detect whether it is running on the new version or not.
Change CKEditor to support HTML5. This can be backwards-compatible or we could also release a version 2.0 of CKEditor that is no longer compatible with XWiki versions before that change.
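To make the version-detection idea concrete, here is a minimal sketch of how an editor integration could pick its markup based on a version property reported by the script service. The property name `html_version` and the returned labels are assumptions made for illustration; the real integration would be JavaScript and the real property name is not decided in this thread.

```python
from types import SimpleNamespace


def editor_markup(service):
    """Pick the markup the editor should exchange, based on what the server reports."""
    # `html_version` is a hypothetical property; servers predating the switch lack it.
    version = getattr(service, "html_version", None)
    if version is not None and version >= 5:
        return "html5"
    # No property (or an older value): fall back to the current behaviour.
    return "xhtml1.0"


# Stand-ins for an old and a new script service, for illustration only.
old_server = SimpleNamespace()
new_server = SimpleNamespace(html_version=5)
```

With this kind of check a single editor release could serve both old and new XWiki versions, at the cost of keeping the fallback branch around.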

Are there any objections, in particular to the changes to the WYSIWYG script service, which are not backwards-compatible? Is there any code besides CKEditor using this script service in a way that would be affected by the change to HTML5?

There is another topic with respect to cleaning HTML5: In HTML5 a lot of things that were invalid in XHTML 1.0 are valid. HTML5 distinguishes between flow content and phrasing content, where the latter is a subset of the former. In paragraphs, only phrasing content can be used but as phrasing content including plain text is also flow content, there is no need to wrap plain text in a paragraph - <body>Hello World!</body> is perfectly valid. Further, we have now even more tags that are basically like <div>: Elements like <figure> and <figcaption> allow flow content as children, i.e., can contain lists, paragraphs etc. but also simply plain text. My suggestion is the following:

In HTMLCleaner, wrap all phrasing content that is directly below the <body> tag in a paragraph, similar to the existing BodyFilter, even though this is not necessary. In this context, treat elements like <a>, <ins> and <del>, which count as phrasing content only when they contain phrasing content, as phrasing content. [edit] I’ve just noticed that we already allow <ins> and <del> directly below the <body> tag, so this probably shouldn’t be changed.
In WikiModel, treat tags like <figure> and <figcaption> as “document” (similar to <div>), i.e., like an embedded document where inline content will always be wrapped in a paragraph and all content is allowed.

In particular, this means that <figure><img ... /><figcaption>Caption</figcaption></figure> will be parsed as <figure><p><img ... /></p><figcaption><p>Caption</p></figcaption></figure>. Note that the former HTML code is what CKEditor’s native caption support would produce and (probably) also needs as input. The new figure captions support would thus render the first version without paragraphs but accept the latter version as input.
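The wrapping rule described above can be sketched as follows. This is only an illustration in Python on top of `xml.etree.ElementTree`, not WikiModel code: the tag sets `DOCUMENT_LIKE` and `PHRASING` are abbreviated assumptions, and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET

# Abbreviated, hypothetical tag sets for illustration (not WikiModel's real lists).
DOCUMENT_LIKE = {"div", "figure", "figcaption"}
PHRASING = {"a", "code", "em", "img", "span", "strong"}


def _append_text(para, text):
    """Append text after the last child of para, or as its leading text."""
    if len(para):
        para[-1].tail = (para[-1].tail or "") + text
    else:
        para.text = (para.text or "") + text


def wrap_inline(elem):
    """Wrap runs of text and phrasing children of document-like elements in <p>."""
    for child in list(elem):
        wrap_inline(child)
    if elem.tag not in DOCUMENT_LIKE:
        return

    new_children = []
    para = None  # paragraph currently collecting inline content, if any

    def current_para():
        nonlocal para
        if para is None:
            para = ET.Element("p")
            new_children.append(para)
        return para

    if elem.text and elem.text.strip():
        _append_text(current_para(), elem.text)
    elem.text = None

    for child in list(elem):
        elem.remove(child)
        tail, child.tail = child.tail, None
        if child.tag in PHRASING:
            current_para().append(child)
        else:
            para = None  # a block-level child ends the current paragraph
            new_children.append(child)
        # Keep whitespace only when it sits inside an open paragraph.
        if tail and (para is not None or tail.strip()):
            _append_text(current_para(), tail)

    elem.extend(new_children)


def wrap_document_like(xhtml):
    """Parse well-formed XHTML, apply the wrapping, and serialize it again."""
    root = ET.fromstring(xhtml)
    wrap_inline(root)
    return ET.tostring(root, encoding="unicode")
```

Feeding it the CKEditor-style `<figure><img src="a.png" /><figcaption>Caption</figcaption></figure>` yields the variant with paragraphs, and feeding it that variant again leaves it unchanged, which matches the idea of accepting both forms as input.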

vmassol · January 4, 2022, 1:06pm

I think you mean XHTML5 since the goal of HTML Cleaner is to generate valid XML (see http://htmlcleaner.sourceforge.net/ where it says: For any given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML). And the default is XHTML 1.0 AFAIK.

How do you implement this config? I don’t see any existing parameter to control this on the HtmlCleaner project home page. Does it mean asking Scott to introduce one?

By “CKEditor” I guess you mean the “XWiki CKEditor Integration” extension (https://extensions.xwiki.org/xwiki/bin/view/Extension/CKEditor%20Integration/) and not CKEditor itself (whose latest version is 5). A version 2.0 sounds good to me and would reduce the number of backward-compatibility “IF”s we have to maintain (we’ll probably be able to remove quite a few).

vmassol · January 4, 2022, 1:10pm

No objection from me. It could also be the time to integrate CKEditor into XWiki Platform again so that XS does not depend on Contrib. But that’s another topic.

MichaelHamann · January 4, 2022, 2:24pm

The option is about the input format, which is either HTML4 or HTML5, not the output format. The output should be well-formed XML and thus XHTML5 when the input is HTML5.

There is a method setHtmlVersion in CleanerProperties in HtmlCleaner. There you can set either 4 or 5 as the HTML version, which changes which tags are allowed in which contexts. My idea was to simply pass this HTML version as a new htmlVersion parameter in the parameters of the HTMLCleanerConfiguration.
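As a rough sketch of how such a parameter could be read, the snippet below resolves an `htmlVersion` entry from a string-to-string parameter map and falls back to 4 when it is absent. It is written in Python for brevity and does not call HtmlCleaner or XWiki; the function name, the constants and the error message are assumptions for illustration.

```python
SUPPORTED_HTML_VERSIONS = (4, 5)
DEFAULT_HTML_VERSION = 4  # keeps existing callers on their current behaviour


def resolve_html_version(parameters):
    """Read the hypothetical `htmlVersion` entry from a cleaner parameter map."""
    raw = parameters.get("htmlVersion")
    if raw is None:
        return DEFAULT_HTML_VERSION
    try:
        version = int(raw)
    except ValueError:
        raise ValueError(f"htmlVersion must be 4 or 5, got {raw!r}") from None
    if version not in SUPPORTED_HTML_VERSIONS:
        raise ValueError(f"htmlVersion must be 4 or 5, got {raw!r}")
    return version
```

The resolved value is what would then be handed to the cleaner properties before cleaning; rejecting anything other than 4 or 5 early avoids silently cleaning with the wrong tag rules.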

Yes, that’s what I mean.

vmassol · January 4, 2022, 2:31pm

Ok, that’s great then. I guess they forgot to update their website doc.

mflorea · January 4, 2022, 2:53pm

I’d rather make it backwards compatible by checking if the XWiki instance supports HTML5 parsing. Maintaining 2 branches has its own cost.

I’m not aware of any other code using that script service. It was designed to be used by any WYSIWYG editor integration, but we have only one integration ATM (CKEditor). The old GWT-based editor is unmaintained and not used.

I’m OK with the change but we should document it in the release notes (as a breaking change, even if the script service API doesn’t change).

Thanks,
Marius

tmortagne · January 5, 2022, 8:44am

I’m not a fan of a version 2 either, but not for the same reason: I would suggest instead moving CKEditor to xwiki-platform (which is actually something we discussed and more or less agreed on a long time ago). But ultimately my preference goes to whatever @mflorea prefers, since he is the one who does most of the maintenance on it.

vmassol · January 25, 2022, 1:04pm

How will that work since we’ll now be generating HTML5 tags with <figure> elements in XS? Any extension that uses the HTML macro (or HTMLCleaner) with the default value on content generated by XS will fail (<figure> will be stripped).

Thx

MichaelHamann · January 25, 2022, 1:25pm

We can fix the HTML macro, that’s what XRENDERING-509 is about. There are other places, though, like the HTML diff that would need to be fixed, too.

I’m not against switching to HTML 5 as the default; keeping the default at HTML 4 just seemed the safest option for now. The main breakage I expect is that it transforms <tt> into <span class="monospace">. Apart from that, it should primarily allow more elements and transform fewer elements (<i>, <u> and some other elements are no longer deprecated).
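For readers who want to see what that transformation amounts to, here is a small sketch that rewrites <tt> elements into <span class="monospace"> on well-formed input. It uses Python's `xml.etree.ElementTree` purely as an illustration and is not how HTMLCleaner implements it; the function name is made up, and keeping any existing class values is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET


def replace_tt(xhtml):
    """Rewrite every <tt> element as <span class="monospace">."""
    root = ET.fromstring(xhtml)
    # Collect matches first so renaming does not interfere with iteration.
    for elem in list(root.iter("tt")):
        elem.tag = "span"
        classes = elem.get("class", "").split()
        if "monospace" not in classes:
            classes.append("monospace")  # assumption: existing classes are kept
        elem.set("class", " ".join(classes))
    return ET.tostring(root, encoding="unicode")
```

A parser that only knows <tt> would no longer recognize the result as monospace content, which is the breakage the options below try to address.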

I see a couple of solutions for handling these breakages:

Implement parsing support for <span class="monospace"> in the xhtml/1.0 parser. Note that tags like <i> are already supported and handled like the tags we previously converted them to so I do not expect breakages there.
Switch HTMLCleaner to HTML 4 at least in the html/4.01 parser.
Extend the HTMLCleaner configuration not to transform <tt> tags. This is a bit ugly but possible using public APIs of HTMLCleaner. We are doing something similar at the moment to work around a bug with SVG handling. The <tt> tag is marked as obsolete in HTML 5 and must not be used by authors, but my proposed XHTML 5 parser would still inherit support for it, so parsing does not require us to remove it in HTMLCleaner.