Confluence import: Broken emoticon import with new Confluence format

Hello!

While using the Confluence Migrator Application (Pro), I noticed that importing emoticons is not working as expected. The format of Confluence exports seems to have changed.
The initial implementation was tracked in CONFLUENCE-146. There were additions after the initial implementation, but the current implementation still expects a format like this:

<ac:emoticon ac:name="smile" ac:emoji-shortname=":slight_smile:" ac:emoji-id="1f642" ac:emoji-fallback="🙂" />

Source: confluence/confluence-xml/src/test/resources/confluencexml/emoji/entities.xml at d68bdf76dd8dd9cd14c6d5aa24752708433b4098 · xwiki-contrib/confluence · GitHub

But when exporting from the current Confluence cloud version, the exported format is like this:

<ac:emoticon ac:name="cross" ac:emoji-shortname=":cross_mark:" ac:emoji-id="atlassian-cross_mark" ac:emoji-fallback=":cross_mark:" />

Notice the change in ac:emoji-id and ac:emoji-fallback

For some reason, the new format is already mentioned in the original issue CONFLUENCE-146, but it is not covered by the tests of the implementation.

As seen in the code at confluence/confluence-syntax-xhtml/src/main/java/org/xwiki/contrib/confluence/parser/xhtml/internal/wikimodel/EmoticonTagHandler.java at d68bdf76dd8dd9cd14c6d5aa24752708433b4098 · xwiki-contrib/confluence · GitHub, the current implementation parses the emoticon properties in the following order: emoji-fallback, emoji-id, shortname, name. This worked for the old Confluence format, as the emoji-fallback seems to have always contained a properly formatted emoji and the emoji-id was always a Unicode code point of the emoji. In the new format, however, both were replaced by strings that are similar to the name property.

When importing into XWiki, the first format results in the text “:slightly_smiling_face:”, while the newer format results in the text “:cross_mark:” when it should result in “:x:”. This is because the fallback is evaluated first and was expected to contain the emoji as Unicode text.

There are multiple solutions to this, but I think the most robust would be to check, while parsing the fallback inside sendFallback, whether it is an emoji or a label-like string. If it is a label, sendFallback should fail so that the other methods are tried. The function sendEmojiId should work as expected, because the label should not get interpreted as a valid Unicode code point. It may be beneficial to also check inside sendEmojiId whether the parsed code point is an emoji, to be on the safe side. Besides modifying sendFallback and sendEmojiId, it would also be good to switch the parsing order to sendName, sendShortName, sendEmojiId, sendFallback, since the official documentation for the Data Center version of Confluence mentions only the name field: Confluence Storage Format | Confluence Data Center 9.3 | Atlassian Documentation
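
To make the proposed order concrete, here is a rough sketch of how I picture it. This is not the actual EmoticonTagHandler code: the class EmoticonResolver, its method names, and the two tiny lookup tables are made up for illustration, and the "starts and ends with a colon" label check is only an assumption about what the new export looks like.

```java
import java.util.Map;
import java.util.Optional;

class EmoticonResolver {
    // Hypothetical lookup tables; the real handler has its own, much larger mappings.
    private static final Map<String, String> KNOWN_NAMES = Map.of(
        "smile", "\uD83D\uDE42",   // U+1F642
        "cross", "\u274C");        // U+274C
    private static final Map<String, String> KNOWN_SHORT_NAMES = Map.of(
        ":slight_smile:", "\uD83D\uDE42",
        ":cross_mark:", "\u274C");

    // Assumption: labels in the new export look like ":cross_mark:".
    static boolean looksLikeLabel(String value) {
        return value.length() > 1 && value.startsWith(":") && value.endsWith(":");
    }

    static Optional<String> fromName(String name) {
        return Optional.ofNullable(name).map(KNOWN_NAMES::get);
    }

    static Optional<String> fromShortName(String shortName) {
        return Optional.ofNullable(shortName).map(KNOWN_SHORT_NAMES::get);
    }

    // Old format: hex code points, possibly joined by '-'. A value such as
    // "atlassian-cross_mark" fails to parse as hex and yields an empty result.
    static Optional<String> fromEmojiId(String emojiId) {
        if (emojiId == null || emojiId.isEmpty()) {
            return Optional.empty();
        }
        StringBuilder result = new StringBuilder();
        try {
            for (String part : emojiId.split("-")) {
                result.appendCodePoint(Integer.parseInt(part, 16));
            }
        } catch (IllegalArgumentException e) {
            // Covers NumberFormatException and invalid code points.
            return Optional.empty();
        }
        return Optional.of(result.toString());
    }

    // Reject the fallback when it is a label rather than emoji text.
    static Optional<String> fromFallback(String fallback) {
        if (fallback == null || fallback.isEmpty() || looksLikeLabel(fallback)) {
            return Optional.empty();
        }
        return Optional.of(fallback);
    }

    // Proposed order: name, shortname, emoji-id, fallback.
    static Optional<String> resolve(String name, String shortName, String emojiId, String fallback) {
        return fromName(name)
            .or(() -> fromShortName(shortName))
            .or(() -> fromEmojiId(emojiId))
            .or(() -> fromFallback(fallback));
    }
}
```

With this sketch, the new-format attributes from my export resolve through the name, and an emoticon whose name and shortname are unknown and whose id and fallback are labels resolves to nothing, so the caller can do something else with it.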

Incredibly nice investigation @Dev92, thanks for the detailed report

Yes, it seems like we should use ac:name first after all.

I have reported this as an issue, and I will fix it for the next release.

Thank you @rjakse for opening the issue.

I think we should also check in sendFallback, whether it is an actual emoji or a label by checking if the Unicode character is an emoji. The same can also be used for sendEmojiId. Otherwise, if sendName, sendShortName and sendEmojiId fail, it would use the fallback and this would be wrong with the new format. This case can easily happen if the name or shortName is not in the list of known names, for example when a new emoji is added to Confluence. Instead, if we check whether it is an emoji, and the new format is used, all functions would fail and the handler would fall back to a confluence_emoticon macro at confluence/confluence-syntax-xhtml/src/main/java/org/xwiki/contrib/confluence/parser/xhtml/internal/wikimodel/EmoticonTagHandler.java at d68bdf76dd8dd9cd14c6d5aa24752708433b4098 · xwiki-contrib/confluence · GitHub, which I think is the intended behavior.
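
For the "is it an actual emoji" part, something along these lines might be good enough on Java 11. It is only a heuristic I am assuming here, not anything from the migrator: the class EmojiHeuristic is hypothetical, and the check treats a string as emoji-like when every code point is in the Unicode "Other Symbol" category or is a joiner, variation selector, or skin tone modifier.

```java
class EmojiHeuristic {
    private static final int ZERO_WIDTH_JOINER = 0x200D;
    private static final int VARIATION_SELECTOR_16 = 0xFE0F;

    /** Rough guess: true if every code point looks like part of an emoji sequence. */
    static boolean looksLikeEmoji(String text) {
        if (text == null || text.isEmpty()) {
            return false;
        }
        boolean sawSymbol = false;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            i += Character.charCount(codePoint);
            if (codePoint == ZERO_WIDTH_JOINER
                || codePoint == VARIATION_SELECTOR_16
                || isSkinToneModifier(codePoint)) {
                // Glue characters are allowed but do not count as an emoji on their own.
                continue;
            }
            if (Character.getType(codePoint) != Character.OTHER_SYMBOL) {
                // Letters, punctuation such as ':', digits, and so on.
                return false;
            }
            sawSymbol = true;
        }
        return sawSymbol;
    }

    private static boolean isSkinToneModifier(int codePoint) {
        return codePoint >= 0x1F3FB && codePoint <= 0x1F3FF;
    }
}
```

The known weak spots: it accepts non-emoji symbols such as the copyright sign, it rejects keycap sequences because they start with a digit, and emoji newer than the Unicode version bundled with the running JVM are reported as unassigned and therefore rejected. For telling ":cross_mark:" apart from an actual emoji it should be sufficient, though.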

@Dev92 I just pushed a fix: your confluence export would be correctly handled.

I think we should also check in sendFallback, whether it is an actual emoji or a label by checking if the Unicode character is an emoji. The same can also be used for sendEmojiId. Otherwise, if sendName, sendShortName and sendEmojiId fail, it would use the fallback and this would be wrong with the new format

Correct and good point! I made sure of this in this commit CONFLUENCE-388: Handle new Confluence cloud ac:emoticon tags in which… · xwiki-contrib/confluence@5c1fe2e · GitHub

It seems you easily navigated in the code of the migrator, feel free to contribute if you’d like to :slight_smile:

Thank you @rjakse for implementing this fix so quickly :grinning:.

The fix looks good and should work for all known cases. It also seems that it is not so easy to check whether a string is only an emoji in Java. If it is required to check for this in the future and relying on Java 21 is no problem, then the new Java 21 functions described in https://www.baeldung.com/java-21-improved-emoji-support could help.

I just haven’t had the time to set up a working development environment for this, but I will consider contributing in the future if possible.

the new Java 21 functions described in https://www.baeldung.com/java-21-improved-emoji-support could help.

Ah nice! Although Java 21 in this code base won’t happen anytime soon I’m afraid, so yeah, I’m kinda hoping that we can rely on the fallback starting with ‘:’ in the meantime…

Of course, it might not be a very safe bet, given how they changed their export format in the first place xD

My time is quite limited these days so I don’t have the time to dig into this more, but if you find a way to check this more reliably in Java 11, I’ll be happy to commit it.

Thanks again for the wonderful report!

It’s always possible to use a Java 21 API through reflection, at worst. But in any case, the code still needs a good enough Java 11 alternative when not running on Java 21+.
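
A minimal sketch of what the reflection route could look like, assuming the Java 21 method in question is Character.isEmoji(int). The class EmojiCheck and its fallback are hypothetical, and the fallback (Unicode category "Other Symbol") is only a stand-in for whatever Java 11 alternative ends up being good enough.

```java
import java.lang.reflect.Method;
import java.util.function.IntPredicate;

class EmojiCheck {
    /** Resolved once: Character.isEmoji(int) on Java 21+, a heuristic otherwise. */
    static final IntPredicate IS_EMOJI = lookup();

    private static IntPredicate lookup() {
        try {
            // Present on Java 21+ only; older runtimes throw NoSuchMethodException.
            Method isEmoji = Character.class.getMethod("isEmoji", int.class);
            return codePoint -> {
                try {
                    return (Boolean) isEmoji.invoke(null, codePoint);
                } catch (ReflectiveOperationException e) {
                    return fallback(codePoint);
                }
            };
        } catch (NoSuchMethodException e) {
            return EmojiCheck::fallback;
        }
    }

    // Placeholder Java 11 alternative (assumption): treat "Other Symbol" as emoji.
    static boolean fallback(int codePoint) {
        return Character.getType(codePoint) == Character.OTHER_SYMBOL;
    }
}
```

Callers would use EmojiCheck.IS_EMOJI.test(codePoint) without caring which runtime they are on. Note that the two paths do not agree on every code point (for example, Java 21 reports digits and '#' as emoji because of keycap sequences), so the result should be treated as a hint rather than a guarantee.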