New contrib repository + design feedback request - Translator Extension

Hello everyone,

I’m currently developing a Translator Extension that will let users translate content with an external machine translation service such as DeepL and store the output in the wiki. I have created a feature design page for it (with a note on how it will obsolete the existing DeepL extension). It’d be great to have your take on it, along with any suggestions.

The extension is sponsored both by the European Data News Hub, which features news articles from several European news agencies in nine languages, and by the International Identifier for serials (ISSN), an intergovernmental organization that manages the identification and description of serial publications.

While I’m still working towards consensus and gathering improvement ideas, I need to release an initial version as soon as possible. May I therefore already ask for a contrib repository and a Jira project? The extension name can be “translator” if that works for you.

Stéphane

That sounds like a great addition to XWiki! I think translator sounds like a good project name.

Do you plan to offer translator integrations beyond DeepL? I’m thinking in particular about services like LibreTranslate, an open source, self-hostable machine translation API.

How do you plan to preserve XWiki syntax during the translation? I did some experiments with LibreTranslate, and it basically destroyed a lot of syntax, randomly losing or duplicating syntax elements and translating parameter names like “width” or “height”, giving results like [[Bild:CollectionsOverview.png||alt="Collections Übersicht, zeigt zwei Kollektionen in einer Tabelle" Höhe="421" Breite="960"] (the second ] was indeed lost). I just did a similar test with DeepL, and it seems better, but it still replaces quotes, leading to syntax like [[image:CollectionsOverview.png||alt=„Collections overview, showing two collections in a table“ height=„421“ width=„960“]]. LibreTranslate has some support for HTML, but in a test it lost HTML comments, which would make it a bit useless for our annotated HTML. It could still be an interesting solution, though: if necessary, we could simply improve LibreTranslate, taking advantage of the fact that it is open source. In any case, I have the impression that properly preserving formatting is a major challenge.
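To make this kind of damage easier to spot automatically, here is a rough sketch of a check that compares the `[[...]]` elements before and after translation. It is only an illustration, not part of any existing extension: the function names are made up, and it is deliberately simplistic in treating everything before `||` as a reference that must stay unchanged (so it would also flag a link whose label was legitimately translated).

```python
import re

# Hypothetical helpers for illustration only.
LINK_RE = re.compile(r"\[\[(.*?)\]\]", re.DOTALL)   # complete [[...]] elements
PARAM_RE = re.compile(r'(\w+)="[^"]*"')             # name="value" with straight quotes


def syntax_signature(text):
    """Return (reference, parameter names) for every complete [[...]] element."""
    signature = []
    for body in LINK_RE.findall(text):
        reference, _, params = body.partition("||")
        signature.append((reference, tuple(PARAM_RE.findall(params))))
    return signature


def syntax_preserved(source, translated):
    """True if references and parameter names survived the translation."""
    return syntax_signature(source) == syntax_signature(translated)
```

With the two outputs above, the LibreTranslate result fails because the element is no longer closed and the parameter names changed, and the DeepL result fails because the typographic quotes no longer form valid parameters. A translation that only changes the text inside `alt="..."` passes.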

I also tried translating with LLMs, and what I found is that they are much better at preserving XWiki syntax and recognizing which parts could/should be translated, but the translation is sometimes less direct, and LLMs tend to invent content, in particular when there is, e.g., an unfinished sentence in the original content. I had the impression, though, that an LLM-based translation could be a great input for human review. Further, it is relatively straightforward to give an LLM further instructions about, e.g., the style of the translation, how to translate certain terms, etc. That said, using LLMs comes with a heavy performance/computational resource cost, and it is also not straightforward to deal with the context limit: even with generous input limits, many LLM providers limit output to 4096 tokens, making it much more difficult to translate longer pages. So while there are some nice advantages, it is absolutely non-trivial to get a robust translation system based on LLMs that can be used in practice.
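One way to work around the output limit would be to translate a page in pieces. The sketch below splits content at blank lines and groups paragraphs into chunks under a character budget. The budget is only a crude stand-in for tokens (the default of 8000 is an arbitrary assumption, and a real implementation would count tokens with the provider's tokenizer), and the function name is made up for illustration.

```python
def split_into_chunks(text, max_chars=8000):
    """Group blank-line-separated paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, size = [], [], 0
    for paragraph in text.split("\n\n"):
        # Account for the separator when the chunk already has content.
        extra = len(paragraph) + (2 if current else 0)
        if current and size + extra > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            extra = len(paragraph)
        current.append(paragraph)
        size += extra
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Joining the chunks with a blank line gives back the original text, so translated chunks can be reassembled the same way. Splitting at paragraph boundaries does not help when syntax spans several paragraphs (tables, macros with content), so this covers only the simple case.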

I also had some thoughts about using LLMs for updating translations, where I think they could be a big advantage. We could provide both the existing translation and the update, e.g., as a diff with some context, as input, and then the LLM could produce an update that takes the existing translation into account, changing only what was actually changed and keeping the style of the translation and the way certain words have been translated in the updated parts.
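As a rough illustration of that idea, the input for such an update could be assembled like this, using Python's standard `difflib` to produce the diff with context. The function name and the prompt wording are purely hypothetical, and no particular LLM API is assumed; the function only builds the text that would be sent.

```python
import difflib


def build_update_prompt(old_source, new_source, existing_translation, context_lines=3):
    """Build a prompt asking for a minimal update of an existing translation."""
    diff = "\n".join(
        difflib.unified_diff(
            old_source.splitlines(),
            new_source.splitlines(),
            fromfile="source (old)",
            tofile="source (new)",
            n=context_lines,   # unchanged lines kept around each change
            lineterm="",
        )
    )
    return (
        "Update the existing translation so that it reflects the changes in the "
        "source shown in the diff. Change only what the diff requires and keep "
        "the style and terminology of the existing translation.\n\n"
        "Diff of the source:\n" + diff + "\n\n"
        "Existing translation:\n" + existing_translation
    )
```

The result would still need the same kind of human review mentioned above, since nothing here prevents the model from changing more than the diff requires.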

+1 for the name as well.
I’m also curious about the extensibility for other tools, as well as the syntax preservation aspect.

The language switcher could almost go into XS when multilingual is on.

I’m fine with the name too, and I’m creating the GitHub repository and Jira project.
