New contrib repository + design feedback request - Translator Extension

slauriere · June 5, 2024, 12:30pm

Hello everyone,

I’m currently developing a Translator Extension that will enable users to translate content using an external machine translation service such as DeepL and store the output in the wiki. I have created a feature design page for it (with a note on how it will make the existing DeepL extension obsolete). It’d be great to have your take on it, along with any suggestions.

The extension is sponsored both by the European Data News Hub, which features news articles from several European news agencies in nine languages, and by the International Identifier for serials (ISSN), the intergovernmental organization that manages the identification and description of serial publications.

While still working towards consensus and gathering improvement ideas, I would need to release an initial version as soon as possible. Hence may I already ask for a contrib repository and a Jira project? The extension name can be “translator” if that works for you.

Stéphane

MichaelHamann · June 6, 2024, 7:49am

That sounds like a great addition to XWiki! I think translator sounds like a good project name.

Do you plan to offer translator integrations beyond DeepL? I’m thinking in particular about services like LibreTranslate, an open source, self-hostable machine translation API.

How do you plan to preserve XWiki syntax during the translation? I did some experiments with LibreTranslate, and it basically destroyed a lot of syntax, randomly losing or duplicating syntax elements and translating parameter names like “width” or “height”, giving results like [[Bild:CollectionsOverview.png||alt="Collections Übersicht, zeigt zwei Kollektionen in einer Tabelle" Höhe="421" Breite="960"] (the second ] was indeed lost). I just did a similar test with DeepL, and it seems better, but it still replaces quotes, leading to syntax like [[image:CollectionsOverview.png||alt=„Collections overview, showing two collections in a table“ height=„421“ width=„960“]]. LibreTranslate has some support for HTML, but in a test it lost HTML comments, which would make it a bit useless for our annotated HTML. It could still be an interesting solution, though: if necessary, we could simply improve LibreTranslate, taking advantage of the fact that it is open source. In any case, I have the impression that properly preserving formatting is a major challenge.
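
A cheap safeguard against this kind of damage could be to compare the structural tokens of the source and of the translated text before storing the result. The sketch below is only an illustration: the `SyntaxCheck` class, its token list and its method names are made up for this post and are not part of XWiki or of any extension.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of XWiki or of the extension.
final class SyntaxCheck {
    // Structural tokens of XWiki link/image syntax that a translation should not add or drop.
    private static final String[] TOKENS = {"[[", "]]", "||", "\""};

    // Counts non-overlapping occurrences of token in text.
    static int count(String text, String token) {
        int count = 0;
        int index = text.indexOf(token);
        while (index >= 0) {
            count++;
            index = text.indexOf(token, index + token.length());
        }
        return count;
    }

    // Returns the tokens whose number of occurrences differs between source and translation.
    static List<String> damagedTokens(String source, String translated) {
        List<String> damaged = new ArrayList<>();
        for (String token : TOKENS) {
            if (count(source, token) != count(translated, token)) {
                damaged.add(token);
            }
        }
        return damaged;
    }
}
```

Assuming the source was the [[image:...]] line with straight quotes, the LibreTranslate result would be flagged for `]]` and the DeepL result for `"`. A count-based check like this would not notice translated parameter names such as Höhe; catching those needs actual parsing.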

I also tried translating with LLMs and what I found is that they are much better at preserving XWiki syntax and recognizing which parts could/should be translated, but the translation is sometimes less direct, and LLMs tend to invent content, in particular when there is, e.g., an unfinished sentence in the original content. I had the impression, though, that an LLM-based translation could be a great input for human review. Further, it is relatively straightforward to give an LLM further instructions about, e.g., the style of the translation, how to translate certain terms etc. That said, using LLMs comes with a heavy performance/computational resource cost, and it is also not straightforward to deal with the context limit: even with generous input limits, many LLM providers limit output to 4096 tokens, making it much more difficult to translate longer pages. So while there are some nice advantages, it is absolutely non-trivial to get a robust translation system based on LLMs that can be used in practice.
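
One way to work around the output limit is to translate a long page in pieces. The following is a minimal sketch, assuming that paragraphs are separated by blank lines and that a character budget is an acceptable stand-in for a token budget; `PageChunker` and `maxChars` are hypothetical names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: groups paragraphs into chunks that stay under a character budget.
final class PageChunker {
    static List<String> split(String content, int maxChars) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String paragraph : content.split("\n\n")) {
            // Close the current chunk if adding this paragraph (plus separator) would exceed the budget.
            if (current.length() > 0 && current.length() + 2 + paragraph.length() > maxChars) {
                chunks.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append("\n\n");
            }
            current.append(paragraph);
        }
        if (current.length() > 0) {
            chunks.add(current.toString());
        }
        return chunks;
    }
}
```

A paragraph longer than `maxChars` is kept whole as an oversized chunk, and splitting on blank lines can cut through a macro that spans several paragraphs, so this alone is far from a robust solution.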

I also had some thoughts about using LLMs for updating translations, where I think they could be a big advantage. We could provide both the existing translation and the update, e.g., as a diff with some context, as input, and then the LLM could produce an update that takes the existing translation into account to only change what was actually changed and to keep the style of the translation or how certain words have been translated in the updated parts.

mleduc · June 6, 2024, 8:36am

+1 for the name as well.
I’m also curious about the extensibility for other tools, as well as the syntax preservation aspect.

The language switcher could almost go into XS when multilingual is on.

tmortagne · June 6, 2024, 8:56am

I’m fine with the name too, and I’ve created the GitHub repository and Jira project.

vmassol · June 28, 2024, 9:09am

I feel that “Translator” is not a great name as it’s not very descriptive. Translator of what?

Also, I’ve just seen in Jira that we have 2 extensions named very similarly. What are the differences?

Thx

vmassol · June 28, 2024, 9:10am

To answer myself it seems related to this repo:

This shows that the naming is not great.

Since the desc of the translator app is about machine translation, it should contain that info IMO, something like machine-page-translation if it’s about translating pages. If it’s about translating any type of content (pages, xobjects), maybe machine-content-translation?

slauriere · June 28, 2024, 12:03pm

Thanks Vincent and everyone for your feedback about the name (as for the implementation, I’ll follow up on Michael’s feedback in a few minutes).

I agree that introducing machine-translation in the name would be more self-explanatory. The extension will allow translating both page content and page object fields. What about simply using machine-translation, since the term Machine translation refers to “approaches to translation of text or speech from one language to another”, hence encompassing the notion of “content”? In case we cover speech in the future, we could introduce two submodules: machine-translation-text and machine-translation-speech. What do you all think?

It’d be great if we could reach an agreement today, because the sponsors expect it to be available by the end of June. However, I fully understand that it’s important to reach a consensus before the first release, so please let me know what’s doable.

MichaelHamann · June 28, 2024, 12:16pm

+1 for machine-translation, sounds good to me. I wouldn’t add page or content as it doesn’t seem self-explanatory what it means - for me page would be everything including attachments and XObjects while content could be understood as just the content field.

How do you store translated page object fields? XWiki currently doesn’t support translating XObjects.

slauriere · June 28, 2024, 12:50pm

I’m answering on the aspects of extensibility, syntax preservation and LLMs. Thank you Michael for your remarks, and apologies for the late answer.

Extensibility

Do you plan to offer translator integrations beyond DeepL? I’m thinking in particular about services like LibreTranslate, an open source, self-hostable machine translation API.

Yes, this is the plan. I have introduced a Translator interface implemented by an AbstractTranslator which can be specialized: for instance a DeeplTranslator component is provided by a distinct extension and delegates the translation to the Java DeepL Translator Client using DeepL TranslatorOptions. There’s also a generic Usage interface for reporting the number of translated characters in the current period and the period limit.
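
To give an idea of the shape, here is a simplified sketch. It is not the actual code: the method names and signatures are assumptions made for this post, `EchoTranslator` is a made-up stand-in for a service-backed component such as DeeplTranslator, and usage is counted locally whereas a real service reports it itself.

```java
import java.util.Locale;

// Hypothetical shapes for illustration only; the real interfaces may differ.
interface Usage {
    long getCharacterCount();   // characters translated in the current period
    long getCharacterLimit();   // limit for the current period
}

interface Translator {
    // The html flag tells the service whether content is HTML or plain text.
    String translate(String content, Locale from, Locale to, boolean html);

    Usage getUsage();
}

abstract class AbstractTranslator implements Translator {
    private final long characterLimit;
    private long characterCount;

    protected AbstractTranslator(long characterLimit) {
        this.characterLimit = characterLimit;
    }

    @Override
    public final String translate(String content, Locale from, Locale to, boolean html) {
        String result = doTranslate(content, from, to, html);
        // Simplification: usage is tracked locally instead of being fetched from the service.
        characterCount += content.length();
        return result;
    }

    // Service-specific part, provided by each specialization.
    protected abstract String doTranslate(String content, Locale from, Locale to, boolean html);

    @Override
    public Usage getUsage() {
        final long count = characterCount;
        final long limit = characterLimit;
        return new Usage() {
            @Override public long getCharacterCount() { return count; }
            @Override public long getCharacterLimit() { return limit; }
        };
    }
}

// Made-up specialization that only tags the text with the target language.
final class EchoTranslator extends AbstractTranslator {
    EchoTranslator(long characterLimit) {
        super(characterLimit);
    }

    @Override
    protected String doTranslate(String content, Locale from, Locale to, boolean html) {
        return "[" + to.getLanguage() + "] " + content;
    }
}
```

The point of the abstract class is that a new service only has to provide the service-specific call, while shared concerns stay in one place.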

Each Translator web service may have its own way of granting access to remote calls. For now the extension simply supports a token; other methods should probably be supported in the future.

The current Translator interface is located here. It can certainly be improved, and maybe I’ll add the @Unstable annotation before the release. It includes in particular the ability to take into account a translation glossary when performing a translation, thanks to @Josue and @caubin’s work, and also the ability to choose among several strategies when computing the translated document reference (same approach as with Entity Name Validation Strategies). Two strategies are bundled in the extension:

Same location strategy: translated documents are stored at the same location as the original document (just as XWiki does)
Prefix location with language strategy: e.g. “/en/company” becomes “/fr/societe”
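
Roughly, the two strategies behave as in the sketch below. The names (`LocationStrategy`, `translatedLocation`) are hypothetical and plain string paths stand in for real document references, so this is not how the extension actually implements them.

```java
import java.util.Locale;

// Hypothetical interface: computes where the translated document is stored.
interface LocationStrategy {
    String translatedLocation(String originalPath, Locale target, String translatedName);
}

// Translated documents stay where the original is.
final class SameLocationStrategy implements LocationStrategy {
    @Override
    public String translatedLocation(String originalPath, Locale target, String translatedName) {
        return originalPath;
    }
}

// The leading language segment is swapped and the last segment takes the translated name.
final class LanguagePrefixStrategy implements LocationStrategy {
    @Override
    public String translatedLocation(String originalPath, Locale target, String translatedName) {
        // "/en/company" splits into ["", "en", "company"].
        String[] segments = originalPath.split("/");
        if (segments.length < 3 || !segments[0].isEmpty()) {
            throw new IllegalArgumentException("Expected a path like /en/company: " + originalPath);
        }
        segments[1] = target.getLanguage();
        segments[segments.length - 1] = translatedName;
        return String.join("/", segments);
    }
}
```

Intermediate path segments are left untranslated here, which is a simplification.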

The app configuration also allows forcing the “same location strategy” for specific classes (e.g. GlossaryClass) when the language prefix strategy is used elsewhere.

It also allows declaring the class field references that should be translated, e.g. XWiki.Help.Movie.MovieClass^plot (relevant only when the translations are stored in different locations).
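
For illustration, such a reference can be read as a class name and a field name separated by `^`. The `ClassFieldReference` class below is a hypothetical stand-in that works on plain strings; it is not the type the extension uses.

```java
// Hypothetical value class: splits "Class^field" into its two parts.
final class ClassFieldReference {
    final String className;
    final String fieldName;

    private ClassFieldReference(String className, String fieldName) {
        this.className = className;
        this.fieldName = fieldName;
    }

    static ClassFieldReference parse(String reference) {
        // The field name follows the last '^'; both parts must be non-empty.
        int separator = reference.lastIndexOf('^');
        if (separator <= 0 || separator == reference.length() - 1) {
            throw new IllegalArgumentException("Expected Class^field but got: " + reference);
        }
        return new ClassFieldReference(reference.substring(0, separator), reference.substring(separator + 1));
    }
}
```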

Syntax

Regarding the syntax: with DeepL I also found that some syntax was not preserved. Not much is destroyed, but sometimes double brackets are turned into single ones. Hence we chose to convert wiki syntax to annotated HTML and then back. The Translator interface has a boolean parameter for specifying whether the content is HTML or not.

LLMs

Indeed, using LLMs would certainly be very interesting. DeepL offers the possibility to choose between a few styles (automatic, formal, informal), but it’s far from offering the customisability of LLMs, and the combination of services (summarization etc.). It will be interesting to compare the approaches and the results.

Macros

We will have to consider how we want to handle macro content and macro parameters: some of them will have to be translated, while some others should not. I’ll expose issues related to this asap.

Thanks again

slauriere · June 28, 2024, 12:55pm

Indeed, could be nice. I created TRANSLATOR-1 for the record.

slauriere · June 28, 2024, 1:06pm

Indeed. The option to translate page object fields is meant to be used only when the “language prefix naming strategy” is used. I created TRANSLATOR-2 to improve the administration along this line.

slauriere · June 28, 2024, 1:25pm

OK cool thanks. Any objection to this @mleduc, @vmassol? I think Thomas is unavailable today. Should we wait for his approval, or is it acceptable to change the repository name and JIRA key if everyone except him has agreed so far? Sorry for the rush… In case he has a strong point, we can still change the app identifier and add the previous one as a feature, am I correct?

mleduc · June 28, 2024, 2:28pm

lgtm, +1

MichaelHamann · June 28, 2024, 2:35pm

Changing the module name is a bit difficult. While you can indeed add the previous one as a feature, this only works really well for dependencies. For upgrades, the extension manager won’t automatically propose upgrading to the extension with the new name. When the user installs the extension with the new name, though, it will be offered as an upgrade of the old extension.

vmassol · July 1, 2024, 9:10am

Yes, I’m fine with the name machine-translation and a similar name for the glossary module. If the glossary extension has not been released, you don’t even need a feature; otherwise, yes, you need a feature.

slauriere · July 1, 2024, 9:33am

Thank you Vincent. @tmortagne : would machine-translation work with you as well, and if it does could you please rename the repository and the Jira project?

The “MT” acronym for “Machine Translation” seems to be in use, so it could be a good fit as a Jira key, as you wish.

tmortagne · July 1, 2024, 9:49am

Done renaming the github repository and changing the jira id.