How to batch import thousands of markdown (or office) documents?

I see that there is the Batch Import Application (XWiki.org),
but it only allows importing Excel or CSV files.

What is the recommended way in XWiki to batch import external (non-XWiki) documents?
Because of its importance, I think it would make a good feature request.

Could this be tied to XWiki’s future strategy/roadmap, in order to allow more integrations and APIs?

I think it depends on the source and type of these documents.

For example, if you want to batch import GitHub pages, you’d use the https://extensions.xwiki.org/xwiki/bin/view/Extension/GitHub%20Importer%20Application/ extension. The same idea applies to Confluence, MediaWiki, etc.

Generally speaking the strategy is to use the Filter Stream application and implement an input Filter, see:

Thanks

Thank you @vmassol, I’ll have a look at how the Filter Streams and the Filter module work. They do not appear easy for end users, though.

Although this is a general question, my specific use case is the import of many Google Drive (and Google Workspace) documents. By default, Google lets you download these Google Docs as .docx files. I need to convert these documents to XWiki pages.

I saw the (paid) Google Apps integration app, but that does not seem to convert the docs to XWiki pages.

Would the Filter Streams Converter be useful for this use case?

Pandoc also includes a docx-to-xwiki converter. For example (after installing Pandoc):

pandoc -s file.docx --wrap=none -t xwiki -o file.xwiki
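
For thousands of files, the same command can be run in a loop. Below is a minimal Python sketch, assuming Pandoc is on the PATH and all .docx files sit in one folder. The function names and the folder names are only illustrative; the Pandoc flags are exactly the ones from the command above.

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path: Path, out_dir: Path) -> list[str]:
    """Build the same pandoc call as above for one .docx file."""
    out_path = out_dir / (docx_path.stem + ".xwiki")
    return ["pandoc", "-s", str(docx_path), "--wrap=none",
            "-t", "xwiki", "-o", str(out_path)]


def convert_all(src_dir: Path, out_dir: Path) -> list[Path]:
    """Convert every .docx in src_dir; return the written .xwiki paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    converted = []
    for docx_path in sorted(src_dir.glob("*.docx")):
        cmd = pandoc_command(docx_path, out_dir)
        # check=True raises CalledProcessError if pandoc reports a failure
        subprocess.run(cmd, check=True)
        converted.append(Path(cmd[-1]))
    return converted
```

Usage would be something like `convert_all(Path("gdrive_export"), Path("xwiki_out"))`, where both folder names are placeholders.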

The resulting output in xwiki format can then be uploaded using the XWiki REST API (this could be automated, e.g. using Python), or copied and pasted manually into the XWiki editor.
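
A rough sketch of the REST upload in Python, using only the standard library. It assumes the usual page resource layout (`/rest/wikis/{wiki}/spaces/{space}/pages/{page}`), basic authentication, and a non-nested space. The base URL, wiki name, space, credentials, and the `xwiki/2.1` syntax id are placeholders to adapt to your instance, and the helper names are made up for this example.

```python
import base64
import urllib.request
from urllib.parse import quote
from xml.sax.saxutils import escape


def page_url(base_url, wiki, space, page):
    """REST URL of a page; each path segment is percent-encoded."""
    return "{}/rest/wikis/{}/spaces/{}/pages/{}".format(
        base_url.rstrip("/"), quote(wiki, safe=""),
        quote(space, safe=""), quote(page, safe=""))


def page_xml(title, content, syntax="xwiki/2.1"):
    """XML body for the page; the syntax id is an assumption."""
    return ('<page xmlns="http://www.xwiki.org">'
            "<title>{}</title><syntax>{}</syntax>"
            "<content>{}</content></page>").format(
        escape(title), escape(syntax), escape(content)).encode("utf-8")


def upload_page(base_url, wiki, space, page, content, user, password):
    """PUT one page and return the HTTP status code of the response."""
    token = base64.b64encode(
        "{}:{}".format(user, password).encode("utf-8")).decode("ascii")
    request = urllib.request.Request(
        page_url(base_url, wiki, space, page),
        data=page_xml(page, content),
        method="PUT",
        headers={"Content-Type": "application/xml",
                 "Authorization": "Basic " + token})
    with urllib.request.urlopen(request) as response:
        return response.status
```

For a batch, you would read each converted .xwiki file and call `upload_page("http://localhost:8080/xwiki", "xwiki", "Imported", name, text, user, password)` in a loop; it is worth trying this on a test wiki first.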

One caveat, however (as with all such conversions): Word files contain so much awkward hidden formatting that a cleanup is nearly always necessary, because Pandoc (like any other tool) can’t interpret everything correctly. I suppose that’s one of the reasons why this isn’t offered by the XWiki Google Apps integration.

If you mean that the UI of the Filter Stream app is not easy to use, then yes; that’s mostly because the app is generic. Specific extensions can provide a custom UI to make it simpler to use.

As for converting office files, I don’t think there’s any batch import on extensions.xwiki.org, but I could be wrong; I don’t know the domain well.

It’s possible that the XWiki SAS company has done some work for their clients in this area though. You could try to contact them I guess.

For completeness, here is the link to the related Jira issue:
https://jira.xwiki.org/browse/XWIKI-14571

Who decides whether a Jira issue is minor, normal or major?

@Lukas I experimented with Pandoc, and the conversion from docx to xwiki has issues, which I will post in a new topic; I think it may require some attention. The Pandoc conversion from docx to CommonMark goes well, and then switching the syntax to xwiki inside XWiki works too. See Pandox docx to xwiki has several issues

@Yuri The Batch Import Application seems to be broken. It used to work before v14.8 or so. Now the macro fails to execute:
[Screenshot 2022-10-30 185227]