PDF export for JavaScript-generated content

Hi devs,

Context

I’d like to discuss how we could improve our PDF export so that it works for content generated by JavaScript that modifies the DOM. For example, consider the {{chartjs}} macro, which displays graphs using JavaScript.

Right now we render the page on the server side using the XHTML renderer, convert this XHTML to XSL-FO and then ask FOP to convert it to PDF (not mentioning various substeps). This means that we don’t execute JavaScript and thus we don’t get the modified HTML DOM in the export, which is a big limitation.

Proposal 1: Headless Chrome

I had an initial discussion with Marius and Thomas and this is the idea we’ve come up with so far:

  • Run Chrome in headless mode on the server side. In order to make this portable across OSes, use a Chrome Docker image for that.
  • Control this Chrome instance from XWiki using some Java API.
  • More specifically, when exporting a page, ask Chrome to load the page URL and then to export it to PDF (a rough sketch follows this list).
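
To make this more concrete, here’s a minimal sketch of what the Java side could look like, assuming we drive Chrome through Selenium 4’s Chromium driver (which talks to Chrome over WebDriver/CDP). The wiki URL and the window.xwikiContentReady readiness flag are invented for illustration; authentication and the Docker wiring are left out:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.util.Base64;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.Pdf;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.print.PrintOptions;
import org.openqa.selenium.support.ui.WebDriverWait;

public final class HeadlessChromePDFExporter
{
    public static void main(String[] args) throws Exception
    {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");
        // Against a Chrome running in Docker we would rather use a RemoteWebDriver pointing at
        // the container; a local ChromeDriver keeps the sketch short.
        ChromeDriver driver = new ChromeDriver(options);
        try {
            // Authentication (e.g. passing the requesting user's session) is left out here.
            driver.get("https://wiki.example.org/xwiki/bin/view/Sandbox/ChartPage");
            // Wait until our JS signals that it has finished modifying the DOM
            // (window.xwikiContentReady is a hypothetical flag our code would have to set).
            new WebDriverWait(driver, Duration.ofSeconds(30)).until(
                d -> Boolean.TRUE.equals(((JavascriptExecutor) d)
                    .executeScript("return window.xwikiContentReady === true")));
            // Ask Chrome to print the current page to PDF (Page.printToPDF under the hood).
            Pdf pdf = driver.print(new PrintOptions());
            Files.write(Paths.get("export.pdf"), Base64.getDecoder().decode(pdf.getContent()));
        } finally {
            driver.quit();
        }
    }
}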

There are some challenges though; some of them are listed in a previous discussion we had on this topic back in 2017.

Let’s list what I have in mind (please add more if you can think of any):

  • The Chrome loading of the XWiki page must be authenticated as the same user, so that the rendering produces the same result as the page being viewed.
  • Handling of multipage exports. I can think of 2 solutions:
    • Prepare a transient XWiki document representing the full content to export (using {{display}} macros for all pages) and ask Chrome to export that page.
      • Problems: This will create a page in the wiki which will need to be deleted afterwards, not nice. It could also lead to a very heavy page (imagine wanting to export a full wiki to PDF).
      • Possible solution: create a service that does the concatenation
    • Loop over all pages and, for each one, ask Chrome to generate a PDF for it. Then merge the PDFs using a Java library such as PDFBox (a rough sketch follows this list).
  • Making sure the JS has finished executing before Chrome converts the page to PDF.
    • They solved this in this article too: they mention having to modify their code so that their JS can tell when the content is ready. Open question: that’s fine for our own code, but how do we handle 3rd-party code? How do we know it has finished loading? Do we have such cases?
  • Table of content and page numbering (also in the context of multipage exports)
    • The TOC could be solved by generating it server-side and then inserting it into the PDF using PDFBox or some other Java library.
    • Same for the page numbering, I guess. Even if there are options for controlling page numbering through Chrome’s DevTools Protocol (CDP), we would still need some manual updates to handle multipage exports if we use the PDF merging approach.
  • Lots of other small details that they solved in this article
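
For the PDF merging approach mentioned above, here’s a minimal sketch of what the merge step could look like, assuming PDFBox 2.x (the wrapping class is just for illustration):

import java.io.File;
import java.util.List;
import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public final class PDFMerger
{
    /** Merges the per-page PDFs produced by Chrome into a single document. */
    public static void merge(List<File> pagePdfs, File output) throws Exception
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        for (File pdf : pagePdfs) {
            merger.addSource(pdf);
        }
        merger.setDestinationFileName(output.getAbsolutePath());
        // Spill to temporary files so that very large exports don't blow up the heap.
        merger.mergeDocuments(MemoryUsageSetting.setupTempFileOnly());
    }
}

PDFBox can also write the document outline (bookmarks), which could be one way to stitch a global TOC back in, but the in-content TOC and the page numbering would still need separate handling.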

Any other issues you see that are not already listed?

Conclusion

While this approach is far from simple and raises lots of questions, it has the nice advantage of relying on Chrome, which is a well-maintained piece of software, with bugs fixed and improvements made regularly at each release. This seems a much better approach than FOP, which is moving a lot more slowly IMO. So I believe this approach has a higher chance of generating higher-quality PDFs (if we can solve all the points mentioned above).

If we implement proposal 1 and succeed in solving all the listed items, would you be ok with bundling this in XS (maybe turned on only when Docker is installed on the server, and ideally turning the standard FOP-based code into an extension, so that admins could decide which one to install/uninstall and thus control how many “export to PDF” buttons users will see in the export UI)? Even if it’s not bundled by default in XS, would you be ok with making it an extension that is part of platform and thus officially supported by the XWiki dev team? On my side, I think it would be a good thing since it would solve an important issue we currently have (JS not being executed).

WDYT? Do you see an alternative that would work better?

Thanks
-Vincent

One more:

  • Links.
    • I checked, and we do get clickable links (contrary to what was said back in this thread).
    • However, they’re displayed with both the link label and the URL next to it (see the attached screenshot).
      • One solution for this could be to use PDFBox or some other library to modify the generated PDF: visually mark the link and remove the URL. I also googled to see whether Chrome has some setting for how it displays links, but couldn’t find anything so far.
    • There’s also the issue of exporting several pages and having internal links between these pages.
      • This is not easy and, again, it could mean parsing the generated PDF(s) to find internal links and replace them (using PDFBox for example; a rough sketch follows this list).
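
To give an idea of what “replacing internal links” could mean in practice, here’s a hedged PDFBox 2.x sketch that rewrites URI link annotations pointing at exported pages into internal “go to page” actions. The URL-to-PDF-page mapping is assumed to be computed elsewhere, and the class itself is purely illustrative:

import java.io.File;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionGoTo;
import org.apache.pdfbox.pdmodel.interactive.action.PDActionURI;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation;
import org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink;
import org.apache.pdfbox.pdmodel.interactive.documentnavigation.destination.PDPageFitWidthDestination;

public final class InternalLinkFixer
{
    /** Rewrites links to exported wiki pages so that they jump to the matching page of the merged PDF. */
    public static void fix(File input, File output, Map<String, Integer> urlToPdfPageIndex) throws Exception
    {
        try (PDDocument document = PDDocument.load(input)) {
            for (PDPage page : document.getPages()) {
                for (PDAnnotation annotation : page.getAnnotations()) {
                    if (!(annotation instanceof PDAnnotationLink)) {
                        continue;
                    }
                    PDAnnotationLink link = (PDAnnotationLink) annotation;
                    if (!(link.getAction() instanceof PDActionURI)) {
                        continue;
                    }
                    Integer target = urlToPdfPageIndex.get(((PDActionURI) link.getAction()).getURI());
                    if (target != null) {
                        // Replace the external URI action with an internal destination.
                        PDPageFitWidthDestination destination = new PDPageFitWidthDestination();
                        destination.setPage(document.getPage(target));
                        PDActionGoTo goTo = new PDActionGoTo();
                        goTo.setDestination(destination);
                        link.setAction(goTo);
                    }
                }
            }
            document.save(output);
        }
    }
}

Note that this only touches the link annotations; removing the URL text printed next to the label is a different (and harder) problem, as the migration guide linked below hints.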

Note that replacing text in a PDF using PDFBox doesn’t seem to be easy, see the Apache PDFBox 2.0.0 Migration Guide.

See also the “Search and replace text in PDF using Java” question on Stack Overflow (about iText).

Proposal 2: Save as image

Another proposal would be to modify each macro or feature that uses JavaScript so that it generates an image of the result and saves it server-side, and then to include that image when printing to PDF.

Note that we would still need to modify the code to know when the JS has finished adding DOM elements, so that the image can be saved server-side.

The drawbacks are:

  • Each macro or feature using JS needs to be modified to support PDF export
  • The quality of the image in the generated PDF would probably be quite low
  • Performance problems, since the save would be done on each rendering
  • It’s unclear how it would work for multipage exports, since each page is not displayed in that case and thus no JS is executed…

Pros:

  • Simpler to implement for a small subset of use cases
  • Can also work for exporting to PDF using the LaTeX extension (the LaTeX extension operates only on the XDOM and not on HTML, except for the HTML macro which is parsed to XDOM).

Thanks for the full description of the options. I’m a bit surprised that there’s no way nowadays to trigger “print to PDF” from the browser’s native print features: I googled a bit, but apparently we can only open the print dialog, without specifying any options. That would have been the best option for this feature IMO.

I see one possible issue with headless Chrome:

  • interactive JavaScript that would not be displayed the same way as in the client’s browser. For example, you have a chart with 3 graphs and the user activated only one of them client-side (the others are hidden): I don’t think we’d have an easy way to capture that.

Now, besides that, I’m actually a bit afraid of the headless Chrome solution: our experience with testing XWiki with Docker, and with using browsers in Docker, is not that good, and even if in this case it would definitely be simpler, I’m still afraid it would be heavy machinery to maintain.

For that reason I think I prefer Proposal 2, and for that one I guess some drawbacks could probably be solved by maintaining a cache of images for each macro: basically providing an API to save the images on the server and cache them automatically. We’d still need to modify the existing macros to generate the image and call that new API, but it would solve the performance issue, and possibly the multipage issue too.
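
Just to illustrate the shape such an API could take (all names here are hypothetical, nothing of this exists today):

import org.xwiki.component.annotation.Role;

// Hypothetical server-side API that macros would call (names invented for illustration).
@Role
public interface MacroImageCache
{
    /** Stores the image rendered client-side for a macro occurrence and returns its URL. */
    String store(String documentReference, String macroOccurrenceId, byte[] pngData);

    /** Returns the cached image, or null if it hasn't been generated yet (e.g. during a multipage export). */
    byte[] get(String documentReference, String macroOccurrenceId);
}

The PDF export (or the {{display}}-based aggregation) would then embed the last cached image instead of relying on the JS being executed.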

So I’d be +1 for focusing on providing such an API, and starting to modify the most used macros for which we want an export (e.g. chartjs). I’m -0 for proposal 1: it might allow exporting any JS-generated content, but it sounds like a really expensive solution, and I’m not sure we need it.

It wouldn’t work for multipage exports, and not all browsers support printing to PDF (and not with the same quality when they do). Hence the headless Chrome idea. See also http://markmail.org/message/ztcwibiuoqfjcnjo where we discussed this in the past.

The problem with proposal 2 is that it cannot work well enough, simply because of the quality aspect (PDFs can be zoomed and printed, and images will give poor results). Also, how would you handle multipage exports?

Is this really a problem? I don’t think so.

Proposal 3: Use the browser’s Print function

Another proposal is to use the browser’s native print function. The flow would be like this:

  • The user chooses “Export” from the “More actions” menu and then “Export as PDF” from the modal, which opens the “PDF Export Options” page.
  • The user fills in the PDF options and clicks “Export”.
  • At this point some JavaScript code kicks in and:
    • adds a loading animation
    • computes the print URL based on the PDF options
    • injects a hidden iframe into the current page pointing it to the print URL
    • waits for the iframe to load and also for the JavaScript within the iframe to be ready (see Proposal 1)
    • removes the loading animation
    • calls iframe.contentWindow.print()
    • the user is then free to either download the PDF or send it to a printer

The service behind the print URL should return HTML matching the print options. It should be able to:

  • aggregate child pages (or the pages specified in the URL query string)
  • generate a global ToC (based on the aggregated pages)
  • replace links with anchors when the target page is included in the print
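
Here’s a rough sketch of what the service behind the print URL could look like as a Java component (purely illustrative, all names are hypothetical):

import java.util.List;
import java.util.Map;

// Hypothetical interface for the service behind the print URL (names invented for illustration).
public interface PrintHTMLService
{
    /**
     * Returns a single, self-contained HTML document for the given pages, honoring the export
     * options (cover, toc, ...): aggregated content, global ToC, internal links turned into anchors.
     */
    String renderForPrint(List<String> pageReferences, Map<String, Object> exportOptions);
}

Exporting to PDF (via the user’s browser or headless Chrome) would then be just one way to consume this HTML, as mentioned below.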

The problem of the link URL being displayed after the link label remains, but at worst, if we can’t find an alternative, we could drop the link.

Actually that’s easy to fix:

@media print {
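  /* Cancel the generated :after content that appends the URL after the link label. */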
  a:link:after, a:visited:after {
    content: none;
  }
}

The service is what makes the most sense IMO, and it could be useful for other use cases (exporting the result as PDF being just one way to exploit the output of this service).

Can be covered by the service too.

Note that the print service I mentioned in Proposal 3 would behave in a similar way to the current code that does PDF export on the server, only that instead of transforming the HTML to XSL-FO it would return it to the browser. This basically means extending our “Print Preview” feature to support all the configuration options supported by “Export as PDF” (cover, table of contents, etc.).

So it seems this is essentially the same solution as proposal 1, the difference being that the browser used is the user’s browser. And yes, for multipage and content manipulation we should indeed use an action, as we do for print ATM, to stream the full content to the caller. However, it’ll mean having the browser load it fully in the iframe, which means it won’t scale for large PDF exports (like exporting a 1M-page wiki to PDF, for example - taking an extreme UC here ;)). In any case it seems much simpler to control the HTML sent rather than operating on the produced PDFs and merging them (it seems manipulating PDFs is not that easy after all).

ATM I still prefer proposal 1, but with the print service that you mentioned, simply because we control the browser and the output quality. It’s heavier though. So maybe we could do both, since the only difference between the two approaches is the iframe vs the CDP call, and the rest is the same.

Doing both is interesting since we could offer it OOB with the user’s browser and have an extension to make it work with Docker and headless Chrome in case:

  1. the user’s browser doesn’t support printing to PDF,
  2. the user’s browser doesn’t produce nice PDFs,
  3. the user is on mobile,
  4. the user’s machine has less power than the server (generating large exports could be CPU-hungry).

WDYT?

We should definitely go for 3 as a default. It could be interesting to propose an extension that does the export through a dedicated headless browser instead of the user’s one, but I’m not sure it’s really critical for the feature.

+1

Does FF support printing to PDF natively? I googled quickly and it seems it doesn’t, only through extensions. For Edge it seems to be supported (since it’s now based on Chromium I guess the output is very similar).

Answering myself: it does (I just tested).

However, the results are different between FF and Chrome, and it doesn’t seem like a good idea for an XWiki instance to generate different PDFs depending on who does the export… So I believe the headless Chrome solution is more important than you may think, and not something optional to me. It’s quite likely that replacing our current PDF export with the local-browser one is going to be considered by several users as a regression on this aspect, and it’ll raise questions. We’ll have a solution (install the headless Chrome extension) but it’s a bit less seamless than what we have now. In exchange you get better PDF fidelity.

(screenshot attached)

Also, as I mentioned above, we don’t have to abandon the FOP-based solution immediately; we can offer the various options in parallel for some time, we just need to make it configurable.