Macros parsing in Cristal

This post is more of an announcement than a proposal: it explains the choices we are making for the macro parsing system in Cristal.

Currently, Markdown is parsed through Unified and Remark, the latter using Micromark under the hood. We provide the textual input, and get a fully type-safe AST, which we then recursively iterate over to generate a Universal AST which can be used with any supported editor, any design system, and any framework.

The problem we currently have is how Remark handles parsing nested elements. When we iterate over AST nodes, we find text nodes and search for the {{macro /}} syntax in them. This works fine, except for some very important cases: if the macro syntax contains any Markdown-specific symbol, it throws off the parser.

As an example, the following content:

{{macro attr="value_a_b" /}}

This fails because Remark first parses _a_ as italic text, leaving {{macro attr="value and b" /}} as the surrounding text nodes. As a result, we get three nodes instead of a single text node, and in some scenarios this becomes far too complex to handle.

The following content causes problems for the same reason:

{{macro}}
content_a_b
{{/macro}}

To avoid this problem, we first decided to go with a custom Remark plugin, which turned out to be far too complicated. Macros have a very complex syntax compared to basic Markdown elements: they accept a list of attributes, each attribute value can contain escaped characters (e.g. attr="Some \" value"), and there are two different syntaxes, one for self-closing macros ({{macro /}}) and one for contentful macros ({{macro}}Content{{/macro}}).
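
To make the escaping requirement concrete, here is a minimal sketch of an attribute-list parser. It is not Cristal's actual code: the function name parseAttributes, the set of characters allowed in attribute names, and the error messages are all assumptions made for illustration.

```typescript
// Hypothetical helper (not Cristal's real implementation): parses a list of
// name="value" pairs, honouring backslash escapes inside quoted values.
function parseAttributes(input: string): Record<string, string> {
  const attrs: Record<string, string> = {};
  let i = 0;
  while (i < input.length) {
    // Skip whitespace between attributes.
    while (i < input.length && /\s/.test(input.charAt(i))) i++;
    if (i >= input.length) break;

    // Read the attribute name (assumed charset: letters, digits, "_" and "-").
    const nameStart = i;
    while (i < input.length && /[A-Za-z0-9_-]/.test(input.charAt(i))) i++;
    const name = input.slice(nameStart, i);
    if (name === "" || input.charAt(i) !== "=" || input.charAt(i + 1) !== '"') {
      throw new Error("Malformed attribute at offset " + nameStart);
    }
    i += 2; // skip the '="' that opens the value

    // Read the value up to the first unescaped double quote.
    let value = "";
    let closed = false;
    while (i < input.length) {
      const ch = input.charAt(i);
      if (ch === "\\" && i + 1 < input.length) {
        value += input.charAt(i + 1); // keep the escaped character verbatim
        i += 2;
      } else if (ch === '"') {
        closed = true;
        i++;
        break;
      } else {
        value += ch;
        i++;
      }
    }
    if (!closed) {
      throw new Error('Unterminated value for attribute "' + name + '"');
    }
    attrs[name] = value;
  }
  return attrs;
}
```

Because the value is read character by character, Markdown-specific symbols such as the underscores in value_a_b are treated as plain text.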

For reference, the Remark plugin for parsing tables (which support neither nesting nor escaping) is about 1300 lines of code. For us, writing a proper plugin would likely mean 1500+ lines, for three reasons:

  • Remark is not just a parser, but a Markdown processor. Parsing custom content requires making both a tokenizer (Markdown to list of tokens), and a syntax tree producer (tokens list visitor to abstract syntax tree)
  • As stated above, macros have some pretty complicated syntax, supporting both escaping inside attributes and nesting inside their content
  • We need to parse macros syntax after code blocks (as a macro syntax inside a Markdown code block is not actually a macro), but before other elements such as bold and italic styling, lists, tables, and so on.
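
To illustrate the nesting requirement from the list above, the following sketch matches opening and closing macro tags with a stack. It is only an illustration under simplifying assumptions: attributes and self-closing macros are left out, and the MacroSpan type and findContentMacros function are hypothetical names, not part of Cristal or Remark.

```typescript
// Hypothetical types and names, for illustration only.
interface MacroSpan {
  name: string;
  start: number; // offset of the opening tag
  end: number; // offset just past the closing tag
  children: MacroSpan[];
}

function findContentMacros(text: string): MacroSpan[] {
  // Simplification: tags carry no attributes here.
  const tag = /\{\{(\/?)(\w+)\}\}/g;
  const roots: MacroSpan[] = [];
  const stack: MacroSpan[] = [];
  let m: RegExpExecArray | null;
  while ((m = tag.exec(text)) !== null) {
    const closing = m[1] === "/";
    const name = m[2] ?? "";
    if (!closing) {
      // Opening tag: attach to the innermost open macro, or to the root list.
      const span: MacroSpan = { name, start: m.index, end: -1, children: [] };
      const parent = stack[stack.length - 1];
      (parent !== undefined ? parent.children : roots).push(span);
      stack.push(span);
    } else {
      // Closing tag: it must match the innermost open macro.
      const open = stack.pop();
      if (open === undefined || open.name !== name) {
        throw new Error("Unexpected closing tag for " + name + " at offset " + m.index);
      }
      open.end = m.index + m[0].length;
    }
  }
  const unclosed = stack[stack.length - 1];
  if (unclosed !== undefined) {
    throw new Error("Unclosed macro " + unclosed.name);
  }
  return roots;
}
```

Each returned span records where a macro starts and ends in the source text, with nested macros listed under their parent.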

For those reasons, we decided to go with another approach, akin to a “hack”, but much simpler to implement:

  • When given Markdown content, the parser iterates over it and finds all the macro usages in it, ignoring code blocks.
  • All macros are parsed properly (returning appropriate errors if ill-formed), and turned into HTML comments with a prefix (e.g. <!-- @cristalMacro: {...some object representing the parsed macro...} -->)
  • The content is then parsed normally through Remark
  • When visiting the AST, comments starting with the prefix (e.g. @cristalMacro) are decoded and injected into the final UniAst object
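
The steps above can be sketched roughly as follows. This is a simplified illustration, not the actual Cristal implementation: it handles only self-closing macros and fenced code blocks, and the ParsedMacro shape, the JSON payload, the hyphen escaping, and all function names are assumptions. Only the @cristalMacro prefix comes from the description above.

```typescript
// Hypothetical shape of a parsed macro; the real object may differ.
interface ParsedMacro {
  name: string;
  attrs: Record<string, string>;
}

const PREFIX = "@cristalMacro:";
const FENCE = "`".repeat(3);
// The capturing group makes split() return fenced blocks at odd indexes.
const FENCED_BLOCK = new RegExp("(" + FENCE + "[\\s\\S]*?" + FENCE + ")");
// Simplified matcher for self-closing macros with quoted, escapable attributes.
const SELF_CLOSING = /\{\{(\w+)((?:\s+\w+="(?:\\.|[^"\\])*")*)\s*\/\}\}/g;

function parseAttrs(raw: string): Record<string, string> {
  const attrs: Record<string, string> = {};
  const re = /(\w+)="((?:\\.|[^"\\])*)"/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(raw)) !== null) {
    // Drop the backslash from escaped characters.
    attrs[m[1] ?? ""] = (m[2] ?? "").replace(/\\(.)/g, "$1");
  }
  return attrs;
}

function encodeMacro(macro: ParsedMacro): string {
  // Escape hyphens so the payload can never contain the comment terminator.
  // This is safe here because the payload only holds strings, no numbers.
  const json = JSON.stringify(macro).replace(/-/g, "\\u002d");
  return "<!-- " + PREFIX + " " + json + " -->";
}

// Run before handing the Markdown to the parser.
function macrosToComments(markdown: string): string {
  return markdown
    .split(FENCED_BLOCK)
    .map((part, index) =>
      index % 2 === 1
        ? part // fenced code block: leave untouched
        : part.replace(SELF_CLOSING, (_match: string, name: string, raw: string) =>
            encodeMacro({ name, attrs: parseAttrs(raw) })
          )
    )
    .join("");
}

// Run on the raw HTML of a comment found while visiting the AST.
function decodeMacro(html: string): ParsedMacro | null {
  const body = html.trim();
  if (!body.startsWith("<!--") || !body.endsWith("-->")) return null;
  const inner = body.slice(4, -3).trim();
  if (!inner.startsWith(PREFIX)) return null;
  return JSON.parse(inner.slice(PREFIX.length)) as ParsedMacro;
}
```

With this sketch, {{macro attr="value_a_b" /}} becomes a single HTML comment before Markdown parsing, and decodeMacro recovers the macro name and attributes from that comment afterwards.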

This may seem more tedious, but it’s actually much simpler to do. As a comparison, the current macros parsing system (which does not support content but does support attributes and escaping) is less than 150 lines of code. Adding nesting support and the conversion to comments would probably bring it to about 250 to 300 lines, which is far more compact than a Remark plugin. It would also be a lot easier to read and maintain, since it does not rely on complex third-party APIs for parsing and processing.

This also keeps us decoupled from Remark: if we ever need to switch parsers, we can reuse that part instead of rewriting everything from scratch. Performance-wise, this approach is probably even faster than a Remark plugin, since it avoids a large number of function calls.

So, after this long explanation, we’re going forward with the “hack” method, which gives us much easier and faster development, easier maintenance, and partial independence from the underlying parser.