Formalisation of Cristal storage format

Hello all,

For the file system and Nextcloud storage, Cristal currently stores files at the root of a directory:

  • ~/.cristal for the file system
  • a .cristal directory at the root of the user’s storage for Nextcloud

In both cases, the content of the directory is the same:

  • one directory per page
  • a page.json file containing the content of the page as well as all the metadata (e.g., syntax, title, author…)
  • sub-pages are stored in subdirectories
  • attachments is a special keyword: attachments of page X are stored in X/attachments/

The current storage suffers from several limitations:

  • if a user creates a page named attachments, it can collide with the attachments directory of its parent page
  • users accessing the storage externally can’t easily read the page content
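
To make the collision concrete, here is a minimal sketch of the current layout. It is not Cristal code: the helper names and the /-separated page reference are assumptions made for illustration, and Python is used only to keep the sketch short.

```python
from pathlib import PurePosixPath


def page_dir(root: str, page_ref: str) -> PurePosixPath:
    """Directory of a page: one directory per segment of the page reference."""
    return PurePosixPath(root, *page_ref.split("/"))


def attachments_dir(root: str, page_ref: str) -> PurePosixPath:
    """Attachments of page X are stored in X/attachments/."""
    return page_dir(root, page_ref) / "attachments"


root = ".cristal"
# The attachments directory of page "X" and the directory of the
# sub-page "X/attachments" resolve to the same path.
collides = attachments_dir(root, "X") == page_dir(root, "X/attachments")
```

With this layout, nothing on disk distinguishes a sub-page named attachments from the attachment directory of its parent.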

Note: synchronization of content edited externally is currently out of scope.

Proposal

To address those limitations, I propose the following.

File and page names

Option A

  • pages and sub-pages are stored as directories and subdirectories; the page content is stored in a page.md file (e.g., mypage/page.md, or some/path/page.md)
  • attachment directories are now named .attachments
  • a page name can’t start with a dot
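
As a rough illustration of Option A, the mapping from a page reference to its files could look like the sketch below. The function name and the /-separated reference are hypothetical; only page.md, .attachments and the leading-dot rule come from the proposal.

```python
from pathlib import PurePosixPath


def option_a_paths(page_ref: str) -> dict:
    """Map a page reference such as "some/path" to its Option A files."""
    segments = page_ref.split("/")
    # Option A forbids page names starting with a dot, so that
    # ".attachments" can never be mistaken for a page directory.
    if any(segment.startswith(".") for segment in segments):
        raise ValueError("a page name can't start with a dot")
    page = PurePosixPath(*segments)
    return {
        "content": str(page / "page.md"),
        "attachments": str(page / ".attachments"),
    }
```

For example, option_a_paths("some/path") gives some/path/page.md for the content and some/path/.attachments for the attachments.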

Option B

  • spaces are directories, but pages are stored as ${pageName}.md; in this case, attachments are stored in .attachments_${pageName} subdirectories
  • a page name can’t start with a dot (to avoid a conflict if someone names a space .attachments_existingPageName)
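
The same mapping for Option B, again as a sketch with a hypothetical function name and a /-separated reference:

```python
from pathlib import PurePosixPath


def option_b_paths(page_ref: str) -> dict:
    """Map a page reference such as "space/mypage" to its Option B files."""
    *spaces, name = page_ref.split("/")
    # Names starting with a dot are rejected so that a space can never be
    # called ".attachments_<existingPageName>".
    if any(segment.startswith(".") for segment in [*spaces, name]):
        raise ValueError("a page name can't start with a dot")
    parent = PurePosixPath(*spaces)
    return {
        "content": str(parent / f"{name}.md"),
        "attachments": str(parent / f".attachments_{name}"),
    }
```

Here option_b_paths("space/mypage") gives space/mypage.md and space/.attachments_mypage, so the page file and its attachments sit side by side in the space directory.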

Page format

This section discusses specifically how we store the page content and its metadata.

Option 1 — Embedded metadata

In this case, the metadata is stored directly in the markdown file.
A common format is a header with JSON or YAML content.

---
title: My Page
creation-date: YYYY-MM-DD hh:mm
---

# Page H1

Some content....

This is quite common; it is, for instance, what Hugo and Jekyll do.

Cons:

  • if a backend supports markdown but not header metadata, the header is displayed as text, breaking the rendering
  • depending on how complex the data we want to store is, keeping it in the header might not scale
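
For reference, reading such a header can be sketched as follows. This is a deliberately minimal reader that only understands flat "key: value" lines between two --- delimiters; it is not a YAML parser, and the function name is an assumption.

```python
def split_front_matter(text: str) -> tuple:
    """Split a page into (metadata, body) using --- delimited front matter."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    for end in range(1, len(lines)):
        if lines[end].strip() == "---":
            break
    else:
        # No closing delimiter: treat the whole file as content.
        return {}, text
    metadata = {}
    for line in lines[1:end]:
        key, separator, value = line.partition(":")
        if separator:
            metadata[key.strip()] = value.strip()
    body = "\n".join(lines[end + 1:]).lstrip("\n")
    return metadata, body
```

A backend without front matter support would skip this step and render the header lines as plain text, which is the first con listed above.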

Option 2 — Embedded metadata

A variation of Option 1 is to store the metadata at the bottom of the page.
If the backend lacks support, the metadata would still pollute the rendered page, but only after the main content.
The con is that accessing the metadata requires reading the whole page content, which might not be optimal.

Option 3 — Side file

In this case, a pageName.json file would be stored alongside the pageName.md file, storing the page metadata.
Pro:

  • This removes the dependency on the backend supporting markdown headers

Cons:

  • This makes the move/rename page logic more complex and increases the chances of users breaking the storage by manually moving the markdown file without its json metadata file
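
To illustrate the extra move/rename logic, here is a small sketch of what a rename has to cover under Option 3. The function name is hypothetical, and it only computes the moves, it does not touch the disk.

```python
from pathlib import PurePosixPath


def rename_plan(directory: str, old_name: str, new_name: str) -> list:
    """List the (source, destination) moves needed to rename a page."""
    base = PurePosixPath(directory)
    # Both the markdown file and its json side file must move together,
    # otherwise the page loses its metadata.
    return [
        (str(base / f"{old_name}{ext}"), str(base / f"{new_name}{ext}"))
        for ext in (".md", ".json")
    ]
```

A user renaming only the .md file in a file manager performs just the first of these two moves, which is how the storage gets broken.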

Option 4 — Side directory

This is following the same logic as Option 3, but with a directory instead of a file.
In this case, a .pageName_metadata directory would be stored alongside the markdown file (same logic as Option B for the attachments).
This directory would itself store metadata in files.
The pro being that each extension could choose to store its own files (e.g., .pageName_metadata/extensionId/data.json), making the storage design more extensible.
Note: if we consider the attachment storage as an extension, attachments could even be stored in .pageName_metadata/attachments/.... But we can also consider attachments a primitive concept with its own rules, to make them easier to access in the architecture.
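
A sketch of how an extension could locate its own file under Option 4. The function name and the "comments" extension id in the usage note are made up for illustration; data.json is the example file name used above.

```python
from pathlib import PurePosixPath


def extension_metadata_path(
    directory: str, page_name: str, extension_id: str, file_name: str = "data.json"
) -> str:
    """Path of an extension's metadata file for a given page (Option 4)."""
    metadata_dir = PurePosixPath(directory) / f".{page_name}_metadata"
    # Each extension gets its own subdirectory, so extensions cannot
    # overwrite each other's files.
    return str(metadata_dir / extension_id / file_name)
```

For instance, extension_metadata_path("space", "mypage", "comments") gives space/.mypage_metadata/comments/data.json.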

Do you see something missing?
What are your preferred options?

My choice, before discussion, is:

  • Option B
  • Option 4 with a special handling for attachments

I feel like this is the combination leading to the most robust storage architecture while allowing users accessing the storage externally to easily understand the file structure.

Thanks for reading this far.

PS: Given the current state of the project, I’m -1 on introducing a migration tool from the old format to a new one, unless somebody explicitly asks for it.

You didn’t explain in this proposal how the directories are named exactly. If it’s the full reference of the page, it might quickly become a problem: those references can be long, exceeding the maximum path length of some filesystems.
AFAIR that’s exactly for this reason that we have a specific computation of the path for storing the attachments in XWiki.

For the location in the file system, please follow the XDG base directory specification, so for the data directory, it should be $XDG_DATA_HOME/cristal with a default of $HOME/.local/share/cristal in case $XDG_DATA_HOME isn’t set.
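
A minimal sketch of that lookup, assuming a POSIX system. The function name and the injectable env parameter are only there for illustration; per the XDG specification, a relative $XDG_DATA_HOME is considered invalid and ignored.

```python
import os
from pathlib import PurePosixPath


def cristal_data_dir(env=None) -> PurePosixPath:
    """Resolve the data directory following the XDG base directory spec."""
    env = os.environ if env is None else env
    xdg_data_home = env.get("XDG_DATA_HOME", "")
    # Unset, empty or relative values fall back to the default location.
    if xdg_data_home and PurePosixPath(xdg_data_home).is_absolute():
        return PurePosixPath(xdg_data_home) / "cristal"
    return PurePosixPath(env["HOME"]) / ".local" / "share" / "cristal"
```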

Regarding the directory layout, let me explain the layout that is used by DokuWiki:

  • All data files are by default in a data subdirectory (the location can be configured).
  • Page contents are in data/pages, filenames are in the form data/pages/space/subspace/pageName.txt.
  • Metadata is in data/meta/space/subspace/pageName.meta, with various other metadata files also existing that have different extensions (for example, there is .changes that contains the list of revisions as CSV).
  • Media files (so attachments) are in data/media/space/subspace/attachmentName.ext (where ext is the extension of the file). Note that media files are independent of wiki pages, they aren’t “attached” to a page.
  • Media metadata (like the list of revisions) is in data/media_meta/space/subspace/attachmentName.ext.changes.
  • Old revisions are in data/attic/space/subspace/pageName.TIMESTAMP.txt.gz, the compression is configurable and TIMESTAMP is the timestamp of the revision (revisions are identified by timestamp, there are no extra revision numbers in DokuWiki).
  • Old media file revisions are in data/media_attic/space/subspace/attachmentName.TIMESTAMP.ext
  • DokuWiki uses _ as special character at the start that cannot be the start of a page name. So for example data/meta/_dokuwiki.changes is the global change log for pages and data/meta/_media.changes the one for media files. A page named _template.txt is used as template in the current space, a page named __template.txt is used as template recursively in the current space. Users aren’t that happy with the template storage, though, as they cannot be edited through the UI, so this might not be a design to follow.
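
To summarize the page-related part of this layout, here is a sketch that derives the paths described above from a page reference. This is not DokuWiki code: the function name is hypothetical, the reference is written with / as in the paths above, and the .gz suffix assumes gzip compression (it is configurable).

```python
from pathlib import PurePosixPath


def dokuwiki_page_paths(page_ref: str, timestamp=None, data_dir: str = "data") -> dict:
    """Paths used for a page such as "space/subspace/pageName"."""
    root = PurePosixPath(data_dir)
    paths = {
        "content": str(root / "pages" / f"{page_ref}.txt"),
        "meta": str(root / "meta" / f"{page_ref}.meta"),
        "changes": str(root / "meta" / f"{page_ref}.changes"),
    }
    if timestamp is not None:
        # Old revisions are identified by their timestamp only.
        paths["old_revision"] = str(root / "attic" / f"{page_ref}.{timestamp}.txt.gz")
    return paths
```

The same relative path is reused under each root directory (pages, meta, attic), which is what keeps the different kinds of data apart.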

DokuWiki provides configurable encoding of special and non-ASCII characters in filenames, since some file systems might not support them. The data/meta directory also contains a list of all recent revisions, truncated from time to time, which makes it possible to show the most recently changed pages.

I’m not saying that DokuWiki’s file storage is ideal, but I think it’s an interesting example. Things I would do differently:

  • Store metadata in the page itself. I like option 1 as it is also commonly used by other tools as far as I know. The advantage of this vs. other options is that it is easy to have proper revision support for metadata - something that is missing in DokuWiki. That said, having a separate metadata directory to store properties that shouldn’t be edited directly and don’t need to be versioned (like the list of revisions or the creator of the page) could still be an interesting option.
  • Like XWiki, DokuWiki differentiates between terminal pages and pages that are the home of a space. While this makes sense from the point of view of the storage format, I think this distinction isn’t useful for users, and it would make sense to not support “terminal” pages. I haven’t checked how existing note-taking tools handle this; another possibility would be to treat space/subspace.md as the home of space/subspace, which might give a more natural directory hierarchy.

Also, I would advise against using . as the start of any directory that contains meaningful data as by default it will be hidden in file managers. I would rather suggest _ as in DokuWiki. Instead of forbidding this character as the start of a page name, you could also just encode it. However, note that file names are generally limited to 255 characters, and encoding can severely reduce the usable length. Still, to increase compatibility, I think you should think about an encoding scheme.
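
One possible encoding scheme, purely as a sketch: percent-encode everything outside a small safe set, plus a leading _ or ., and reject names that no longer fit. The choice of percent-encoding, the function names and the 255-byte constant are assumptions for illustration.

```python
from urllib.parse import quote, unquote

MAX_NAME_BYTES = 255  # common per-name limit, assumed here


def encode_name(name: str) -> str:
    """Encode a page name so it is safe to use as a file name."""
    encoded = quote(name, safe="")
    # quote() leaves "_" and "." untouched, so a reserved leading
    # character has to be encoded by hand.
    if encoded[:1] in ("_", "."):
        encoded = f"%{ord(encoded[0]):02X}" + encoded[1:]
    if len(encoded.encode("utf-8")) > MAX_NAME_BYTES:
        raise ValueError("encoded name is too long for a file name")
    return encoded


def decode_name(encoded: str) -> str:
    """Reverse encode_name()."""
    return unquote(encoded)
```

With this scheme a page named _template is stored as %5Ftemplate, while each non-ASCII character such as é takes six bytes (%C3%A9), which shows how quickly encoding eats into the 255 limit.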

On Windows, there is a limit of 260 characters in total (!) for the path. NTFS and exFAT support 32k characters, though, and apparently it’s only a matter of using the right APIs. On Linux, no such limits seem to exist.

In general, if you want to have human-readable file names, I would suggest still using the actual page and space names in the path, with some cleaning (don’t allow ..!) and some encoding. While using hashes as we do for attachments increases the compatibility, it makes the plain storage inaccessible for humans.
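
The cleaning step could be as simple as the sketch below (hypothetical function name, /-separated reference assumed): split the reference, drop empty segments and refuse anything that would walk out of the storage root.

```python
def clean_segments(page_ref: str) -> list:
    """Split a page reference into path segments, rejecting "." and ".."."""
    segments = [segment for segment in page_ref.split("/") if segment]
    for segment in segments:
        if segment in (".", ".."):
            raise ValueError(f"forbidden path segment: {segment!r}")
    return segments
```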

I have another note regarding the “Title” metadata you show: I wonder if it wouldn’t be more semantic to store the title as the first H1 heading as for accessibility/semantic reasons, the page title should be the only H1 and the page content itself should only use H2 and below.
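
Deriving the title from the content would then be straightforward, as in this sketch. It only handles ATX headings ("# Title") and would be fooled by a # line inside a code block; the function name is an assumption.

```python
import re

# A line starting with exactly one "#" followed by whitespace.
H1_PATTERN = re.compile(r"^#[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def page_title(markdown: str):
    """Return the text of the first H1 heading, or None if there is none."""
    match = H1_PATTERN.search(markdown)
    return match.group(1) if match else None
```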

This is cool but seems specific to Linux, and the locations are very different on Windows. Just a reminder that you also need a directory structure for Windows.

Thx

I did not discuss this as it’s outside the scope of this proposal.
But, yes, we need to move toward storing the directory in a standard place by default.
Though, another plan is to simply let users choose where they want to store their data, as Obsidian is doing for instance.

There are equivalents to these directories on Windows and macOS, and there is even a Node.js library that provides platform-independent access to them. Apparently Electron also provides an API to get at least the configuration directory.

This actually sounds best in my opinion, at least if the data is human-readable without Cristal.

I’m not a fan of storing the documents in .local, as from my point of view this is the place to store configurations.
For me, the Cristal documents would need to be stored in the Documents area (wherever this is depending on the system).
Also, users should be able to decide where the documents are if they want a different sub-directory or even a completely different directory.

For the metadata we could also take a hybrid approach: store some metadata using the embedded markdown format (title, creation date, author) and store more complex metadata separately.

I would not find it good for very basic wiki files to require a metadata file. This should be limited to more advanced documents.

Thanks for the exhaustive explanations and experience feedback.

There is something I don’t understand regarding attachment management in DokuWiki. I’m asking to make sure I’m not missing something important for Cristal.
It is only possible to upload files to existing namespaces. But if the namespace ceases to exist (i.e., all the pages of the namespace are removed), the attachments stay there and can be used in other pages without trouble.
I understand why it is convenient to have attachments living in a separate space, but the absence of a mechanism to define the namespace when uploading an attachment, or of a way to move attachments, is surprising (though it is very possibly me being unfamiliar with the UI of DokuWiki).

Overall I like the idea of having dedicated root directories, but I think it makes it more difficult to understand what is where when using a file explorer or a standard code editor (i.e., without a Cristal-specific editor extension).
As @ludovic said, I agree there is probably a single general solution here.
What we can do is:

  1. “basic” (to be defined) metadata are stored directly in the document
  2. more advanced metadata are stored separately

This is a good point. I need to check if a portable library exists for Electron. For Nextcloud, I’m hoping for the service to handle such issues for us.

This could be configurable. I think we should hide the notion of terminal pages by default, and only allow it when it is required to support backends exposing this concept (e.g., XWiki or DokuWiki).

I don’t have a strong preference. But, for me, using hidden files is also good for hiding complexity from users browsing files from the filesystem. Do you know if this was discussed for DokuWiki?

+1 here too. I’ll need to check if an existing library addresses this issue. Otherwise, we’ll need to think of a solution, as I believe it is important to keep human-readable names as much as possible.
We could maybe fall back to compressed names once the path becomes longer than what is supported by the current system.
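
One way such a fallback could work, as a sketch only: keep the name untouched when it fits, and otherwise keep a readable prefix followed by a short hash of the full name. The function name, the ~ separator, the 16-character digest and the per-name (rather than whole-path) limit are all assumptions.

```python
import hashlib


def fit_name(name: str, max_bytes: int = 255) -> str:
    """Return name unchanged if it fits, else a readable prefix plus a hash."""
    raw = name.encode("utf-8")
    if len(raw) <= max_bytes:
        return name
    digest = hashlib.sha256(raw).hexdigest()[:16]
    keep = max_bytes - len(digest) - 1  # one byte for the "~" separator
    # errors="ignore" drops a multi-byte character cut in half by the slice.
    prefix = raw[:keep].decode("utf-8", errors="ignore")
    return f"{prefix}~{digest}"
```

The shortened name cannot be turned back into the original, so the full name would have to be stored somewhere else, for instance in the page metadata.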

Good point, for another discussion I believe. I need to think more about it to make sure this is not going to create issues for future, more advanced use cases.
Though, from a usability and accessibility point of view this definitely makes sense.

Thanks a lot for all the answers. I’ll try to synthesize all this in a design page.