Headed text format

Contents:

  Why headed text files
  Structure of the file
  Character encodings
  Content formats

Why headed text files

Text files with a clearly separated header are used by Thalassa CMS as a source for the following types of content:

  pages that belong to page sets;
  comments received from site visitors.

Every file of this type serves as a source for exactly one page or exactly one comment. In contrast with ini files, a headed file can have its own format and may specify its own character encoding. Macroprocessing is NOT performed on the content of these files, and sometimes this may be critically important.

The idea behind these files is that they are intended to be created programmatically rather than written manually, although it is easy enough to write them by hand as well. In particular, this documentation was prepared as a set of headed text files, written in a text editor (what editor? well, vim) and processed as a page set. On the other hand, comments are received by the Thalassa CGI program (thalcgi.cgi) and stored as headed text files, so this is an obvious case in which they are created by a program.

Furthermore, Thalassa CMS was initially written with the task of saving a particular site in mind. That site was created in 2009 with Drupal (version 5.*, which was declared end-of-life just a year later). After several years, the version of MySQL used for the site was declared unsupported as well, and the ancient version of Drupal refused to work with later versions of MySQL or MariaDB. Even newer versions of the PHP interpreter started to emit warnings when running that old Drupal. After the site had worked for more than a decade in a virtual box containing older versions of all the involved software, it became clear to the site owner that the only way to keep it running was to break out of this software slavery, built by people who don't care about backward compatibility.

This is how Thalassa was born. The site from this story is http://stolyarov.info; it is in Russian, so unfortunately most people reading this text will not be able to assess it for themselves, but at the moment of migration to Thalassa it had 200+ pages and 6000+ user comments, and giving up all that content would have been a catastrophe. What was actually done is that all pages and comments were fetched from the MySQL database and stored in files; all the blocks, menus and the like, together with the page layout and color scheme, were recreated manually.

This is another example of a set of headed text files created programmatically, this time including the source files for pages, not only for comments. And here's why Thalassa allows these files to be in a character encoding different from the one used for ini files (and for the generated HTML files, too): the author prefers koi8-r as THE Cyrillic encoding, while Drupal has, since ancient times, only been capable of utf8, and the author preferred to store the source information in files exactly as it was fetched from the database (in order not to lose anything accidentally).

The real objective of these files is to serve in place of yet another damn database table, holding both the content (that is, the text of a page or a comment, possibly not in its final form) and additional attributes such as title, date, tags and the like. And yes, they perform well. Actually, they work far better than the author expected.

Structure of the file

The format is much like the email message format well known since rfc822: headers come first, and every header consists of a name, a colon character, a space and a value. By placing a line that starts with whitespace right after a header line, one can continue that header, thus making it longer. The body is separated from the headers by an empty line (strictly speaking, a line either of zero length or consisting of whitespace characters only). The most notable difference from rfc822 is that header names are case-sensitive, and for Thalassa CMS they are usually written all-lowercase.

To get the idea, here's an example:

  id: testpg
  title: Test page
  unixdate: 1681890000
  encoding: us-ascii
  format: texbreaks

  This is an example of an HTML page created
  from a source stored as a headed text file.

  BTW, thanks to the format, this phrase will
  go into a separate paragraph.
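
As mentioned above, a header line can be continued by starting the next line with whitespace; for instance, a long title could be written as follows (the values here are made up for illustration):

  title: A test page with a somewhat longer title
   that did not fit on a single source line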

The header fields id, encoding and format are used internally and are not accessible directly with macros. Actually, the id (both for a page and for a comment) is accessible, but it must be set to the same value as the one determined by the file name (or, for set items represented by a directory, by the name of the directory). It is unspecified what happens if they don't match.

Actually, it is very possible nothing will happen at all. Please don't rely on that. In future versions, checks are likely to be added.

The encoding and format fields determine what filters must be applied to the body and to some of the header fields before the content is made available as macro values. Both can be omitted or left blank. If encoding is not specified explicitly, it is assumed to be the same as for the ini files (and for the site), so no recoding is needed. Character encodings are discussed later.

If format is not specified, it is assumed to be “verbatim”, so, again, no conversion is needed. It is notably inconvenient to write raw HTML manually; see the Content formats section below for information on the available formats.
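
As an illustration of the defaults, a page written directly in HTML and in the same encoding as the rest of the site needs neither field, although spelling out format: verbatim is still a good idea (the values below are made up):

  id: about
  title: About this site
  format: verbatim

  <p>The body is already HTML, so it is
  copied into the page without conversion.</p>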

Some of the fields are recognized by Thalassa, but their values are still available via macros, directly or indirectly. All information from unrecognized fields is extracted as well and is available directly under the names of these fields (via the %[li: ] macro for pages and the %[cmt: ] macro for comments), so the user can add any fields as needed and use their values for site generation.
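
For example, one could add a field of one's own to a page source and refer to it from the page templates; assuming the field name is what goes after the colon in the macro (the author field and its value below are made up):

  id: testpg
  title: Test page
  author: J. Random Hacker

With such a source, something like %[li:author] should expand to “J. Random Hacker” wherever it is used during generation of that page.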

Some of the recognized fields are not passed through filters at all (e.g., the unixdate: field, as it is just a number); some are passed through the same set of filters as the body (as determined by format: and encoding:; actually, only the descr: field for pages is converted this way in the present version); and for the rest of the fields, including any unrecognized fields, only the encoding conversion is applied as appropriate, but no format conversion is done.

The exact lists of recognized fields and their roles differ for pages and comments, so they are discussed along with page sets and comment sections, respectively.

Character encodings

Thalassa mostly acts in an encoding-agnostic manner. In particular, it assumes that ini files are written in the same encoding in which the site pages should be generated, so it never performs any encoding transformation on content that comes from ini files.

Unlike that, headed text files are intended to be supplied by other programs, and the author of Thalassa had at least one task at hand for which encoding conversion was desired. Hence Thalassa supports encoding conversion for the content of headed text files. To activate this feature, the following conditions must be met:

  the headed text file explicitly specifies its encoding in the encoding: header field;
  the specified encoding is one of the encodings supported by Thalassa (see below);
  the specified encoding differs from the encoding used for the ini files and the generated pages.

If any item in this list is not met, Thalassa will silently decide to do no encoding conversion.

Currently, only the following encodings are supported: utf8, ascii, koi8-r, cp1251. Thalassa recognizes the following synonyms for the supported encodings: utf-8, us-ascii, koi8r, koi8, 1251, win1251, win-1251, windows-1251. Names are case-insensitive.
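
For example, a comment received from the web and stored in cp1251, while the site itself is generated in utf8, could start like this (the field values are made up; only the encoding: field matters here):

  encoding: cp1251
  format: breaks,tags

  (the comment text, in cp1251, follows here)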

It is relatively easy to add more encodings; this involves making tables to convert from that encoding to unicode and back, and adding them to the source code of the stfilter library used by Thalassa. If you need support for an encoding not listed above, please contact the author.

When a character encountered in the source being converted is not representable in the target encoding, Thalassa replaces it with the appropriate HTML4 entity, such as &laquo; or &mdash;, if there is such an entity for that particular character in HTML4 (not HTML5). If no appropriate entity exists, the character is replaced with an HTML hexadecimal unicode entity, like &#x42B;.

Content formats

Thalassa is designed to pass the body of a headed text file through filters that somehow adjust the markup, that is, convert the content from one format to another. The format of the body (and possibly of some header fields) is indicated by the format: header field; the value of this field is expected to be a comma-separated list of tokens designating certain aspects of the format or, strictly speaking, of the desired conversion.

The default format is verbatim, which means the body is already in HTML and needs no conversion at all. Likewise, no conversion will be performed if the format: header is missing, empty, or contains no recognized tokens; it is, however, strongly recommended to specify the verbatim format explicitly, as the defaults may change in the future.

The recognized format tokens are verbatim, breaks, texbreaks, tags and web.

The web token was initially intended to indicate that URLs and email addresses encountered in the text should be automatically turned into clickable web links. At present this conversion is not implemented, and the token is silently ignored by Thalassa.

The tags token means that HTML tags must be stripped from the text being converted, with the exception of tags explicitly allowed for user-supplied content. The list of allowed tags is set by the tags parameter in the [format] configuration section.
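
A minimal sketch of such a configuration, assuming the usual “name = value” ini syntax and a made-up list of allowed tags, might look like this:

  [format]
  tags = b i u tt a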

Both breaks and texbreaks turn on conversion of newline characters into paragraph breaks, but in different manners. The filter enabled by texbreaks works in TeX/LaTeX style: it takes empty lines (or, strictly speaking, lines that contain nothing but possibly whitespace) as paragraph breaks and wraps continuous text fragments (those not containing empty lines) in <p> and </p>. This mode is suitable if you use a text editor that wraps long lines, such as vim; in particular, all source files of this documentation use the texbreaks token.

The breaks token does essentially the same, but in addition it replaces every lonely newline character (that is, a newline character which is not followed by an empty line) with <br />. This mode is inconvenient for files prepared manually in a text editor, but it is useful for texts received through web forms, such as user comments.
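
To get a rough idea of the difference, consider a body consisting of two short lines followed by an empty line and another paragraph (the exact whitespace in the output may differ):

  First line,
  second line of the same paragraph.

  Another paragraph.

With texbreaks this becomes, roughly:

  <p>First line,
  second line of the same paragraph.</p>
  <p>Another paragraph.</p>

With breaks, the lonely newline after “First line,” additionally turns into a forced line break:

  <p>First line,<br />
  second line of the same paragraph.</p>
  <p>Another paragraph.</p>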

In both modes the filter is aware of some HTML/XHTML basics (but at a very basic level): it knows that some tags are not suitable inside paragraphs and that the paragraph conversion shouldn't be done within some tags. However, the filter is not a real (X)HTML parser and has no idea about all these DTD schemes. It simply uses a list of “special” tags: pre, ul, ol, table, p, blockquote, h1, h2, h3, h4, h5, h6. When any of these tags starts, the filter closes the current paragraph, if one is open, and doesn't start any new paragraphs until the tag closes. The list is hard-coded. It is very possible this will change in future versions, as all this is, well, just a hack which solves certain problems that arose here and now.
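
For instance, in texbreaks mode a body containing a list should presumably be handled along these lines (again, the exact output is not guaranteed):

  Some introductory text.

  <ul>
  <li>first item</li>
  <li>second item</li>
  </ul>

  Text after the list.

turning into, roughly:

  <p>Some introductory text.</p>
  <ul>
  <li>first item</li>
  <li>second item</li>
  </ul>
  <p>Text after the list.</p>

that is, the paragraph before the list is closed and no <p> tags are inserted inside the <ul> element.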

Anyway, tags that are in any way “special” with respect to paragraphs and are not on the list above should probably not be allowed for user-supplied content. As for content created by the site's owner or administrator, the result of the conversion can always be checked.

© Andrey V. Stolyarov, 2023, 2024